A SIMPLE BLOG
JUNE 14, 2025 | RUST, MEMORY, OPTIMIZATION
I created my own Box-like type that avoids an allocation by storing the value in-place if small enough.
I have seen plenty of crates with short-string optimizations, but I hadn't encountered the same for the general
case. That is likely because the features needed to support it are unstable and require the nightly
compiler, but I wanted to see what it would take.
In case you have no idea what I'm talking about, a Box in Rust points to a
separate memory allocation to hold a value and provides ownership semantics for it, i.e. it will clean up after
itself when it goes out of scope by dropping the value and freeing the allocated memory. The act of acquiring and
releasing that memory can have a non-negligible cost, and thus boxing a value is generally avoided unless necessary.
A pattern that often necessitates a Box is the unification of trait objects. Polymorphism can be
achieved in Rust by implementing traits and using dynamic dispatch via dyn Trait. This is a type that
obscures the concrete type behind it, leaving only the methods accessible through the trait, but because it can
represent any implementation, and implementations can be of varying sizes, it is an unsized type. Unsized
types are difficult to work with and the go-to wrapper is Box<dyn Trait>, since the box
will have a statically known size even if the value does not.
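To make the "statically known size" point concrete, here is a small sketch on stable Rust. The Shape trait and its two implementors are made-up names for illustration only; they are not part of ssobox.

```rust
use std::mem::size_of;

// Hypothetical trait and types, used only for illustration.
trait Shape {
    fn area(&self) -> f64;
}

struct Circle {
    radius: f64,
}

struct Rect {
    width: f64,
    height: f64,
}

impl Shape for Circle {
    fn area(&self) -> f64 {
        std::f64::consts::PI * self.radius * self.radius
    }
}

impl Shape for Rect {
    fn area(&self) -> f64 {
        self.width * self.height
    }
}

// The concrete types differ in size, but every Box<dyn Shape> is the same
// size (data pointer + vtable pointer), so they can share one Vec.
fn shapes() -> Vec<Box<dyn Shape>> {
    vec![
        Box::new(Circle { radius: 1.0 }),
        Box::new(Rect { width: 2.0, height: 3.0 }),
    ]
}

fn main() {
    for shape in shapes() {
        println!("area = {}", shape.area());
    }
    println!("Circle:         {} bytes", size_of::<Circle>());
    println!("Rect:           {} bytes", size_of::<Rect>());
    println!("Box<dyn Shape>: {} bytes", size_of::<Box<dyn Shape>>());
}
```

Each element of the Vec costs one heap allocation here, which is exactly the cost the rest of this post tries to avoid for small values.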
So boxing is often necessary for trait objects, but what if we could avoid the allocation cost? The Box
needs to hold a pointer to the data, but what if the value is smaller than the pointer pointing to it? We could just
store the value where we'd keep that pointer instead and skip the allocator. Well, that's the idea behind the
small-size optimization.
Quick side note: the standard Box already does a form of this optimization, but only for zero-sized
types. Obviously no memory needs to be allocated for a type that has no size, so the creation of the
Box will short-circuit and just set the pointer to a dummy address.
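A quick way to observe this on stable Rust is to keep two boxed unit values alive at once and print where they point. This is an observation about current standard library behavior rather than a documented guarantee of a specific address; unit_box_addr is a helper name invented for this sketch.

```rust
use std::mem::size_of_val;

// Returns the address a freshly created Box<()> points at, as an integer.
fn unit_box_addr() -> usize {
    let boxed: Box<()> = Box::new(());
    &*boxed as *const () as usize
}

fn main() {
    // Both boxes are alive at the same time. If each one owned a real
    // allocation, the two addresses would have to differ.
    let first = Box::new(());
    let second = Box::new(());
    println!("size of boxed value: {}", size_of_val(&*first));
    println!("first:  {:p}", &*first);
    println!("second: {:p}", &*second);
    println!("helper: {:#x}", unit_box_addr());
}
```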
The ssobox crate is published so you can try it for yourself:
use std::fmt::Debug;
use ssobox::SsoBox;
let debuggables: [SsoBox<dyn Debug>; 5] = [
    SsoBox::new_unsized(()),
    SsoBox::new_unsized(1.0),
    SsoBox::new_unsized([42.0, 99.9]),
    SsoBox::new_unsized("test test test"),
    SsoBox::new_unsized(vec![1, 2, 3, 4]),
];
// Print whether each value is stored in-place (T) or on the heap (F),
// followed by the address of the value and its Debug output.
for (idx, item) in debuggables.iter().enumerate() {
    let inhabits = if SsoBox::inhabited(&item) { "T" } else { "F" };
    println!(
        "{idx} {inhabits} {:018p} - {:?}",
        item.as_ref() as *const dyn Debug as *const (),
        item.as_ref(),
    );
}
0 T 0x000000556a9fef48 - ()
1 T 0x000000556a9fef60 - 1.0
2 T 0x000000556a9fef78 - [42.0, 99.9]
3 T 0x000000556a9fef90 - "test test test"
4 F 0x000001a6cc0d5350 - [1, 2, 3, 4]
In this simple demonstration, only the last value holding a Vec is allocated. All the other values are
small enough to be stored in-place.
Let's start with the type definition as our foundation:
use std::mem::MaybeUninit;
use std::ptr::Pointee; // unstable: needs the ptr_metadata feature
pub struct SsoBox<T: ?Sized> {
    meta: <T as Pointee>::Metadata,
    data: SsoBoxData,
}
union SsoBoxData {
    ptr: *const (),
    buf: MaybeUninit<[*const (); 2]>,
}
First, yes this is a rare instance of union in Rust. It allows us to store either a pointer or a buffer
in which to store other values. We just have to keep track ourselves of which one is being used. In the current
implementation, I opted to make the buffer larger than a single pointer. This makes an SsoBox larger
than the equivalent Box, but it ends up the same size as a String or Vec.
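The size claim can be checked on stable Rust with a stand-in type. In this sketch a plain usize takes the place of the trait-object metadata (the real struct uses <T as Pointee>::Metadata, which is unstable), and the names Data and DynSizedSketch are invented for the example.

```rust
use std::mem::{size_of, MaybeUninit};

// Stand-in for the real union: either a heap pointer or two words of
// in-place storage. Only the size matters for this sketch.
#[allow(dead_code)]
union Data {
    ptr: *const (),
    buf: MaybeUninit<[*const (); 2]>,
}

// Assumption: one extra word stands in for the metadata of a trait object
// (a vtable pointer).
#[allow(dead_code)]
struct DynSizedSketch {
    meta: usize,
    data: Data,
}

fn main() {
    println!("Data:           {} bytes", size_of::<Data>());
    println!("DynSizedSketch: {} bytes", size_of::<DynSizedSketch>());
    println!("String:         {} bytes", size_of::<String>());
    println!("Vec<u8>:        {} bytes", size_of::<Vec<u8>>());
    println!("Box<dyn Debug>: {} bytes", size_of::<Box<dyn std::fmt::Debug>>());
}
```

For a trait object that comes to three machine words: one for the metadata and two for the union, versus two words for the equivalent Box.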
Next, let's explain the <T as Pointee>::Metadata part. Pointee is a trait that the compiler implements
automatically for every type, and it describes how a pointer to that type is structured. For the uninitiated, all kinds
of pointers may be "fat" if the type is unsized. You can imagine that
all pointer types have some Pointee::Metadata attached to them, but for the case of
Sized types, this is just (), a.k.a. nothing. For slices, the pointer stores the length
of the slice as its metadata. And for trait objects, the pointer stores a vtable used to dispatch calls for the
concrete type. Storing the <T as Pointee>::Metadata on its own allows us to essentially separate
the raw data pointer from its metadata so that we can be flexible in how we reconstruct it.
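The metadata itself can only be named on nightly, but its effect on pointer size is visible on stable. A minimal sketch:

```rust
use std::fmt::Debug;
use std::mem::size_of;

// One machine word on the current target.
const WORD: usize = size_of::<usize>();

fn main() {
    // Sized pointee: the metadata is (), so the pointer is a single word.
    println!("&u64       = {} bytes", size_of::<&u64>());
    // Slice pointee: data pointer plus a length.
    println!("&[u64]     = {} bytes", size_of::<&[u64]>());
    // str is a byte slice underneath, so it has the same shape.
    println!("&str       = {} bytes", size_of::<&str>());
    // Trait object: data pointer plus a vtable pointer.
    println!("&dyn Debug = {} bytes", size_of::<&dyn Debug>());
    println!("word       = {} bytes", WORD);
}
```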
A question you may be asking: How do you determine whether the value is stored in-place or via an allocation? And how do you determine that before knowing where the data pointer should point?
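use std::alloc::Layout;
use std::ptr::Pointee;
unsafe fn inhabitable<T: ?Sized>(meta: <T as Pointee>::Metadata) -> bool {
    // Layout of the value, derived from the metadata and a placeholder data pointer.
    let value_layout = Layout::for_value_raw(std::ptr::from_raw_parts(&(), meta));
    // Layout of the in-place buffer.
    let buffer_layout = Layout::new::<[*const (); 2]>();
    value_layout.size() <= buffer_layout.size() && value_layout.align() <= buffer_layout.align()
}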
The final check is self-explanatory, but you may notice some trickery in the way I've calculated the value's
layout. The metadata should be reflective of the value being stored, but the data pointer I've simply manifested
from &(). I can do this because the safety requirements for for_value_raw
make no mention that the data pointer needs to be valid. And conceptually, it shouldn't need to be: the
size of a trait object is available through the vtable pointer, not the value itself, and the size of a slice is
calculated from the length (i.e. the metadata) and the statically known size of the elements.
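For readers without a nightly toolchain, the same size-and-alignment comparison can be approximated on stable Rust. The catch is that the stable Layout::for_value needs a live reference to the value, whereas the check above works from the metadata alone. The names buffer_layout and fits_inline are invented for this sketch.

```rust
use std::alloc::Layout;
use std::fmt::Debug;

// The in-place buffer from the type definition: two pointer-sized words.
fn buffer_layout() -> Layout {
    Layout::new::<[*const (); 2]>()
}

// Stable-Rust approximation of the check. For unsized values the layout is
// still read from the pointer metadata (slice length or vtable).
fn fits_inline<T: ?Sized>(value: &T) -> bool {
    let value_layout = Layout::for_value(value);
    let buffer = buffer_layout();
    value_layout.size() <= buffer.size() && value_layout.align() <= buffer.align()
}

fn main() {
    let as_slice: &[u8] = &[1, 2, 3];
    let as_dyn: &dyn Debug = &1.0f64;
    println!("()         -> {}", fits_inline(&()));
    println!("&str       -> {}", fits_inline(&"test test test"));
    println!("[u8] of 3  -> {}", fits_inline(as_slice));
    println!("dyn Debug  -> {}", fits_inline(as_dyn));
    println!("Vec<i32>   -> {}", fits_inline(&vec![1, 2, 3, 4]));
}
```

Note that the Vec case asks about the Vec handle itself (three words), which is why it does not fit in a two-word buffer.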
You can view the full implementation in the repository.
This doesn't come for free. The SSO box now has a condition to check whenever an unsized value is accessed from
the box. With a standard Box, its representation is equivalent to that of a reference, so deref-ing it
is essentially a no-op. Whereas with SSO box, some deduction is needed to determine whether the value was stored
in-place or on the heap:
#[inline(never)]
pub fn deref_trait_demo(sso: &SsoBox<dyn Display>) -> &dyn Display {
    &**sso
}
deref_trait_demo:
    mov rdx, qword ptr [rcx]
    mov r8, qword ptr [rcx + 8]
    lea rax, [rcx + 8]
    cmp qword ptr [rdx + 16], 9 # check alignment
    cmovae rax, r8
    cmp qword ptr [rdx + 8], 17 # check size
    cmovae rax, r8
    ret
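Read back into Rust, the listing amounts to a two-way select: start with the address of the in-place buffer, and switch to the stored heap pointer if the alignment is 9 or more, or the size is 17 or more. The function below is a hypothetical model of that decision, not the crate's actual code; select_data_ptr and its parameters are names made up for the sketch.

```rust
use std::alloc::Layout;

// Hypothetical model of the choice made on every deref: given the layout
// recorded for the value, pick either the address of the in-place buffer
// or the stored heap pointer.
fn select_data_ptr(value: Layout, inline_buf: *const u8, heap_ptr: *const u8) -> *const u8 {
    let buffer = Layout::new::<[*const (); 2]>();
    if value.size() <= buffer.size() && value.align() <= buffer.align() {
        inline_buf
    } else {
        heap_ptr
    }
}

fn main() {
    let buf = [0u8; 16];
    let heap: Vec<u8> = vec![0; 64];
    let small = select_data_ptr(Layout::new::<u32>(), buf.as_ptr(), heap.as_ptr());
    let large = select_data_ptr(Layout::new::<[u8; 64]>(), buf.as_ptr(), heap.as_ptr());
    println!("small value resolves to the buffer: {}", small == buf.as_ptr());
    println!("large value resolves to the heap:   {}", large == heap.as_ptr());
}
```

The compiled version in the listing uses conditional moves instead of a branch, but the decision is the same.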
I have some micro-benchmarks to try to quantify the costs. Let's start with access. This shows the time taken to call a trait method that returns a constant based on its size:
case   | Box       | SsoBox    | diff
empty  | 1.2634 ns | 1.2727 ns | +0.74%
small  | 1.2624 ns | 1.2704 ns | +0.63%
large  | 1.2651 ns | 1.2730 ns | +0.62%
varied | 1.3725 ns | 1.8443 ns | +34.4%
I'm unsure exactly why the difference is nearly non-existent in the fixed-size benchmarks. I guess it's from
the CPU being clever with multiple iterations of the same thing, so the "varied" benchmark uses differently
sized trait objects in rapid succession to try to hit the worst-case scenario. So in the worst case, accessing an
SsoBox takes about half a nanosecond extra, or about 2 clock cycles on my system. Not bad.
But where the small-size optimization shines is in skipping memory allocations, so let's look at the time to create
and drop an SsoBox of varying sizes:
case   | Box       | SsoBox    | diff
empty  | 0.8503 ns | 0.6374 ns | -25.0%
small  | 30.505 ns | 1.2690 ns | -95.8%
large  | 31.867 ns | 31.010 ns | -2.69%
The "small" case, where the value would normally be allocated, is night and day! This is using the default Windows allocator, and different allocators may have different costs, but it's clear that it cuts that cost completely out of the equation.
I'm not 100% sure how the "empty" case (a.k.a. a zero-sized value) is faster than the standard
Box. My guess is that SsoBox leaves the rest of the box uninitialized while a
Box has to write a dummy pointer. Maybe that saves a clock cycle.
Micro-benchmarks are easy to misinterpret, so take the exact numbers with a grain of salt. The benchmarks are available in the repository so you can inspect them or run them for yourself.
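For a rough feel of the allocation cost without the repository at hand, a crude timing loop on stable Rust is enough. To be clear, this is not the benchmark behind the tables above: time_per_iter is a throwaway helper invented here, it uses a stack value as a stand-in for in-place storage, and it does no warm-up or statistics, so its numbers are only indicative.

```rust
use std::hint::black_box;
use std::time::Instant;

// Average nanoseconds per call of `f`. Deliberately crude: no warm-up,
// no outlier handling, no statistics.
fn time_per_iter<F: FnMut()>(iterations: u32, mut f: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iterations {
        f();
    }
    start.elapsed().as_nanos() as f64 / f64::from(iterations)
}

fn main() {
    let iterations = 1_000_000;

    // Create and drop a boxed value; black_box discourages the optimizer
    // from removing the allocation.
    let boxed = time_per_iter(iterations, || {
        let b = Box::new(black_box(42u64));
        black_box(&b);
    });

    // Same value kept on the stack, as a rough stand-in for in-place storage.
    let inline = time_per_iter(iterations, || {
        let v = black_box(42u64);
        black_box(&v);
    });

    println!("boxed:  {:.3} ns/iter", boxed);
    println!("inline: {:.3} ns/iter", inline);
}
```

Build with optimizations (for example a release build) before reading anything into the output.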
A small ergonomic annoyance with SsoBox is you can't create one with an unsized type quite like you
can with a Box. If you wanted a Box<dyn Trait>, you would first create a
Box with a concrete value and then cast it like this:
Box::new(MyStruct::new()) as Box<dyn Trait>
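As a complete example of that coercion with the standard Box (Greeting is a hypothetical type made up for this sketch), both the explicit cast and a plain type annotation work:

```rust
use std::fmt::Display;

// Hypothetical concrete type used only for this example.
struct Greeting(&'static str);

impl Display for Greeting {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "hello, {}", self.0)
    }
}

// Explicit cast from Box<Greeting> to Box<dyn Display>.
fn boxed_with_cast() -> Box<dyn Display> {
    Box::new(Greeting("cast")) as Box<dyn Display>
}

// Implicit coercion: the annotated type drives the unsizing.
fn boxed_with_annotation() -> Box<dyn Display> {
    let boxed: Box<dyn Display> = Box::new(Greeting("annotation"));
    boxed
}

fn main() {
    println!("{}", boxed_with_cast());
    println!("{}", boxed_with_annotation());
}
```

Both forms rely on the compiler's built-in unsizing support for Box, which is the part a user-defined type cannot opt into on stable.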
There is an unstable CoerceUnsized trait that can be used to provide unsized coercions for
user-defined types. However, it requires a "foundational" type that already implements
CoerceUnsized for it to use, and <T as Pointee>::Metadata is not one of them. It
would be nice if the compiler realized that T::Metadata can be coerced into U::Metadata if
T: Unsize<U>, but that may be a stretch.
So to aid in ergonomics, SsoBox has a new_unsized method to go directly from a concrete
value to an unsized box. There is also an into_unsized method to do the same if the concrete
SsoBox was already created.
It is a shame SsoBox can't be a drop-in replacement due to this.
Unfortunately, even though it is intended for use with trait objects, an SsoBox isn't that helpful for
dyn Future because of Pin. Pin is a wrapper around pointer types used to
provide guarantees that the value pointed to doesn't move. Polling a future requires it to be pinned, but
Pin<SsoBox<dyn Future>> cannot guarantee that the value doesn't move: moving
an SsoBox would mean moving the value if it were stored in-place.
However, Pin<SsoBox<dyn Future + Unpin>> is a workable pattern. If a future is
Unpin then it can be stored directly, but if it is not then boxing the future will ensure it stays
pinned, and that box can be stored in-place in the SsoBox. Boxes within boxes. So it could still be
beneficial if a portion of your tasks are small and Unpin, but in my experience that only tends to be
the case with hand-rolled futures and is unlikely with async. So your mileage may vary.
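The "boxes within boxes" step can be shown with the standard library alone. In the sketch below, NotUnpin is a made-up hand-rolled future that opts out of Unpin, and poll_once and NoopWake are helper names invented for the example. Pinning the future in a Box yields a Pin<Box<..>>, which is itself a future and is Unpin, so it is the kind of value that could then be stored in-place.

```rust
use std::future::Future;
use std::marker::PhantomPinned;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// A waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Hypothetical future that is !Unpin because of the PhantomPinned marker.
struct NotUnpin {
    value: u32,
    _pin: PhantomPinned,
}

impl Future for NotUnpin {
    type Output = u32;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u32> {
        Poll::Ready(self.value)
    }
}

// Polls any Unpin future once. Pin::new is only usable because of the
// Unpin bound.
fn poll_once<F: Future + Unpin>(fut: &mut F) -> Poll<F::Output> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    Pin::new(fut).poll(&mut cx)
}

// Boxing and pinning the !Unpin future gives a pointer-sized handle that
// is Unpin and can be moved or stored freely.
fn boxed_answer() -> Pin<Box<dyn Future<Output = u32>>> {
    Box::pin(NotUnpin { value: 42, _pin: PhantomPinned })
}

fn main() {
    let mut fut = boxed_answer();
    match poll_once(&mut fut) {
        Poll::Ready(value) => println!("ready: {}", value),
        Poll::Pending => println!("pending"),
    }
}
```

Passing a bare NotUnpin to poll_once would not compile, which is the same restriction the + Unpin bound expresses for SsoBox.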
Crates like smallvec and compact-str, which can store slices in-place and dynamically transition to an allocation if needed, have been around for a long time, but the layouts of slices are simple and stable. Handling unsized types in general introduces totally different concerns.
In my research, I did come across the static-box crate, which on its
face sounds like it does something similar, but it is largely just std::ptr::write that also writes the
metadata for trait objects. There is plenty of overlap, since the fundamental problem is around pointer metadata.
You can also find discussions of people thinking to use allocators directly in a Box<_, A> to
provide a similar effect. However, by the wording of Allocator that would be unsafe, and I'm not
sure that would be possible with Box's current design around pointer stability. It would need a big
retrofit to make that work.
What I have implemented is workable but I think some additions would make sense.
I will likely incorporate custom sizes for the internal buffer. Holding two machine words seemed like a happy medium
in my opinion. But it's easy to see that some may desire a single machine word so that it matches the existing size
of a Box, or a much larger buffer based on their expected values if the bloated size of the box
is an acceptable compromise.
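One way such a knob could look is a const generic for the number of buffer words. This is only a sketch of the idea on stable Rust: the WORDS parameter and the fits_in_words function are assumptions for illustration and not something the crate currently exposes.

```rust
use std::alloc::Layout;

// Hypothetical generalization of the fit check: WORDS is the number of
// pointer-sized words in the in-place buffer.
fn fits_in_words<T, const WORDS: usize>() -> bool {
    let value = Layout::new::<T>();
    let buffer = Layout::new::<[*const (); WORDS]>();
    value.size() <= buffer.size() && value.align() <= buffer.align()
}

fn main() {
    // A Vec handle is three words: too big for one or two words of
    // buffer, but it fits once the buffer is widened.
    println!("Vec<u8> in 1 word:  {}", fits_in_words::<Vec<u8>, 1>());
    println!("Vec<u8> in 2 words: {}", fits_in_words::<Vec<u8>, 2>());
    println!("Vec<u8> in 4 words: {}", fits_in_words::<Vec<u8>, 4>());
    println!("usize in 1 word:    {}", fits_in_words::<usize, 1>());
}
```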
It might also make sense to incorporate custom allocator support, which is also unstable. Environments with custom allocators may have more desire for minimizing costs, but custom allocators can also be much cheaper than general allocators, so the gains might not be so large. I'm not sure.
I provided many of the trait implementations that the standard library provides for its Box, but I skipped
many of them that were solely for str or slices. PRs would be welcome if they reflect what the
standard library provides.
I'm pretty happy with the implementation. It mostly works just like a normal Box, just with a new
shortcut that saves on performance without costing much. Miri has been used throughout the development process to
ensure no obvious undefined behavior is being relied on.
I do wish it were on stable though. Here are the unstable features that are used:
layout_for_ptr - this enables the inhabitance check to get layout information using only a pointer
ptr_metadata - this is what allows you to split fat pointers apart and access their metadata
unsize - this provides constraints for generically performing unsizing coercions
I hope they get stabilized eventually.
Anyways, thanks for reading! Hope you found it insightful or at least interesting. I wish you all well on your own
unsafe adventures.