A Simple Small-size Optimized Box

JUNE 14, 2025 | RUST, MEMORY, OPTIMIZATION

I created my own Box-like type that avoids an allocation by storing the value in-place if it is small enough. I have seen plenty of crates with short-string optimizations, but I hadn't encountered the same for the general case. That is likely because the features needed to support it are unstable and require the nightly compiler, but I wanted to see what it would take.

Background

In case you have no idea what I'm talking about, a Box in Rust points to a separate memory allocation to hold a value and provides ownership semantics for it - i.e. it will clean up after itself when it goes out of scope by dropping the value and freeing the allocated memory. The act of acquiring and releasing that memory can have a non-negligible cost, and thus boxing a value is generally avoided unless necessary.

A pattern that often necessitates a Box is the unification of trait objects. Polymorphism can be achieved in Rust by implementing traits and using dynamic dispatch via dyn Trait. This is a type that obscures the concrete type behind it, leaving only the methods accessible through the trait. Because it can represent any implementation, and implementations can be of varying sizes, it is an unsized type. Unsized types are difficult to work with, and the go-to wrapper is Box<dyn Trait> since the box will have a statically known size even if the value does not.
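
To make that last point concrete, here is a minimal sketch on stable Rust. The Shape trait and its two implementors are made up for this illustration; the point is only that values of different sizes end up behind one statically sized box type.

```rust
use std::mem::size_of;

// Hypothetical trait and implementors, invented for this illustration.
trait Shape {
    fn area(&self) -> f64;
}

struct Unit; // zero-sized
struct Rect {
    w: f64,
    h: f64,
} // two f64 fields

impl Shape for Unit {
    fn area(&self) -> f64 {
        1.0
    }
}

impl Shape for Rect {
    fn area(&self) -> f64 {
        self.w * self.h
    }
}

// Values of different sizes unified behind one statically sized box type.
fn shapes() -> Vec<Box<dyn Shape>> {
    vec![Box::new(Unit), Box::new(Rect { w: 2.0, h: 3.0 })]
}

// Dynamic dispatch: each call goes through the trait object's vtable.
fn total_area() -> f64 {
    shapes().iter().map(|s| s.area()).sum()
}

// Width of a boxed trait object in machine words (data pointer + vtable pointer).
fn box_dyn_words() -> usize {
    size_of::<Box<dyn Shape>>() / size_of::<usize>()
}
```

Every element of the Vec has the same width regardless of what it points to, which is exactly what makes the collection possible.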

So boxing is often necessary for trait objects, but what if we could avoid the allocation cost? The Box needs to hold a pointer to the data, but what if the value is smaller than the pointer pointing to it? We could just store the value where we'd keep that pointer instead and skip the allocator. Well that's the idea behind the small-size optimization.

Quick side note: the standard Box somewhat does this optimization already, but only for zero-sized types. Obviously no memory needs to be allocated for a type that has no size, so the creation of the Box will short-circuit and just set the pointer to a dummy address.

Demo

The ssobox crate is published so you can try it for yourself:

use std::fmt::Debug;

use ssobox::SsoBox;

let debuggables: [SsoBox<dyn Debug>; 5] = [
    SsoBox::new_unsized(()),
    SsoBox::new_unsized(1.0),
    SsoBox::new_unsized([42.0, 99.9]),
    SsoBox::new_unsized("test test test"),
    SsoBox::new_unsized(vec![1, 2, 3, 4]),
];

for (idx, item) in debuggables.iter().enumerate() {
    let inhabits = if SsoBox::inhabited(&item) { "T" } else { "F" };

    println!(
        "{idx} {inhabits} {:018p} - {:?}",
        item.as_ref() as *const dyn Debug as *const (),
        item.as_ref(),
    );
}

0 T 0x000000556a9fef48 - ()
1 T 0x000000556a9fef60 - 1.0
2 T 0x000000556a9fef78 - [42.0, 99.9]
3 T 0x000000556a9fef90 - "test test test"
4 F 0x000001a6cc0d5350 - [1, 2, 3, 4]

In this simple demonstration, only the last value holding a Vec is allocated. All the other values are small enough to be stored in-place.
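
Why the Vec is the odd one out can be sketched on stable Rust with a simplified check for Sized types only. The name fits_in_buffer is hypothetical, and the two-pointer buffer shape is taken from the description in the Implementation section; the crate's real check works on unsized values and is shown later.

```rust
use std::mem::{align_of, size_of};

// Assumed buffer shape: two pointer-sized words.
type Buffer = [*const (); 2];

// Simplified check for `Sized` types only; `fits_in_buffer` is a made-up name.
// A value can live in-place if it is no larger and no more aligned than the buffer.
fn fits_in_buffer<T>() -> bool {
    size_of::<T>() <= size_of::<Buffer>() && align_of::<T>() <= align_of::<Buffer>()
}
```

A &str is two words (pointer and length) and fits, while a Vec is three words (pointer, capacity, length) and does not. On a typical 64-bit target the f64 and [f64; 2] values from the demo also fit, at 8 and 16 bytes.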

Implementation

Let's start with the type definition as our foundation:

pub struct SsoBox<T: ?Sized> {
    meta: <T as Pointee>::Metadata,
    data: SsoBoxData,
}

union SsoBoxData {
    ptr: *const (),
    buf: MaybeUninit<[*const (); 2]>,
}

First, yes this is a rare instance of union in Rust. It allows us to store either a pointer or a buffer in which to store other values. We just have to keep track ourselves which one is being used. In the current implementation, I opted to make the buffer larger than a single pointer. This makes an SsoBox larger than the equivalent Box but it ends up the same size as a String or Vec.
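
The size claim can be checked with a stand-in that builds on stable. This is only a sketch: FakeSsoBox is a hypothetical name, and its meta field is a plain pointer standing in for <T as Pointee>::Metadata, which needs nightly.

```rust
use std::mem::{size_of, MaybeUninit};

// Stand-in mirroring the union above: either a heap pointer or an inline buffer.
#[allow(dead_code)]
union Data {
    ptr: *const (),
    buf: MaybeUninit<[*const (); 2]>,
}

// Hypothetical stand-in for `SsoBox<dyn Trait>`; `meta` is a plain pointer here.
#[allow(dead_code)]
struct FakeSsoBox {
    meta: *const (),
    data: Data,
}

// Sizes of the union, the stand-in box, and `String`, in machine words.
fn sizes_in_words() -> (usize, usize, usize) {
    let word = size_of::<usize>();
    (
        size_of::<Data>() / word,
        size_of::<FakeSsoBox>() / word,
        size_of::<String>() / word,
    )
}

// Accessing a union field is `unsafe` because the compiler cannot know which
// field is currently active; we have to keep track of that ourselves.
fn round_trip(value: usize) -> usize {
    let mut data = Data { buf: MaybeUninit::uninit() };
    unsafe {
        // Store the value in the inline buffer, then read it back.
        data.buf.as_mut_ptr().cast::<usize>().write(value);
        data.buf.as_ptr().cast::<usize>().read()
    }
}
```

The union is as large as its biggest field (two words), so with one word of metadata the whole box comes to three words.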

Next, let's explain the <T as Pointee>::Metadata part. Pointee is a trait that the compiler implements automatically for every type, and it describes how a pointer to that type is structured. For the uninitiated, all kinds of pointers may be "fat" if the type is unsized. You can imagine that all pointer types have some Pointee::Metadata attached to them, but for the case of Sized types, this is just () - a.k.a. nothing. For slices, the pointer stores the length of the slice as its metadata. And for trait objects, the pointer stores a vtable pointer used to dispatch calls for the concrete type. Storing the <T as Pointee>::Metadata on its own allows us to essentially separate the raw data pointer from its metadata so that we can be flexible in how we reconstruct it.
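
The effect of that metadata is visible on stable Rust just by measuring references. A small sketch (the helper names are made up):

```rust
use std::fmt::Debug;
use std::mem::size_of;

// Number of machine words in a reference to `T`.
fn pointer_words<T: ?Sized>() -> usize {
    size_of::<&T>() / size_of::<usize>()
}

// Returns the reference widths for a Sized type, a slice, and a trait object.
fn pointer_widths() -> (usize, usize, usize) {
    (
        pointer_words::<u8>(),        // Sized: metadata is `()`
        pointer_words::<[u8]>(),      // slice: metadata is the length
        pointer_words::<dyn Debug>(), // trait object: metadata is the vtable pointer
    )
}
```

The extra word on the last two is the metadata that SsoBox stores in its meta field.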

A question you may be asking: How do you determine whether the value is stored in-place or via an allocation? And how do you determine that before knowing where the data pointer should point?

unsafe fn inhabitable<T: ?Sized>(meta: <T as Pointee>::Metadata) -> bool {
    let value_layout = Layout::for_value_raw(std::ptr::from_raw_parts(&(), meta));
    let buffer_layout = Layout::new::<[*const (); 2]>();

    value_layout.size() <= buffer_layout.size() && value_layout.align() <= buffer_layout.align()
}

The final check is self-explanatory, but you may notice some trickery in the way I've calculated the value's layout. The metadata should be reflective of the value being stored, but the data pointer I've simply manifested from &(). I can do this because the safety requirements for for_value_raw make no mention that the data pointer needs to be valid. And conceptually, it shouldn't need to be - the size of a trait object is available through the vtable pointer, not the value itself - and the size of a slice is calculated from the length (i.e. the metadata) and the statically known size of the elements.
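
The same idea can be approximated on stable Rust, without for_value_raw. The sketch below uses hypothetical helper names: the slice size is computed from a raw fat pointer whose data half is dangling and never dereferenced, and the trait object size is fetched by size_of_val, which looks it up through the vtable.

```rust
use std::fmt::Debug;
use std::mem::{size_of, size_of_val};
use std::ptr::NonNull;

// Size of a `[u32]` computed from its metadata (the length) alone.
// The data pointer is dangling and is never dereferenced.
fn slice_size_from_len(len: usize) -> usize {
    let dangling: *const u32 = NonNull::<u32>::dangling().as_ptr();
    let fat: *const [u32] = std::ptr::slice_from_raw_parts(dangling, len);
    fat.len() * size_of::<u32>()
}

// Size of a trait object: looked up through the vtable half of the fat pointer.
fn dyn_size(value: &dyn Debug) -> usize {
    size_of_val(value)
}
```

Neither function reads the bytes of the value to learn its size, which is the property the inhabitance check relies on.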

You can view the full implementation in the repository.

Performance

This doesn't come for free. The SSO box now has a condition to check whenever an unsized value is accessed from the box. With a standard Box, its representation is equivalent to that of a reference, so deref-ing it is essentially a no-op. Whereas with SSO box, some deduction is needed to determine whether the value was stored in-place or on the heap:

#[inline(never)]
pub fn deref_trait_demo(sso: &SsoBox<dyn Display>) -> &dyn Display {
    &**sso
}

deref_trait_demo:
    mov rdx, qword ptr [rcx]
    mov r8, qword ptr [rcx + 8]
    lea rax, [rcx + 8]
    cmp qword ptr [rdx + 16], 9    # check alignment
    cmovae rax, r8
    cmp qword ptr [rdx + 8], 17    # check size
    cmovae rax, r8
    ret
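
Read back into Rust, the listing amounts to the selection below. This is a sketch of my reading of the assembly, not the crate's source: select_data_ptr is a hypothetical helper, and the constants assume the two-word buffer on a 64-bit target (matching the 17 and 9 immediates in the comparisons).

```rust
// Buffer layout on a 64-bit target (an assumption taken from the listing).
const BUF_SIZE: usize = 16;
const BUF_ALIGN: usize = 8;

// Hypothetical helper: choose between the address of the inline buffer and the
// heap pointer stored in that same buffer, based on the size and alignment
// found in the vtable.
fn select_data_ptr(
    size: usize,
    align: usize,
    inline: *const (),
    heap: *const (),
) -> *const () {
    if size <= BUF_SIZE && align <= BUF_ALIGN {
        inline
    } else {
        heap
    }
}
```

In the listing the compiler expressed this with two conditional moves (cmovae) rather than a jump.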

I have some micro-benchmarks to try to quantify the costs. Let's start with access - this shows the time taken to call a trait method that returns a constant based on its size:

        | Box        | SsoBox     | diff
empty   | 1.2634 ns  | 1.2727 ns  | +0.74%
small   | 1.2624 ns  | 1.2704 ns  | +0.63%
large   | 1.2651 ns  | 1.2730 ns  | +0.62%
varied  | 1.3725 ns  | 1.8443 ns  | +34.4%

I'm unsure exactly why the difference seems non-existent on the fixed-size benchmarks. I guess it's from the CPU being clever with multiple iterations of the same thing, so the "varied" benchmark uses differently sized trait objects in rapid succession to try to get the worst-case scenario. So in the worst case, accessing an SsoBox takes about half a nanosecond extra - or about 2 clock cycles on my system. Not bad.

But where the small-size optimization shines is in skipping memory allocations, so let's look at the time to create and drop an SsoBox of varying sizes:

        | Box        | SsoBox     | diff
empty   | 0.8503 ns  | 0.6374 ns  | -25.0%
small   | 30.505 ns  | 1.2690 ns  | -95.8%
large   | 31.867 ns  | 31.010 ns  | -2.69%

The "small" case where the value would normally be allocated is night and day! This is using the default Windows allocator, and different allocators may have different costs, but it's clear that it cuts that cost completely out of the equation.

I'm not 100% sure how the "empty" case (a.k.a. a zero-sized value) is faster than the standard Box. My guess is that SsoBox leaves the rest of the box uninitialized while a Box has to write a dummy pointer. Maybe that saves a clock cycle.

Micro-benchmarks are easy to misinterpret, so take the exact numbers with a grain of salt. The benchmarks are available in the repository so you can inspect them or run them for yourself.

Coercion

A small ergonomic annoyance with SsoBox is you can't create one with an unsized type quite like you can with a Box. If you wanted a Box<dyn Trait>, you would first create a Box with a concrete value and then cast it like this:

Box::new(MyStruct::new()) as Box<dyn Trait>

There is an unstable CoerceUnsized trait that can be used to provide unsized coercions for user-defined types. However, it requires a "foundational" type that already implements CoerceUnsized for it to use and <T as Pointee>::Metadata is not one of them. It would be nice if the compiler realized that T::Metadata can be coerced into U::Metadata if T: Unsize<U>, but that may be a stretch.
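
For comparison, this is what the standard pointer types get from CoerceUnsized on stable: the coercion happens implicitly wherever the target type is known, with no cast required. A small sketch with made-up function names:

```rust
use std::fmt::Display;
use std::rc::Rc;

// The return type drives the coercion: Box<u32> -> Box<dyn Display>.
fn boxed_display() -> Box<dyn Display> {
    Box::new(42u32)
}

// The same works for other standard pointers: Rc<&str> -> Rc<dyn Display>.
fn shared_display() -> Rc<dyn Display> {
    Rc::new("hello")
}

// Both trait objects dispatch to the concrete type's Display implementation.
fn describe() -> String {
    format!("{} {}", boxed_display(), shared_display())
}
```

This implicit step is the part that a user-defined box cannot reproduce when its fields do not themselves implement CoerceUnsized.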

So to aid in ergonomics, SsoBox has a new_unsized method to go directly from a concrete value to an unsized box. There is also an into_unsized method to do the same if the concrete SsoBox was already created.

It is a shame SsoBox can't be a drop-in replacement due to this.

Pinning

Unfortunately, even though it is intended for use with trait objects, an SsoBox isn't as nice for dyn Future because of Pin. Pin is a wrapper around pointer types used to provide guarantees that the value pointed to doesn't move. Polling a future requires it to be pinned, but Pin<SsoBox<dyn Future>> can't be created unless the value is Unpin since it cannot guarantee that the value doesn't move - moving an SsoBox would mean moving the value if it were stored in-place. If you wanted to .await an SsoBox<dyn Future>, you'd have to pin it yourself, either to the stack or by putting it in a Box (but that would defeat the purpose).
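
The difference in address stability can be shown on stable Rust without SsoBox at all. In this sketch (the function name is made up), a plain array stands in for a holder that stores its value in-place and a Box stands in for one that allocates; each holder is moved by swapping array slots.

```rust
// Returns (inline_moved, boxed_moved): whether the value's address changed
// after its holder was moved to another array slot.
fn value_moves_with_holder() -> (bool, bool) {
    // Holders that store the value in-place.
    let mut inline = [[1u8; 16], [2u8; 16]];
    let inline_before: *const [u8; 16] = &inline[0];
    inline.swap(0, 1); // the holder of `[1; 16]` now lives in slot 1
    let inline_after: *const [u8; 16] = &inline[1];

    // Holders that store the value behind a heap allocation.
    let mut boxed = [Box::new([1u8; 16]), Box::new([2u8; 16])];
    let boxed_before: *const [u8; 16] = &*boxed[0];
    boxed.swap(0, 1); // only the pointer moves; the allocation stays put
    let boxed_after: *const [u8; 16] = &*boxed[1];

    (inline_before != inline_after, boxed_before != boxed_after)
}
```

A value stored in-place travels with its holder, which is exactly what Pin has to rule out for !Unpin values.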

If you wanted to .await without hassle, it'd have to be Pin<SsoBox<dyn Future + Unpin>>, which isn't an unreasonable pattern. Small Unpin futures can be stored in-place while !Unpin futures can first be boxed for stability. Boxes within boxes. So it's possible, but from my experience it seems unlikely to be beneficial, especially if you're using async.
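
The dyn Future + Unpin pattern looks like this on stable, with Box standing in for SsoBox so that the sketch builds without nightly. The NoopWake type and the function name are made up; the future is polled by hand instead of using an executor.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Minimal waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// `Box` stands in for `SsoBox` here so that the sketch builds on stable.
fn poll_unpin_future() -> Poll<u32> {
    // `std::future::Ready` is `Unpin`, so the trait object can promise `Unpin` too.
    let fut: Box<dyn Future<Output = u32> + Unpin> = Box::new(std::future::ready(5));

    // `Pin::new` is safe and available only because the pointee is `Unpin`.
    let mut pinned: Pin<Box<dyn Future<Output = u32> + Unpin>> = Pin::new(fut);

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    pinned.as_mut().poll(&mut cx)
}
```

A future produced by an async block is generally !Unpin, so it would need to be boxed and pinned first before it could go behind such a trait object.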

Prior Art

Crates like smallvec and compact-str, which can store slices in-place and dynamically transition to an allocation if needed, have been around for a long time, but the layouts of slices are simple and stable. Handling unsized types in general introduces totally different concerns.

EDIT: It has been brought to my attention that the smallbox crate reaches for the same goal while also being available on the stable compiler. The implementation there would incur a pointer's worth of extra space when the value is stored in-place due to using a full *const T, so my implementation still has that optimization going for it. It looks like the go-to crate for use on stable, but its ergonomics would also be better with the nightly unsize feature.

You can also find discussions of people thinking to use allocators directly in a Box<_, A> to provide a similar effect. However by the wording of Allocator that would be unsafe, and I'm not sure that would be possible with Box's current design around pointer stability. It would need a big retrofit to make that work.

Future Work

What I have implemented is workable but I think some additions would make sense.

I will likely incorporate custom sizes for the internal buffer. Holding two machine words seemed like a happy medium in my opinion. But it's easy to see that some may desire a single machine word so that it matches the existing size of a Box, or a much larger buffer based on their expected values if the bloated size of the box is an acceptable compromise.
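
One way this could look is a const generic parameter on the union. This is purely a sketch of the idea, not a planned API: SizedData and the WORDS parameter are hypothetical names.

```rust
use std::mem::{size_of, MaybeUninit};

// Hypothetical const-generic variant of the union: `WORDS` pointer-sized slots.
#[allow(dead_code)]
union SizedData<const WORDS: usize> {
    ptr: *const (),
    buf: MaybeUninit<[*const (); WORDS]>,
}

// Size of the union in machine words for a given buffer length.
fn data_words<const WORDS: usize>() -> usize {
    size_of::<SizedData<WORDS>>() / size_of::<usize>()
}
```

With WORDS set to 1 the union is a single word like the pointer in a Box, and larger values trade box size for more in-place capacity.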

It might also make sense to incorporate custom allocator support, which is also unstable. Environments with custom allocators may have more desire for minimizing costs, but custom allocators can also be much cheaper than general allocators, so the gains might not be so large. I'm not sure.

I provided many trait implementations that the standard library provides for their Box, but I skipped many of them that were singularly for str or slices. PRs would be welcome if they reflect what the standard library provides.

Conclusion

I'm pretty happy with the implementation. It mostly works just like a normal Box, only with a new shortcut that improves performance without costing much. Miri has been used throughout the development process to ensure no obvious undefined behavior is being relied on.

I do wish it were on stable though. Here are the unstable features that are used:

layout_for_ptr - this enables the inhabitance check to get layout information using only a pointer
ptr_metadata - this is what allows you to split fat pointers apart and access their metadata
unsize - this provides constraints for generically performing unsizing coercions

I hope they get stabilized eventually.

Anyways, thanks for reading! Hope you found it insightful or at least interesting. I wish you all well on your own unsafe adventures.