Runtime Performance

bitvec increases the instruction cost of each access to a bool in its data structures. This is an inevitable consequence of the fact that, even on architectures that have them, compilers typically do not emit object-code instructions that access individual bits within a memory location. Therefore, each access in bitvec requires, in addition to a memory operation, one or two shift instructions and one or two AND/OR/NOT instructions.
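To make the cost concrete, here is the same arithmetic written by hand. This is a sketch of the general technique only, not bitvec’s actual implementation, which also handles bit-ordering and alias safety:

```rust
// Reading a `bool` out of a slice is a single load; reading one bit out
// of packed storage adds index arithmetic, a shift, and a mask.
fn read_bool(flags: &[bool], index: usize) -> bool {
    flags[index]
}

fn read_bit(bits: &[u8], index: usize) -> bool {
    (bits[index / 8] >> (index % 8)) & 1 != 0
}

// Writes need the same arithmetic, plus an OR (to set) or an AND/NOT
// (to clear) before storing back.
fn write_bit(bits: &mut [u8], index: usize, value: bool) {
    let mask = 1u8 << (index % 8);
    if value {
        bits[index / 8] |= mask;
    } else {
        bits[index / 8] &= !mask;
    }
}
```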

This means that, inevitably, bitvec is slower in CPU time than [bool] is. Measurements indicate a slowdown of roughly a factor of ten, with about ten times the variance as well. However, this cost only becomes apparent and meaningful when walking the entirety of very large buffers; it tends to fade into noise on smaller buffers, or to be obviated entirely by compile-time fixed accesses. As always, try it on a representative workload.
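If you want to run that measurement yourself, a minimal harness with the criterion crate might look like the following sketch. The buffer size and workload here are placeholders; substitute your own:

```rust
use bitvec::prelude::*;
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Compare counting the set bits of a plain `Vec<bool>` against a
// `BitVec` of the same logical length. `black_box` keeps LLVM from
// solving the benchmark at compile time (see the next section).
fn compare(c: &mut Criterion) {
    let bools = vec![false; 1 << 20];
    let bits = bitvec![usize, Lsb0; 0; 1 << 20];
    c.bench_function("vec of bool", |b| {
        b.iter(|| black_box(&bools).iter().filter(|&&v| v).count())
    });
    c.bench_function("bitvec", |b| {
        b.iter(|| black_box(bits.as_bitslice()).count_ones())
    });
}

criterion_group!(benches, compare);
criterion_main!(benches);
```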

Benchmarking

I have tried (with admittedly low priority) to keep some benchmarks in the project to back up my claims that bitvec is fast. They have been difficult to maintain for a few reasons, but at least a few have stayed present in order to demonstrate important claims, such as that specialization for matching type parameters provides a massive performance benefit (it does).

In particular, LLVM is very good at propagating compile-time constants through the code, and bitvec strives to maintain an internal implementation that is easily accessible to optimizers. This means that basically any benchmark that takes input from a source file that the compiler can see gets artificially solved during codegen.

For instance, I don’t know how long it takes to construct a &BitSlice view over memory, because my benchmarks report 0ns: LLVM computes my pointer encoding at compile time, and I designed that encoding so that the only operation BitSlice::from_slice actually performs is .len << 3. When LLVM can see the original length, it does this arithmetic itself and emits an immediate with the correct length instead of the constructor call.
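You can observe this yourself with the public view APIs. In a sketch like the one below, everything is visible at compile time, so the entire function can fold down to a constant:

```rust
use bitvec::prelude::*;

// LLVM sees a 16-element `u8` array, so the encoded length is just
// `16 << 3`: the function compiles to `return 128`, and there is
// nothing left to measure.
fn length_of_view() -> usize {
    let data = [0u8; 16];
    let bits = data.view_bits::<Lsb0>();
    bits.len()
}
```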

Other constructor benchmarks only show me the time required to run memcpy, and arbitrary indexing only shows the time required to run three instructions, because LLVM solved the shift/mask arguments ahead of time.

The important takeäway here is that if your code is at all dependent on constants that the compiler can see, and is not exclusively performing indexing based on runtime inputs, then bitvec is going to be plenty fast.

Pitfalls

Everything stated above relies on having information available to the compiler. bitvec falls behind other bit-reading libraries when you start using access patterns only known at runtime, such as iteration.

Until Rust stabilizes the ptr_metadata feature (see #81513), bitvec will necessarily take a performance hit because it has to decode the &BitSlice pointer every time you access memory, and reëncode it every time you munch through the region.

The bitvec pointer encoding (described here) requires manipulating both words in the pointer with at least three instructions each.

Except when using `u8` storage, which discards *most* of the modifications to the address half of the pointer. Try that out if you do a lot more munching through a region than bulk work on its contents!

This would not be necessary if the pointer could use its own metadata type rather than usize. Until that stabilizes, the entire value proposition of the crate rests on the fact that &BitSlice has slice-pointer ABI. If you absolutely cannot afford to have decoding/reëncoding costs in your hot loop, you may have to try other libraries.
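The ABI half of that claim is easy to check for yourself; this assertion is mine, not part of the crate, but it holds by construction:

```rust
use bitvec::prelude::*;
use core::mem::size_of;

fn main() {
    // &BitSlice occupies the same two words as an ordinary slice
    // reference, which is what lets it travel through slice-pointer ABI.
    assert_eq!(size_of::<&BitSlice<u8, Lsb0>>(), size_of::<&[u8]>());
}
```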

bitvec strives to use batch operations on entire integer registers when possible. However, doing so requires routing through the domain module, which has to perform similar decoding and processing of a &BitSlice pointer every time it is entered. This is another unavoidable cost. It is generally a tolerable overhead compared to walking through each bit individually, especially with wider storage integers, but it is one more thing that other bit-addressing libraries don’t do.
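For intuition, here is the shape of that head/body/tail split, written against plain integers rather than bitvec’s actual Domain API. The function and its signature are illustrative only:

```rust
/// Count the set bits in the bit-index range `start .. start + len`
/// of `words`. The partially-occupied elements at each edge are
/// masked off; the fully-spanned middle is consumed a whole word at
/// a time instead of bit by bit.
fn count_range(words: &[u32], start: usize, len: usize) -> u32 {
    const BITS: usize = u32::BITS as usize;
    let end = start + len;
    assert!(end <= words.len() * BITS);
    let first = start / BITS;
    let last = (end + BITS - 1) / BITS;
    let mut total = 0;
    for (offset, &elem) in words[first .. last].iter().enumerate() {
        let mut elem = elem;
        let base = (first + offset) * BITS;
        if base < start {
            // Head element: discard the bits below `start`.
            elem &= !0 << (start - base);
        }
        if base + BITS > end {
            // Tail element: discard the bits at and above `end`.
            elem &= !0 >> (base + BITS - end);
        }
        total += elem.count_ones();
    }
    total
}
```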

Other libraries also do not have alias safety, or tolerance for starting a region away from the zero index of an integer. Power has a price. We do our best to cut down the library’s runtime cost as much as possible, but this computation simply has to be done somewhere, and it’s not built into silicon anymore. Sorry.

Specialization

bitvec strives to use its knowledge of the underlying memory representation wherever possible. This means that operations between BitSlices with the same type parameters can rely on an identical representation and use integer behavior rather than walking each bit individually.

Try to do this wherever possible, especially in performance-sensitive code. You typically should not be mixing bitvec structures with different type parameters anyway: use the representation that serves your needs, and keep using it in all buffers that interact with it.
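Concretely, copy_from_bitslice requires both operands to have identical type parameters and can therefore move whole elements at once, while clone_from_bitslice accepts any parameter combination and may have to walk bit by bit. The wrapper functions here are mine:

```rust
use bitvec::prelude::*;

// Identical type parameters: eligible for batch element moves.
fn copy_fast(dst: &mut BitSlice<usize, Lsb0>, src: &BitSlice<usize, Lsb0>) {
    dst.copy_from_bitslice(src); // panics if the lengths differ
}

// Mixed type parameters: must translate each bit individually.
fn copy_slow(dst: &mut BitSlice<usize, Lsb0>, src: &BitSlice<u8, Msb0>) {
    dst.clone_from_bitslice(src); // panics if the lengths differ
}
```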

As an example, as of 1.0, walking two large BitSlices with incompatible type parameters takes 3µs, while walking the same size with identical type parameters takes 100ns. That is roughly a 30x performance difference, which is only about half the 64x speedup I expected from using usize on a 64-bit machine, but it is still quite stark.