The RPC protocol uses flatbuffers to make cross-language support possible, in addition to
its main goal of providing zero-cost deserialization on the receiving side.
Currently, Go and Java are in the process of being added to the repo,
with partial support already merged.
On Friday, June 28, 2018, I got a text from a friend
that said: “last time I checked, flatbuffers was slow vs Cap’n Proto.”
At first I was suspicious, since I’ve never seen flatbuffers show up
near the top of any perf profile on real production applications. However,
I had no numbers or benchmarks to prove it.
I thought I had a quick-and-dirty hack to amortize the cost of object graphs -
it never occurred to me to measure large type trees or large buffers.
BLUF - Bottom Line Up Front.
I had introduced a performance optimization that actually turned out to be
bad for large buffers.
To be precise:
On very small buffers it was ~6% faster
On large buffers it was ~41% slower
nooooooooooooooooooooooooooo!
The beginning of yak-shaving
I found a project that had benchmarked
Cap’n Proto vs flatbuffers. In particular, it measures buffer construction,
since deserialization in both Cap’n Proto and flatbuffers is effectively a pointer cast - yay
for little-endian enforced encoding into an aligned byte array.
The cpp-serializers project
measures Cap’n Proto vs flatbuffers encoding, and to my surprise,
flatbuffers measured 2.5X slower.
As you can see, the benchmark creates a flatbuffers builder (a kind of std::vector). It stresses
the builder by allocating an array of strings and ints… and that’s about it.
Internally, the flatbuffers::FlatBufferBuilder encodes from top to bottom (downward growth), and when
it reaches the bottom, it reallocates and grows the underlying array to fit the contents.
The actual code for reallocation is:
It grows by powers of two, aligning the new size with (reserved_ + buffer_minalign_ - 1) & ~(buffer_minalign_ - 1),
and usually starts at 1024 bytes, the default initial_size_.
Setup
Let’s start with a Flatbuffers IDL of a simple key=value struct.
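A minimal sketch of what such a schema might look like (the namespace and field names here are hypothetical placeholders, not the project's actual schema):

```
// kv.fbs - hypothetical schema for illustration
namespace smf_bench;

table kv {
  key:   string;
  value: string;
}

root_type kv;
```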
The C++ API has a nice object API on top of the raw flatbuffers::FlatBufferBuilder, so
a quick GoogleBenchmark yields something like this:
A few things to note:
We use auto mem = bdr.Release(); and wrap that into a seastar::temporary_buffer<char> with
effectively zero copy (+- some pointer assignments).
This is important because our entire messaging API is about bridging the code generation from
flatbuffers and seastar.
In addition, our RPC mechanism code-generates the seastar <–> flatbuffers glue code.
Encoding two 64KB buffers takes 6.8 micros. Yikes! - But is this bad?
(as of flatbuffers commit 34cb163e389e928db08ed2bd0e16ee0ac53ab1ce).
Note this is only 3X the cost of std::malloc + std::memset, which is the fastest
thing I can think of as a baseline comparison.
Before we dive into possible optimizations, let’s fix up our desktop for these
micro-optimizations.
Fixing the environment
My CPU is an
Intel(R) Xeon(R) CPU E3-1535M v6 @ 3.10GHz
with a turbo of 4.20GHz.
That’s great for desktop experience where interactivity matters and terrible
for performance benchmarking.
When I first tried to show the results to my partner
(non-CS major, but puts up with me asking her to stare at my screen), I had
unpredictable results.
Armed with the shell functions above, my checklist is now as follows:
Settings
Ensure that your BIOS says performance when connected to AC
Check your CPU frequencies via cat /proc/cpuinfo | grep model
Set your CPU governor via cpu_enable_performance_cpupower_state
Set the min frequency to the frequency reported by your CPU/model/vendor via: cpu_set_min_frequencies 3100000 in my case
Set the max frequency to the frequency reported by your CPU/model/vendor via: cpu_set_max_frequencies 3100000 in my case
Verify that you always build in Release mode
Verifying the frequencies is now simple via cpu_available_frequencies
The takeaway is that you don’t necessarily want speed; you want predictability.
Next - Compiler Explorer - my new fav tool
First, let’s execute the commands above and get the compiler flags that we’ll need for compiler explorer.
If you are using CMake, follow these screencasts and don’t forget to set
set(CMAKE_EXPORT_COMPILE_COMMANDS 1) on your main CMakeLists.txt file
Now that we have the compiler flags ready, let’s fire up compiler explorer:
When you navigate to localhost:10240 you are welcomed by the usual, friendly CompilerExplorer UI.
With these tools we can now benchmark and optimize our code.
The results
As expected, for small buffers there is a large cost versus the malloc+memset baseline, and
at the higher end, allocation and byte traversal start to dominate.
The range is ~58X (worst) to ~2.8X (best).
Our benched code
Lessons Learned: Benchmarking with turbo is dishonest.
Not that people writing OSS (or any software really) are out there to get you.
It is just easy to forget to tune your machine specifically for benchmarking.
There is no such thing as a quick-and-dirty benchmark, especially if the
results are not categorically different (e.g., 1 minute vs 10 minutes vs 1 hour).
Let’s compare the stability of multiple runs with turbo vs without.
That’s a stddev difference of up to 768X!!!
No-turbo means precisely this:
SMF
smf just got 40% faster for large buffers thanks to these benchmarks -
if you give it a shot, let me know. Stay tuned
for the Java and Go code generators and performance benchmarks, coming soon.
Let me know on twitter @emaxerrno if you found this useful, if you spot any misinformation,
or if you have additional performance tuning tips.
Special thanks to my partner Sarah Rohrbach as well as Chris Heller, Noah Watkins
for reading earlier drafts of this post.