I build and maintain smf - the fastest RPC in the west. The main language supported is C++. We use Seastar as the actual asynchronous framework.

The RPC protocol uses flatbuffers to make cross language support possible, in addition to its main goal of providing zero cost deserialization on the receiving side. Currently, Go and Java are in the process of being added to the repo, with partial support already merged.

On Friday, June 28 2018, I got a text from a friend that said: "last time I checked, flatbuffers was slow vs cap'n proto." At first I was suspicious, since I'd never seen flatbuffers show up at the top of any perf profile on real production applications. However, I had no numbers or benchmarks to prove it.

I thought I had a quick-and-dirty hack to amortize the cost of object graphs - it never occurred to me to measure large type trees or large buffers.

BLUF - Bottom Line Up Front.

I had introduced a performance optimization that actually turned out to be bad for large buffers.

To be precise:

  • On very small buffers it was ~6% faster
  • On large buffers it was ~41% slower

nooooooooooooooooooooooooooo!

The beginning of yak-shaving

I found a project that had benchmarked cap'n proto vs flatbuffers. In particular, it measures buffer construction, since for both cap'n proto and flatbuffers, deserialization is effectively a pointer cast (yay for enforced little-endian encoding into an aligned byte array).

The cpp-serializers project measures cap’n proto vs flatbuffers encoding and to my surprise, flatbuffers measured at 2.5X slower.

Their test:

auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < iterations; i++) {
    flatbuffers::FlatBufferBuilder builder;
    strings.clear();

    for (size_t i = 0; i < kStringsCount; i++) {
        strings.push_back(builder.CreateString(kStringValue));
    }

    auto ids_vec = builder.CreateVector(kIntegers);
    auto strings_vec = builder.CreateVector(strings);
    auto r1 = CreateRecord(builder, ids_vec, strings_vec);
    builder.Finish(r1);

    auto p = reinterpret_cast<char*>(builder.GetBufferPointer());
    auto sz = builder.GetSize();
    std::vector<char> buf(p, p + sz);
    auto r2 = GetRecord(buf.data());
    (void)r2->ids()[0];

    builder.ReleaseBufferPointer();
}
auto finish = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(finish - start).count();

As you can see, the benchmark creates a flatbuffers builder (a kind of std::vector). It stresses the builder by allocating an array of strings and ints… and that’s about it.

Internally, the flatbuffers::FlatBufferBuilder encodes from top-to-bottom (the buffer grows downward), and when it reaches the bottom it reallocates, growing the underlying array to fit the contents.

The actual code for reallocation is:

void reallocate(size_t len) {
    FLATBUFFERS_ASSERT(allocator_);
    auto old_reserved = reserved_;
    auto old_size = size();
    auto old_scratch_size = scratch_size();
    reserved_ += (std::max)(len,
                            old_reserved ? old_reserved / 2 : initial_size_);
    reserved_ = (reserved_ + buffer_minalign_ - 1) & ~(buffer_minalign_ - 1);
    if (buf_) {
      buf_ = allocator_->reallocate_downward(buf_, old_reserved, reserved_,
                                             old_size, old_scratch_size);
    } else {
      buf_ = allocator_->allocate(reserved_);
    }
    cur_ = buf_ + reserved_ - old_size;
    scratch_ = buf_ + old_scratch_size;
}

It grows by at least half of the current reservation (reserved_ += max(len, old_reserved / 2)), rounds the result up to buffer_minalign_ via (reserved_ + buffer_minalign_ - 1) & ~(buffer_minalign_ - 1), and starts at initial_size_, which defaults to 1024 bytes.

Setup

Let’s start with a Flatbuffers IDL of a simple key=value struct.

    table kvpair {
      key:   string;
      value: string;
    }

The C++ API has a nice object API (generated with flatc's --gen-object-api flag) on top of the raw flatbuffers::FlatBufferBuilder, so a quick Google Benchmark harness looks something like this:

static inline kvpairT
gen_kv(uint32_t sz) {
  kvpairT ret;
  ret.key.resize(sz, 'x');
  ret.value.resize(sz, 'y');
  return ret;
}

static void
BM_alloc_simple(benchmark::State &state) {
  for (auto _ : state) {
    state.PauseTiming();
    auto kv = gen_kv(state.range(0));
    state.ResumeTiming();

    // build it!
    flatbuffers::FlatBufferBuilder bdr;
    bdr.Finish(kvpair::Pack(bdr, &kv, nullptr));
    auto mem = bdr.Release();
    auto ptr = reinterpret_cast<char *>(mem.data());
    auto sz = mem.size();
    auto buf = seastar::temporary_buffer<char>(
      ptr, sz, seastar::make_object_deleter(std::move(mem)));
    benchmark::DoNotOptimize(buf);
  }
}
BENCHMARK(BM_alloc_simple)
  ->Args({1 << 1, 1 << 1})
  ->Args({1 << 2, 1 << 2})
  ->Args({1 << 4, 1 << 4})
  ->Args({1 << 8, 1 << 8})
  ->Args({1 << 12, 1 << 12})
  ->Args({1 << 16, 1 << 16})
  ->Args({1 << 18, 1 << 18});

A few things to note:

  • We use auto mem = bdr.Release(); and wrap that into a seastar::temporary_buffer<char> with effectively zero copy (+- some pointer assignments).
  • This is important because our entire messaging API is about bridging the code generation from flatbuffers and seastar.
  • In addition to this, our RPC mechanism code-generates the seastar <–> flatbuffers glue code.
---------------------------------------------------------------------------
Benchmark                                    Time           CPU Iterations
---------------------------------------------------------------------------
...

BM_alloc_simple/2/2/threads:1                        586 ns        588 ns    1184218
BM_alloc_simple/4/4/threads:1                        586 ns        589 ns    1181340
BM_alloc_simple/16/16/threads:1                      598 ns        600 ns    1153577
BM_alloc_simple/256/256/threads:1                    601 ns        602 ns    1147963
BM_alloc_simple/4096/4096/threads:1                  824 ns        828 ns     837036
BM_alloc_simple/65536/65536/threads:1               6825 ns       6845 ns     101321

...

Encoding two 64KB strings takes 6.8 micros. Yikes! But is this bad? (as of flatbuffers commit 34cb163e389e928db08ed2bd0e16ee0ac53ab1ce). Note this is only ~3X the cost of std::malloc + std::memset, which is the fastest baseline I can think of for comparison.

Before we dive into possible optimizations, let's set up the desktop for these micro-benchmarks.

Fixing the environment

function cpu_disable_performance_cpupower_state(){
    sudo cpupower frequency-set --governor powersave
}
function cpu_enable_performance_cpupower_state(){
    sudo cpupower frequency-set --governor performance
}
function cpu_available_frequencies() {
    for i in /sys/devices/system/cpu/cpu[0-9]*; do
        echo "$i:"
        echo "$i/cpufreq/scaling_min_freq: $(cat $i/cpufreq/scaling_min_freq)";
        echo "$i/cpufreq/scaling_max_freq: $(cat $i/cpufreq/scaling_max_freq)";
    done
}

function cpu_set_min_frequencies() {
    local freq=$1;
    if [[ -z "$freq" ]]; then return 1; fi
    for i in /sys/devices/system/cpu/cpu[0-9]*; do
        echo "$i:"
        echo "$i/cpufreq/scaling_min_freq: $(cat $i/cpufreq/scaling_min_freq)";
        echo "$freq" | sudo tee "$i/cpufreq/scaling_min_freq"
        echo "$i/cpufreq/scaling_min_freq: $(cat $i/cpufreq/scaling_min_freq)";
    done
}

function cpu_set_max_frequencies() {
    local freq=$1;
    if [[ -z "$freq" ]]; then return 1; fi
    for i in /sys/devices/system/cpu/cpu[0-9]*; do
        echo "$i:"
        echo "$i/cpufreq/scaling_max_freq: $(cat $i/cpufreq/scaling_max_freq)";
        echo "$freq" | sudo tee "$i/cpufreq/scaling_max_freq"
        echo "$i/cpufreq/scaling_max_freq: $(cat $i/cpufreq/scaling_max_freq)";
    done
}

My CPU is an Intel(R) Xeon(R) CPU E3-1535M v6 @ 3.10GHz with a turbo of 4.20GHz. That's great for a desktop experience where interactivity matters, and terrible for performance benchmarking.

When I first tried to show the results to my partner (non CS Major, but puts up w/ me asking her to stare at my screen) I had unpredictable results.

Armed w/ the shell funcs above, my checklist is now as follows:

Settings

  • Ensure that your BIOS says performance when connected to AC
  • Check your CPU frequencies via grep 'model name' /proc/cpuinfo
  • Set your CPU governor via cpu_enable_performance_cpupower_state
  • Set the min frequency to the frequency reported by your CPU/model/vendor via: cpu_set_min_frequencies 3100000 in my case
  • Set the max frequency to the frequency reported by your CPU/model/vendor via: cpu_set_max_frequencies 3100000 in my case
  • Verify that you always build in Release mode

Verifying the frequencies is now simple via cpu_available_frequencies:

 $> cpu_available_frequencies
/sys/devices/system/cpu/cpu0:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq: 3100000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: 3100000
/sys/devices/system/cpu/cpu1:
/sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq: 3100000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq: 3100000
/sys/devices/system/cpu/cpu2:
/sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq: 3100000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq: 3100000
/sys/devices/system/cpu/cpu3:
/sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq: 3100000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq: 3100000
/sys/devices/system/cpu/cpu4:
/sys/devices/system/cpu/cpu4/cpufreq/scaling_min_freq: 3100000
/sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq: 3100000
/sys/devices/system/cpu/cpu5:
/sys/devices/system/cpu/cpu5/cpufreq/scaling_min_freq: 3100000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq: 3100000
/sys/devices/system/cpu/cpu6:
/sys/devices/system/cpu/cpu6/cpufreq/scaling_min_freq: 3100000
/sys/devices/system/cpu/cpu6/cpufreq/scaling_max_freq: 3100000
/sys/devices/system/cpu/cpu7:
/sys/devices/system/cpu/cpu7/cpufreq/scaling_min_freq: 3100000
/sys/devices/system/cpu/cpu7/cpufreq/scaling_max_freq: 3100000

The takeaway: you don't necessarily want speed, you want predictability.
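Relatedly, on Intel CPUs running the intel_pstate driver you can also turn turbo off entirely (a sketch; the sysfs path assumes intel_pstate is the active driver on your machine):

```shell
# Disable turbo boost under intel_pstate; pairs with pinning min == max above.
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# Re-enable it when you are done benchmarking:
echo 0 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```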

Next - Compiler Explorer - my new fav tool

First, let’s execute the commands above and get the compiler flags that we’ll need for compiler explorer.

If you are using CMake, follow these screencasts and don't forget to add set(CMAKE_EXPORT_COMPILE_COMMANDS 1) to your main CMakeLists.txt file.


Now that we have the compiler flags ready, let’s fire up compiler explorer:

When you navigate to localhost:1024 you are welcomed with the usual and friendly CompilerExplorer UI.


With these tools we can now benchmark & optimize our code

The results

Running /home/agallego/workspace/smf/build/release/src/benchmarks/fbs_alloc/smf_fbsalloc_benchmark_test
Run on (8 X 4200 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
---------------------------------------------------------------------------
Benchmark                                    Time           CPU Iterations
---------------------------------------------------------------------------
BM_malloc_base/2/2/threads:1                          11 ns         11 ns   62858390
BM_malloc_base/4/4/threads:1                          11 ns         11 ns   61565906
BM_malloc_base/16/16/threads:1                        11 ns         11 ns   64118519
BM_malloc_base/256/256/threads:1                      20 ns         20 ns   35574778
BM_malloc_base/4096/4096/threads:1                    98 ns         98 ns    7123958
BM_malloc_base/65536/65536/threads:1                2339 ns       2336 ns     281254
BM_malloc_base/262144/262144/threads:1             11648 ns      11635 ns      58503
BM_alloc_simple/2/2/threads:1                        586 ns        588 ns    1184218
BM_alloc_simple/4/4/threads:1                        586 ns        589 ns    1181340
BM_alloc_simple/16/16/threads:1                      598 ns        600 ns    1153577
BM_alloc_simple/256/256/threads:1                    601 ns        602 ns    1147963
BM_alloc_simple/4096/4096/threads:1                  824 ns        828 ns     837036
BM_alloc_simple/65536/65536/threads:1               6825 ns       6845 ns     101321
BM_alloc_simple/262144/262144/threads:1            32838 ns      32784 ns      21351
BM_alloc_thread_local/2/2/threads:1                  557 ns        559 ns    1220122
BM_alloc_thread_local/4/4/threads:1                  557 ns        559 ns    1221065
BM_alloc_thread_local/16/16/threads:1                569 ns        571 ns    1198623
BM_alloc_thread_local/256/256/threads:1              578 ns        580 ns    1181208
BM_alloc_thread_local/4096/4096/threads:1            806 ns        819 ns     845698
BM_alloc_thread_local/65536/65536/threads:1        10182 ns      10237 ns      68206
BM_alloc_thread_local/262144/262144/threads:1      43066 ns      42998 ns      16279
BM_alloc_hybrid/2/2/threads:1                        563 ns        565 ns    1212235
BM_alloc_hybrid/4/4/threads:1                        563 ns        566 ns    1209207
BM_alloc_hybrid/16/16/threads:1                      577 ns        579 ns    1185042
BM_alloc_hybrid/256/256/threads:1                    587 ns        589 ns    1166182
BM_alloc_hybrid/4096/4096/threads:1                  843 ns        847 ns     819194
BM_alloc_hybrid/65536/65536/threads:1               6810 ns       6832 ns     101497
BM_alloc_hybrid/262144/262144/threads:1            32838 ns      32774 ns      21368

As expected, for small buffers, there is a large cost vs the base of malloc+memset, and at the higher ends, the allocation & byte traversal start to dominate. The range is ~58x (worst) - ~2.8X (best).

Our benched code

#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <thread>

#include <benchmark/benchmark.h>
#include <core/print.hh>
#include <core/sstring.hh>
#include <core/temporary_buffer.hh>

#include "kv_generated.h"
#include "smf/native_type_utils.h"

static inline kvpairT
gen_kv(uint32_t sz) {
  kvpairT ret;
  ret.key.resize(sz, 'x');
  ret.value.resize(sz, 'y');
  return ret;
}

static void
BM_malloc_base(benchmark::State &state) {
  for (auto _ : state) {
    auto buf = reinterpret_cast<char *>(std::malloc(2 * state.range(0)));
    std::memset(buf, 'x', 2 * state.range(0));
    benchmark::DoNotOptimize(buf);
    std::free(buf);
  }
}
BENCHMARK(BM_malloc_base)
  ->Args({1 << 1, 1 << 1})
  ->Args({1 << 2, 1 << 2})
  ->Args({1 << 4, 1 << 4})
  ->Args({1 << 8, 1 << 8})
  ->Args({1 << 12, 1 << 12})
  ->Args({1 << 16, 1 << 16})
  ->Args({1 << 18, 1 << 18})
  ->Threads(1);


static void
BM_alloc_simple(benchmark::State &state) {
  for (auto _ : state) {
    state.PauseTiming();
    auto kv = gen_kv(state.range(0));
    state.ResumeTiming();

    // build it!
    flatbuffers::FlatBufferBuilder bdr;
    bdr.Finish(kvpair::Pack(bdr, &kv, nullptr));
    auto mem = bdr.Release();
    auto ptr = reinterpret_cast<char *>(mem.data());
    auto sz = mem.size();
    auto buf = seastar::temporary_buffer<char>(
      ptr, sz, seastar::make_object_deleter(std::move(mem)));
    benchmark::DoNotOptimize(buf);
  }
}
BENCHMARK(BM_alloc_simple)
  ->Args({1 << 1, 1 << 1})
  ->Args({1 << 2, 1 << 2})
  ->Args({1 << 4, 1 << 4})
  ->Args({1 << 8, 1 << 8})
  ->Args({1 << 12, 1 << 12})
  ->Args({1 << 16, 1 << 16})
  ->Args({1 << 18, 1 << 18})
  ->Threads(1);

static void
BM_alloc_thread_local(benchmark::State &state) {
  static thread_local flatbuffers::FlatBufferBuilder bdr(1024);
  for (auto _ : state) {
    state.PauseTiming();
    auto kv = gen_kv(state.range(0));
    state.ResumeTiming();

    // key operations
    bdr.Clear();
    bdr.Finish(kvpair::Pack(bdr, &kv, nullptr));
    // end key operations

    // std::cout << "Size to copy: " << bdr.GetSize() << std::endl;

    void *ret = nullptr;
    auto r = posix_memalign(&ret, 8, bdr.GetSize());
    if (r == ENOMEM) {
      throw std::bad_alloc();
    } else if (r == EINVAL) {
      throw std::runtime_error(seastar::sprint(
        "Invalid alignment of %d; allocating %d bytes", 8, bdr.GetSize()));
    }
    DLOG_THROW_IF(r != 0,
      "ERRNO: {}, Bad aligned allocation of {} with alignment: {}", r,
      bdr.GetSize(), 8);
    std::memcpy(ret, reinterpret_cast<const char *>(bdr.GetBufferPointer()),
      bdr.GetSize());
    benchmark::DoNotOptimize(ret);
    std::free(ret);
  }
}
BENCHMARK(BM_alloc_thread_local)
  ->Args({1 << 1, 1 << 1})
  ->Args({1 << 2, 1 << 2})
  ->Args({1 << 4, 1 << 4})
  ->Args({1 << 8, 1 << 8})
  ->Args({1 << 12, 1 << 12})
  ->Args({1 << 16, 1 << 16})
  ->Args({1 << 18, 1 << 18})
  ->Threads(1);


template <typename RootType>
seastar::temporary_buffer<char>
hybrid(const typename RootType::NativeTableType &t) {
  static thread_local auto builder =
    std::make_unique<flatbuffers::FlatBufferBuilder>();

  auto &bdr = *builder.get();
  bdr.Clear();
  bdr.Finish(RootType::Pack(bdr, &t, nullptr));

  if (SMF_UNLIKELY(bdr.GetSize() > 2048)) {
    auto mem = bdr.Release();
    auto ptr = reinterpret_cast<char *>(mem.data());
    auto sz = mem.size();

    // fix the original builder
    builder = std::make_unique<flatbuffers::FlatBufferBuilder>();

    return seastar::temporary_buffer<char>(
      ptr, sz, seastar::make_object_deleter(std::move(mem)));
  }


  // always allocate to the largest member 8-bytes
  void *ret = nullptr;
  auto r = posix_memalign(&ret, 8, bdr.GetSize());
  if (r == ENOMEM) {
    throw std::bad_alloc();
  } else if (r == EINVAL) {
    throw std::runtime_error(seastar::sprint(
      "Invalid alignment of %d; allocating %d bytes", 8, bdr.GetSize()));
  }
  DLOG_THROW_IF(r != 0,
    "ERRNO: {}, Bad aligned allocation of {} with alignment: {}", r,
    bdr.GetSize(), 8);
  seastar::temporary_buffer<char> retval(reinterpret_cast<char *>(ret),
    bdr.GetSize(), seastar::make_free_deleter(ret));
  std::memcpy(retval.get_write(),
    reinterpret_cast<const char *>(bdr.GetBufferPointer()), retval.size());
  return retval;  // implicit move of the local; std::move would inhibit RVO
}

static void
BM_alloc_hybrid(benchmark::State &state) {
  for (auto _ : state) {
    state.PauseTiming();
    auto kv = gen_kv(state.range(0));
    state.ResumeTiming();
    auto buf = hybrid<kvpair>(kv);
    benchmark::DoNotOptimize(buf);
  }
}
BENCHMARK(BM_alloc_hybrid)
  ->Args({1 << 1, 1 << 1})
  ->Args({1 << 2, 1 << 2})
  ->Args({1 << 4, 1 << 4})
  ->Args({1 << 8, 1 << 8})
  ->Args({1 << 12, 1 << 12})
  ->Args({1 << 16, 1 << 16})
  ->Args({1 << 18, 1 << 18})
  ->Threads(1);


BENCHMARK_MAIN();

Lessons Learned: Benchmarking with turbo is dishonest.

Not that people writing OSS (or any software, really) are out to get you; it is just easy to forget to tune your machine specifically for benchmarking. There is no such thing as a quick-and-dirty benchmark, especially when the results are not categorically different (i.e., 1 minute vs 10 minutes vs 1 hour).

Let’s compare the stability of multiple runs with turbo vs without.

[charts: run-to-run stddev with turbo vs no-turbo, for the cpu-jitter and fbsalloc benchmarks]

That’s a stddev difference of up to 768X !!!

No-turbo means precisely this:

cpu_available_frequencies

/sys/devices/system/cpu/cpu0:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq: 800000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: 4200000
/sys/devices/system/cpu/cpu1:
/sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq: 800000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq: 4200000
/sys/devices/system/cpu/cpu2:
/sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq: 800000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq: 4200000
/sys/devices/system/cpu/cpu3:
/sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq: 800000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq: 4200000
/sys/devices/system/cpu/cpu4:
/sys/devices/system/cpu/cpu4/cpufreq/scaling_min_freq: 800000
/sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq: 4200000
/sys/devices/system/cpu/cpu5:
/sys/devices/system/cpu/cpu5/cpufreq/scaling_min_freq: 800000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq: 4200000
/sys/devices/system/cpu/cpu6:
/sys/devices/system/cpu/cpu6/cpufreq/scaling_min_freq: 800000
/sys/devices/system/cpu/cpu6/cpufreq/scaling_max_freq: 4200000
/sys/devices/system/cpu/cpu7:
/sys/devices/system/cpu/cpu7/cpufreq/scaling_min_freq: 800000
/sys/devices/system/cpu/cpu7/cpufreq/scaling_max_freq: 4200000

SMF

smf just got ~40% faster for large buffers thanks to these benchmarks; if you give it a shot, let me know. Stay tuned for the Java and Go code generators and performance benchmarks coming soon.


Let me know if you found this useful, spot any misinformation, or have additional performance tuning tips on Twitter @emaxerrno.




Special thanks to my partner Sarah Rohrbach, as well as Chris Heller and Noah Watkins, for reading earlier drafts of this post.