Imagine you were trying to write a very fast Write Ahead Log

…like - dunno - this: smf

…well you pack up, use it and go home.

Unless, of course, you are still waiting for your build system to finish and then you start digging deeper.

I wrote a wal_segment, which basically lets you write to a file in append-only mode. Before it, I couldn’t control the flush rate to the disk. That is, with O_DIRECT you can only flush page-aligned memory buffers, and seastar::file_output_stream() does not allow you to flush unfinished pages.

You deal with that by zeroing out the tail of the unfinished page, writing the full page, and then calling truncate to set the file to its correct size - minus all the zeros you wrote. I know the zeros are not strictly necessary, but I wanted to make the reads safer - more on that later.
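As a rough illustration of that zero-pad, write, truncate sequence, here is a minimal sketch using plain POSIX calls instead of seastar. The function name `flush_partial_page` and the 4096-byte `kPageSize` are assumptions made up for this example, not part of smf or seastar. The caller is expected to open the file (with O_DIRECT where the filesystem supports it, or without it for experimenting).

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <stdexcept>

// Assumed page size / DMA alignment for this sketch.
constexpr std::size_t kPageSize = 4096;

// Hypothetical helper: writes `len` bytes of an unfinished page as one full,
// zero-padded page at `page_offset`, then truncates the padding away.
void flush_partial_page(int fd, off_t page_offset, const char* data, std::size_t len) {
    assert(len <= kPageSize);
    assert(static_cast<std::size_t>(page_offset) % kPageSize == 0);

    // Page-aligned buffer, as O_DIRECT style writes would require.
    void* raw = nullptr;
    if (posix_memalign(&raw, kPageSize, kPageSize) != 0) {
        throw std::runtime_error("posix_memalign failed");
    }
    char* page = static_cast<char*>(raw);
    std::memset(page, 0, kPageSize);   // zero the whole page, including the tail
    std::memcpy(page, data, len);      // copy the real payload to the front

    const ssize_t written = pwrite(fd, page, kPageSize, page_offset);
    std::free(page);
    if (written != static_cast<ssize_t>(kPageSize)) {
        throw std::runtime_error("short or failed pwrite");
    }

    // Set the file size to the real data length, dropping the zero padding.
    if (ftruncate(fd, page_offset + static_cast<off_t>(len)) != 0) {
        throw std::runtime_error("ftruncate failed");
    }
}
```

Calling `flush_partial_page(fd, 0, "hello", 5)` on an empty file should leave a 5-byte file, even though a full page was written underneath.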

However, you are in a world where concurrency is the norm (as well as parallelism, but let’s focus on the structure - concurrency - not the simultaneous execution - parallelism), and right before you close the file, you want to flush all the remaining pages as fast as possible.

I couldn’t find any information on the web that specifically answered this question:

Can you write multiple pages to the same file handle at the same time (dispatched at the same time), where the writes need not be sequential?

That is, for pages 1,2,3,4 - can you write them in the order 4,3,2,1 (worst case scenario)?

The answer for my SSD (INTEL SSDSC2BP48) with XFS is yes.

What follows are the tests I wrote to prove it with my favorite systems framework, seastar.

The strategy below is as follows:

  • Write 10 pages
  • Allow at most 4 concurrent page writes
  • Always skip the first page - proving the point explicitly, though it is already implied by the concurrent execution.
  • Make sure that you can read the file on the file system with `less` afterwards

This is what it looks like when you `less` the file on your terminal at page boundaries:

Page boundary of the skipped page
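The strategy above can be approximated without seastar. The following is a minimal sketch under several assumptions: it uses `std::thread` workers and buffered `pwrite` rather than seastar's DMA writes and semaphore, and `kPageSize`, `kPages`, `kMaxInFlight`, `make_page` and `write_pages_concurrently` are names invented for the example. Four workers pull page numbers from a shared counter in reverse order, so at most 4 writes are in flight and page 0 is never written.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kPageSize = 4096;  // assumed page size
constexpr int kPages = 10;               // pages 0..9, page 0 is skipped
constexpr int kMaxInFlight = 4;          // upper bound on concurrent writes

// Fills a page with a repeated digit so page boundaries are easy to spot in `less`.
static std::vector<char> make_page(int page_no) {
    std::vector<char> page(kPageSize, static_cast<char>('0' + page_no % 10));
    page.back() = '\n';
    return page;
}

// Writes pages kPages-1 down to 1 through the same file descriptor.
// Returns the number of pages that were written completely.
int write_pages_concurrently(int fd) {
    std::atomic<int> next{kPages - 1};
    std::atomic<int> ok{0};

    auto worker = [&] {
        for (;;) {
            const int page_no = next.fetch_sub(1);
            if (page_no < 1) {
                return;  // page 0 is deliberately never written
            }
            const std::vector<char> page = make_page(page_no);
            const off_t offset = static_cast<off_t>(page_no * kPageSize);
            const ssize_t n = pwrite(fd, page.data(), page.size(), offset);
            if (n == static_cast<ssize_t>(page.size())) {
                ok.fetch_add(1);
            }
        }
    };

    // A fixed pool of kMaxInFlight workers bounds the number of in-flight writes.
    std::vector<std::thread> pool;
    for (int i = 0; i < kMaxInFlight; ++i) {
        pool.emplace_back(worker);
    }
    for (auto& t : pool) {
        t.join();
    }
    return ok.load();
}
```

After a successful run the file is `kPages * kPageSize` bytes long, the skipped first page reads back as zeros, and every other page holds its own digit regardless of the order in which the writes completed. A fixed worker pool is only one way to cap the concurrency; a semaphore around each dispatched write would serve the same purpose.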

A follow-up to this will cover how to safely dispatch half-written pages and how to distribute the lock/semaphore contention. Spoiler alert: jump_consistent_hash()

Comparing Lemire's fastrange.h vs Google's jump-consistent-hashing

This idea came from a set of unit tests I was writing to prove that my wal_segment behaved - at a high level - like the seastar::make_file_output_stream() handle that seastar provides.

The exciting news is that you should expect an update with this concurrency primitive turned on for smf in the next month or so.

Let me know if you found this useful!

Join the smf mailing list.

Appendix

The full integration test.