Introduction To Low Latency Programming: External Processing

This post originally appears as a chapter in my new book: ‘Introduction To Low Latency Programming’, a short and approachable entry into the subject. Available now for purchase on Amazon.

After heavily scrutinizing your program scope you should be left with some functional requirements that are absolutely mandatory, as well as some latency goals you are aiming for. From here, you can start thinking about which pieces of the program could potentially be extracted from the main execution path, to keep it as fast as possible. For our purposes, the main execution path or task is the process, system, or function that is measured by our defined metrics. Here are some questions you can ask yourself to prepare for this process:

  • What is known before the program even runs?
  • Can we use this information to ‘do’ some steps before runtime?
  • Does any part of the runtime process need to be done ‘real time’?
  • Can any of the runtime steps be isolated into a task that can be packaged into a cohesive unit?
  • What is the potential cost of communicating with the main task?

Pre-processing

When we are discussing ‘low latency’ we are almost always referring to latency at runtime. Therefore, if we can remove segments of the program functionality and execute them before we enter our low-latency execution, we effectively get them done for ‘free’.

If a lot of the information used as input to the runtime computation is available to you before the process is even started, you can attempt to perform some parts of the computation before the program is run. The following subsections discuss some common approaches.

Configuration/Input Files + Loading On Start Up

One approach is to generate configuration/input files and supply them to the process at startup. This is very simple from an operational and code perspective. You can create a separate process, or group of processes, that reads from any number of inputs, performs the necessary computations and then outputs the results into a file. This file is then read by the low-latency process at start up. Steps:

  1. Run external process that completes and generates a file.
  2. Start the low-latency task/runtime, supplying the file as input.
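
A minimal sketch of step 2, assuming the external process wrote a hypothetical precomputed_weights.csv file of key,value pairs (the file name and format are invented for this example):

#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>

// Load precomputed key,value pairs once, before entering the low-latency phase.
std::unordered_map<std::string, double> loadPrecomputed(const std::string& path) {
  std::unordered_map<std::string, double> result;
  std::ifstream file(path);
  std::string line;
  while (std::getline(file, line)) {
    std::stringstream row(line);
    std::string key, value;
    if (std::getline(row, key, ',') && std::getline(row, value, ',')) {
      result[key] = std::stod(value);
    }
  }
  return result;
}

int main(int argc, char** argv) {
  // Step 2: supply the generated file as input at start up.
  const auto weights = loadPrecomputed(argc > 1 ? argv[1] : "precomputed_weights.csv");
  // ... enter the low-latency runtime loop using weights ...
}

All of the parsing cost is paid here, before the low-latency phase begins.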

Additionally, if you don’t need the preparation functionality to be decoupled from the process, you can have the main process itself do this computation before it enters the low-latency execution phase. This is simple and has an even lower operational overhead, as the preparation code lives in the same code ‘space’ as the low-latency code. Warming process steps:

  1. Warming up: read various database tables into memory, calculate some weights, etc.
  2. Enter low-latency runtime loop.

Compile Time Processing

Another option is ‘hard-coding’ computation into the executable with compile-time programming options: code generation, templates, and ‘constexpr’ functions, methods and variables. These approaches can help reduce instruction count/cost and branching, both key goals of low latency programming (discussed later in the book). While they can provide ‘the fastest’ runtime, they may also be difficult to use in comparison to runtime options. However, they do allow us to squeeze out maximum performance and are often worth pursuing. They are ‘external’ in the sense that they don’t occur during our low-latency runtime operations, because they don’t occur at runtime at all. The removal decision is, in some sense, a type of processing.

Code Generation

Code generation takes on many forms. At its core, it is the creation of programs that generate code as output. The inputs provided to the code generation program determine the functionality included in the final code. This allows us to remove ‘work’ and decision-making from the final runtime and perform these ‘steps’ before or at compile time. The code generator does not have to meet your low latency goals itself, as it is the final runtime process performance that actually matters. Here is a sample code generation script execution:

python3 generateCode.py --input inputs.json --output output.cpp

The inputs.json file will dictate what will be included in the final code within the output.cpp file.
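
As a hypothetical illustration of what the generated file could contain (the instrument names and values here are invented for this sketch), output.cpp might look like:

// output.cpp -- generated code; the decision of which instruments to support
// was made by the generator from inputs.json, not at runtime.
#include <cstdint>

enum class Instrument : std::uint8_t { FutureA, FutureB };

inline double tickSize(Instrument instrument) {
  switch (instrument) {
    case Instrument::FutureA: return 0.25; // value taken from inputs.json
    case Instrument::FutureB: return 0.01; // value taken from inputs.json
  }
  return 0.0;
}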

Compile Flags and Templates

You can build programs that use C++ template arguments for specialization: both for types and values. Then, at compile time you can supply flags that will select the correct specialization. Everything unrelated to your supplied options will not be included in the runtime. The following is a super simple example:

#ifdef BUYER
  using OrderManager = BuyOrderManager;
#else
  using OrderManager = SellOrderManager;
#endif
System<OrderManager> orderManager;

The System class takes OrderManager as a template argument and passes it down to other child members. The value of this template argument changes the runtime behavior at compile time, effectively doing the ‘decision-making’ before the program is executed. Of course, we need to pass in the BUYER flag during compilation:

g++ -o trade_app trade_app.cpp -DBUYER
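
Values work the same way as types. As a sketch (MAX_ORDERS is a hypothetical flag, not part of the example above), a value template argument can bake a capacity into the build so no sizing decision is made at runtime:

#include <cstddef>

struct Order { /* ... */ };

template<std::size_t MaxOrders>
class OrderBook {
  Order _orders[MaxOrders]; // capacity fixed at compile time
  std::size_t _count = 0;
public:
  static constexpr std::size_t capacity() { return MaxOrders; }
};

// Supplied at compile time, e.g. g++ ... -DMAX_ORDERS=4096
#ifndef MAX_ORDERS
#define MAX_ORDERS 1024
#endif

OrderBook<MAX_ORDERS> orderBook;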

Leveraging ‘constexpr’ Utilities

Modern C++ (since C++11) has given us constexpr, which allows marking variables, functions, classes and methods with a hint that expressions can be resolved at compile time. Here are the ‘rules’, illustrated in the sketch after the list:

  • A constexpr variable must be initialized as a constexpr type with a constexpr function. These must be constructed/called with either literals or other constexpr variables as arguments.
  • A constexpr function can only take and return constexpr types. The key thing to know is that arguments can be accepted as constexpr, but not passed on as constexpr: once inside the function you cannot guarantee the argument values are known at compile time, even though they might be.
  • A constexpr type/class can only have constexpr type members and must have at least one constexpr constructor. All scalar types and arrays are considered constexpr.
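
A small sketch touching each of these rules (assuming C++14 or later for the multi-statement constexpr function):

// A literal type with a constexpr constructor (rule 3).
struct Point {
  int x, y;
  constexpr Point(int xIn, int yIn) : x(xIn), y(yIn) {}
};

// A constexpr function (rule 2): usable at compile time when its arguments
// are themselves compile-time constants.
constexpr int manhattanDistance(Point a, Point b) {
  const int dx = a.x > b.x ? a.x - b.x : b.x - a.x;
  const int dy = a.y > b.y ? a.y - b.y : b.y - a.y;
  return dx + dy;
}

// A constexpr variable (rule 1): initialized from literals and constexpr calls.
constexpr Point origin{0, 0};
constexpr int distance = manhattanDistance(origin, Point{3, 4}); // resolved at compile time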

Even with this constrained set of functionality, we can do some pretty fancy stuff. All the way from specializing function bodies using template parameters to parsing JSON at compile time. The rabbit hole goes deep. In a way it is a form of ‘code generation’ included in C++. Other languages have other forms of compile time pre-processing.

Using constexpr if and a template argument to specialize a function body:


template<bool DoOne>
void process() {
  if constexpr (DoOne) {
    // ...
  } else {
    // ...
  }
}

Calculating hash at compile time:

template<size_t Length>
constexpr size_t hash(const char(&str)[Length]) {
  size_t result = 0;
  for (size_t i = 0; i < (Length - 1); i++) {
    result ^= str[i] << (i % 8);
  }
  return result;
}

// ...
constexpr auto hashValue = hash("Hello, World!");

Every subsequent version of C++ has significantly boosted the functionality provided by constexpr. Make as much of your code and static data constexpr as possible. One thing worth noting: the benefit of constexpr is not just the reduction of instructions. You could achieve that with your own precalculation and storage of static data in an executable written in assembly. This, however, would be incredibly tedious. constexpr is powerful because it treats compile-time resolved data just as it does code: something that can be understood by the programmer and updated as objectives/requirements evolve. That is the true power of the constexpr ‘contract’.
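
As a small example of constexpr static data (assuming C++17 or later for the std::array manipulation), here is a lookup table computed entirely at compile time and baked into the binary:

#include <array>
#include <cstddef>

// Build a small table of powers of two at compile time.
constexpr std::array<std::size_t, 16> makePowersOfTwo() {
  std::array<std::size_t, 16> table{};
  std::size_t value = 1;
  for (std::size_t i = 0; i < table.size(); ++i) {
    table[i] = value;
    value *= 2;
  }
  return table;
}

// No runtime work: the table lives in the binary as static data.
constexpr auto kPowersOfTwo = makePowersOfTwo();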

External Parallel Processing

Sometimes information is not known before runtime; it requires input from data received during runtime or is updated throughout the day from external sources or computation. This section will discuss a few ways that external processing can be done when the program is already running.

Separate Process

Our first option is to have external processes that perform the extracted tasks. This is usually the case for data that does not require inputs from the main process itself; perhaps it is the output of some grid-computing batch job or incoming network data. In any case, after the external process is done, it needs a way to pass this data into the already-running main process. The choice of communication method depends on the data size and the latency requirement: a change that occurs a few times a day calls for a very different approach than a message that is sent every 10 microseconds.

If this communication rarely happens we can choose a simple method with low operational overhead, such as writing the results to a file and signalling the already-running process to read the file. This would be useful for something like a configuration file change or loading a new set of machine learning model weights. SIGUSR1 and SIGUSR2 are the signals reserved for the program’s own use. The safest way to do this is to register simple flag-setting handler functions at startup, which will be called when the signal is received. Handler complexity should be kept to a minimum:

#include <chrono>
#include <csignal>
#include <iostream>
#include <thread>

// Global flag to indicate if a reload is needed.
// volatile sig_atomic_t is safe to write from a signal handler.
volatile std::sig_atomic_t performReload = 0;

// Signal handler: only set the flag; the actual reload happens in the main loop
void handler(int) {
  performReload = 1;
}

int main() {
  // Register signal handler for SIGUSR1
  std::signal(SIGUSR1, handler);

  // Enter main loop
  while (true) {
    if (performReload) {
      std::cout << "Reloading configuration..." << std::endl;
      performReload = 0;
    }
    // Actual work
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  return 0;
}

However, if this communication stream is almost constant, we would have to select a different option. For low latency programs, the common choice is shared memory with atomic access operations. Shared memory leverages operating system constructs that allow two different processes to access the same block of memory; something forbidden by default. Atomic operations are instructions that guarantee indivisible reads and updates of memory. In simpler terms: you are able to assume that no other thread/process can read or write the variable while you are reading or writing it. The other accessor will either get the old version or the new version, depending on instruction scheduling and/or the memory ordering chosen. Without atomic operations you might read memory in an in-between state. There are multiple levels of guarantee:

  • Relaxed
  • Consume
  • Release
  • Acquire
  • Acquire/Release
  • Sequentially Consistent

You can read more about them in the documentation for std::memory_order. In general, as you move down the list, the guarantees about how memory is written/read while other threads/processes do the same get stronger and stronger. Really think about how your memory will be accessed and what the reader/writer actually needs guaranteed in order to fulfill their objectives. Keep in mind: stronger guarantees cost more time.
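
As a rough sketch of the trade-off (independent of the queue example below): a producer publishing data guarded by a flag typically only needs release on the store and acquire on the load, rather than the sequentially consistent default:

#include <atomic>

std::atomic<bool> dataReady{false};
int payload = 0;

void producer() {
  payload = 42;                                     // plain write
  dataReady.store(true, std::memory_order_release); // publish: writes above cannot move below this store
}

void consumer() {
  while (!dataReady.load(std::memory_order_acquire)) { // reads below cannot move above this load
    // spin until the flag is set
  }
  int value = payload; // safe: the acquire load synchronizes with the release store
  (void)value;
}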

Once you have two processes accessing the same memory space, you can design data structures that enable communication between the two. The favorite choice for low latency applications is a ring buffer implemented as a contiguous array in memory. The API/design differs based on the access scenario, mainly the number of simultaneous consumers or producers. A single producer/consumer queue is very simple to implement. The following example is not optimized, but captures the general idea. Try to walk through how the reader and writer would each access the memory with atomic and non-atomic instructions. Think about how the access indices interact and why they begin with an offset from one another. After reading about std::memory_order, which guarantees do we need for which operations? By default, atomic operations use std::memory_order_seq_cst, the strictest guarantee.

A simple single producer, single consumer wait-free queue:

#include <atomic>
#include <cstddef>
#include <utility>

template<typename T, std::size_t N>
class WaitFreeQueue {
  T _data[N];
  // The sequences start offset by one so that an empty queue and a full queue
  // are distinguishable (one slot is always left unused).
  std::atomic<std::size_t> _readSequence{0};
  std::atomic<std::size_t> _writeSequence{1};
public:
  WaitFreeQueue() = default;

  // Called by the single producer.
  bool tryWrite(T value) {
    const auto nextWriteIndex = _writeSequence % N;
    const auto currentReadIndex = _readSequence % N;
    const bool noRoomLeft = (nextWriteIndex == currentReadIndex);
    if (noRoomLeft) {
      return false;
    }
    _data[nextWriteIndex] = std::move(value);
    _writeSequence.store(nextWriteIndex + 1);
    return true;
  }

  // Called by the single consumer; returns nullptr when there is nothing to read.
  T* tryRead() {
    const auto nextReadIndex = (_readSequence + 1) % N;
    const auto nextWriteIndex = _writeSequence % N;
    const bool noNewData = (nextReadIndex == nextWriteIndex);
    if (noNewData) {
      return nullptr;
    }
    _readSequence.store(nextReadIndex);
    return &_data[nextReadIndex];
  }
};

To use this across processes, you would create a shared memory segment and construct this queue in the memory space in one process using placement new: new (shared_memory_ptr) WaitFreeQueue<int, 100>();. The other processes would simply interpret the shared memory pointer as the queue and use it.
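
A rough POSIX sketch of that setup, with error handling omitted and "/queue_shm" as an assumed segment name:

#include <fcntl.h>    // shm_open
#include <sys/mman.h> // mmap
#include <unistd.h>   // ftruncate
#include <new>        // placement new

using Queue = WaitFreeQueue<int, 100>;

// In the process that creates the queue.
Queue* createSharedQueue() {
  const int fd = shm_open("/queue_shm", O_CREAT | O_RDWR, 0600);
  ftruncate(fd, sizeof(Queue));
  void* shared_memory_ptr = mmap(nullptr, sizeof(Queue),
                                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  return new (shared_memory_ptr) Queue();
}

// In the other process: map the same name and interpret the pointer as the queue.
Queue* attachSharedQueue() {
  const int fd = shm_open("/queue_shm", O_RDWR, 0600);
  void* shared_memory_ptr = mmap(nullptr, sizeof(Queue),
                                 PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  return static_cast<Queue*>(shared_memory_ptr);
}

On some systems you may need to link with -lrt for shm_open.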

If you try to extract the address pointer out of one process and use the raw value in another, your second process will be crashed by the operating system: it would be trying to access memory outside its virtual memory space, which is forbidden.

Separate Thread

Sometimes this external computation requires data that the main process provides during runtime. In that case it might be more ergonomic to perform the computation in the same process but on a different thread. For the most part the performance distinction between processes and threads can be blurry; on many operating systems they behave much the same once you attach to the same shared memory region. The decision is more about the level of coupling you want between the foreground and background task runners. In this section we will be discussing a background task that is: 1. critical to the runtime of the foreground task, and 2. in constant back-and-forth communication with the foreground task. Therefore, it is probably better to use a thread for this job rather than a separate process. This can be summarized as follows:

  • Use a process if you want the background task to run uncoupled from the foreground task, i.e. you are okay with each task running separately.
  • Use a thread if you want the background task to be coupled with the foreground task, i.e. you need both tasks to be running constantly.

One thing to note that applies to both separate-thread and separate-process asynchronous tasks: the communication mechanism must cost the foreground task less than just performing the work itself. Otherwise, what is the point of doing it in the background?

Just as with separate processes, work handed off to a separate thread will ideally be coordinated using atomic variables, as they allow for ‘wait-free’ operations: you will not be waiting for a non-deterministic amount of time/CPU clock cycles. The same data structures/instructions you use with a separate process can be used across threads.
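
A minimal sketch of that coupling, reusing the WaitFreeQueue from the previous section across two threads (the ‘work’ here is just a placeholder):

#include <thread>

WaitFreeQueue<int, 1024> requests;  // foreground -> background
WaitFreeQueue<int, 1024> responses; // background -> foreground

void backgroundWorker() {
  while (true) {
    if (int* request = requests.tryRead()) {
      // The expensive computation happens off the critical path.
      responses.tryWrite(*request * 2); // placeholder for real work
    }
  }
}

int main() {
  std::thread worker(backgroundWorker);
  while (true) {
    requests.tryWrite(42); // hand off work without blocking
    if (int* response = responses.tryRead()) {
      // Consume the result whenever it is ready; the foreground never waits.
      (void)*response;
    }
    // ... rest of the low-latency loop ...
  }
  worker.join(); // unreachable in this sketch, shown for completeness
}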

Chapter Summary

  • You can move expensive computation outside your program’s critical path to help reduce latency.
  • These computations can be done at compile time, at start-up, or in an external thread or process.
  • If the processing is done externally, ensure you select the best communication option for the job.
