Introduction To Low Latency Programming: Understand Storage
This post originally appeared as a chapter in my new book: ‘Introduction To Low Latency Programming’, a short and approachable entry into the subject. Available now for purchase on Amazon.
So far we have focused on the computation aspect of low latency programming. However, optimized data access plays a huge role in this domain: no process is computation only; it needs data to compute on. This chapter discusses the key ideas to keep in mind when optimizing reads from and writes to storage.
Be Aware Of Storage Performance And Access Costs
I am using the word ‘storage’ quite broadly in this chapter, referring to persistent storage, temporary storage, local storage and network storage. From our perspective, we mainly care about ‘time to access’ rather than durability, geographic location or any other ‘implementation’ detail.
It is very important to know the rough time it takes to access different storage sinks. In general, this time grows by orders of magnitude as the physical distance from the CPU core increases. The following list is ordered from closest to furthest for a theoretical CPU core, with the rough ‘time to access’ expressed as a multiplier of register access time:
- Registers: 1x
- L1 cache: 5x
- L2 cache: 10x
- L3 cache: 50x
- Same-NUMA memory: 100x
- Different-NUMA memory: 300x
- Persistent Storage (SSD/HDD/Tape): 10000x minimum
- Local Data Center: 200000x
- Distant Data Center: 100000000x
- The Moon: 1000000000000x
Our closest example storage type is the processor-local set of registers. These are the fastest locations the processor can access, and they serve as the inputs and outputs of its computations. A program typically operates by loading data from main memory into registers, performing instructions that leave their results in registers, and finally writing those register values back to main memory. Our code is turned into instructions that work with registers. The various levels of cache are optimizations for accessing main memory.
The furthest example storage type is a theoretical network drive inside a computer on the Moon. If a program wants to read from or write to it, it has to issue the command over a link spanning 384,400 kilometers. Until we develop spacetime wormholes, greater physical distance translates directly into higher communication latency.
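As a rough sanity check on that last multiplier, consider the speed-of-light limit alone:

one-way delay ≈ 384,400 km ÷ 299,792 km/s ≈ 1.28 s

A request/response round trip therefore costs at least roughly 2.6 seconds before any hardware or protocol overhead, whereas a register access is measured in fractions of a nanosecond.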
Keep this list in mind when designing your programs. For low-latency applications, you will ideally never go beyond ‘same-NUMA memory’ once initial start-up is complete. NUMA (Non-Uniform Memory Access) is a multiprocessing architecture feature that assigns memory to groups of CPU cores (NUMA nodes) in an effort to improve overall memory performance. The key takeaway: try to keep your data as close to the CPU as possible. Of course, if your program has data persistence requirements, you will factor that into your design.
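If you need to be explicit about placement, Linux exposes this through libnuma. Below is a minimal sketch, assuming a Linux machine with libnuma installed (link with -lnuma); the node index and buffer size are arbitrary values chosen for illustration:

#include <cstddef>
#include <cstdio>
#include <numa.h>

int main()
{
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    constexpr int node = 0;
    numa_run_on_node(node); // keep this thread on cores belonging to node 0

    constexpr std::size_t bufferSize = 1 << 20; // 1 MiB working buffer
    auto* buffer = static_cast<char*>(numa_alloc_onnode(bufferSize, node));

    // ... the hot path works only on 'buffer', which lives on the same node ...

    numa_free(buffer, bufferSize);
    return 0;
}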
Caching
With most modern architectures, contiguous memory is loaded together as a group: when a memory address is read, the surrounding data is pulled into the various cache layers as well. This group of data is called a ‘cache line’. In C++, the std::hardware_destructive_interference_size value gives you a platform-specific hint for the cache line size.
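As a quick check, you can print this value yourself (it lives in the <new> header from C++17 onward, assuming your standard library implements it; 64 bytes is a typical result on x86-64):

#include <iostream>
#include <new>

int main()
{
    // Commonly 64 on mainstream x86-64 CPUs, 128 on some ARM designs.
    std::cout << "destructive interference size: "
              << std::hardware_destructive_interference_size << " bytes\n";
}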
Keeping cache mechanics in mind is important as you design your algorithms and data structures. Factoring them into your implementation can greatly increase performance and therefore reduce latency. In general, to keep a portion of code running as fast as possible, have it access as small a range of contiguous memory as possible.
Put data that will be accessed as ‘a unit’ as close together as possible, ideally within one cache line for maximum performance. There are still big benefits to using contiguous memory beyond that; for example, adjacent cache lines are often prefetched. Generally, prefer data structures that are as flat as possible (laid out in contiguous memory), like static or dynamic arrays; they should be the default choice. Additionally, design your algorithms to fully process a block of contiguous memory before moving on (avoid going back). Sometimes this simply means rearranging the order of some loops, as in this simple matrix multiplication function:
Before:
#include <vector>

using Vector = std::vector<float>;
using TwoDimMatrix = std::vector<Vector>;

TwoDimMatrix matMul(const TwoDimMatrix& m1, const TwoDimMatrix& m2)
{
    const auto sharedDim = m1.front().size();
    TwoDimMatrix result(m1.size(), Vector(m2.front().size()));
    // Iterate over rows of m1 (rows of the result)
    for (size_t i = 0; i < m1.size(); ++i) {
        // Iterate over columns of m2 (columns of the result)
        for (size_t j = 0; j < m2.front().size(); ++j) {
            // Walk the m1 row and the m2 column together
            for (size_t k = 0; k < sharedDim; ++k) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
    return result;
}
After:
TwoDimMatrix matMul(const TwoDimMatrix& m1, const TwoDimMatrix& m2)
{
    const auto sharedDim = m1.front().size();
    TwoDimMatrix result(m1.size(), Vector(m2.front().size()));
    // Iterate over rows of m1 (rows of the result)
    for (size_t i = 0; i < m1.size(); ++i) {
        // Walk the m1 row; k also selects a row of m2
        for (size_t k = 0; k < sharedDim; ++k) {
            // Iterate over the cells of m2's row k (columns of the result)
            for (size_t j = 0; j < m2.front().size(); ++j) {
                result[i][j] += m1[i][k] * m2[k][j];
            }
        }
    }
    return result;
}
All we really did was swap the k and j loop nesting order. This change can yield up to a 10x speed-up, because iterating down a column is much more expensive than iterating along a row. Why? The row is contiguous in memory, whereas the column is not: every time we access the next value in a column we jump to a relatively ‘distant’ part of memory in a different row’s allocation.
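Taking the advice about flat data structures one step further: a vector of vectors scatters each row in a separate allocation, so even the ‘after’ version still chases pointers between rows. A common alternative is to store the whole matrix in one contiguous, row-major buffer. The FlatMatrix type below is a minimal sketch of that idea (the name and interface are illustrative, not from the book):

#include <cstddef>
#include <vector>

// Row-major matrix stored in a single contiguous allocation.
struct FlatMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<float> data; // rows * cols elements, one row after another

    FlatMatrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    float& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    const float& operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};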
Another thing to keep in mind is ‘false sharing’. If you are using atomic instructions on two memory locations that sit within a single cache line, you can inadvertently slow down your program, because the CPU will lock the whole cache line against other cores. This is unavoidable, and desired, when two cores are actually accessing the same memory location; but if they merely happen to be accessing nearby, distinct locations, we are paying that price for nothing. To mitigate this, make sure the two memory addresses land on separate cache lines using the C++ alignas keyword:
#include <atomic>
#include <new> // for std::hardware_destructive_interference_size
struct DoNotShare {
    // Each counter gets its own cache line, so they never falsely share one.
    alignas(std::hardware_destructive_interference_size) std::atomic<int> one;
    alignas(std::hardware_destructive_interference_size) std::atomic<int> two;
};
This prevents false sharing by aligning each member to a boundary of std::hardware_destructive_interference_size bytes, which places the two variables on separate cache lines. Keep this in mind whenever you use atomic instructions on nearby memory addresses.
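To see the effect, picture two threads that each hammer their own counter. The sketch below (the iteration count is arbitrary) reuses the same struct; without the alignas padding, both counters would typically share one cache line and the threads would contend even though they never touch each other's data:

#include <atomic>
#include <new> // std::hardware_destructive_interference_size
#include <thread>

struct DoNotShare { // same struct as above
    alignas(std::hardware_destructive_interference_size) std::atomic<int> one{0};
    alignas(std::hardware_destructive_interference_size) std::atomic<int> two{0};
};

int main()
{
    DoNotShare counters;

    // Each thread only ever touches its own counter; thanks to the alignment,
    // the counters live on different cache lines and do not interfere.
    std::thread t1([&] {
        for (int i = 0; i < 10'000'000; ++i)
            counters.one.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (int i = 0; i < 10'000'000; ++i)
            counters.two.fetch_add(1, std::memory_order_relaxed);
    });

    t1.join();
    t2.join();
}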
Chapter Summary
- Know the rough time taken to access different storage sinks, from registers and same-NUMA memory all the way to a computer sending radio signals from the Moon.
- Understand the cache characteristics of your hardware.
- Keep in mind that measurement is key; educated assumptions can be useful early in the design process, but to draw conclusions you must measure performance yourself.