C++ Iterator-Friendly Branchless Binary Search

Once you delve into the realm of low-latency C++, you will find yourself waking up in the middle of the night, sweating profusely from a nightmare about unnecessary branching. And soon after, you begin to over-optimize your code to avoid branches, even when that part of your code is clearly not the bottleneck.

This is a short post presenting a C++ iterator-friendly implementation of a branchless binary search. It is short and sweet, so I will show the code before some editorial comments and thanks:

#include <iterator>  // std::distance, std::iterator_traits

// Requires random-access iterators (e.g. vector/array iterators or raw pointers).
template<typename It>
It lower_bound(It begin, It end,
               const typename std::iterator_traits<It>::value_type& value) {
  auto len = std::distance(begin, end);
  if (len == 0) {
    return end;
  }

  // Halve the search space until a single candidate element remains.
  while (len > 1) {
    const auto half = len / 2;
    // Branchless step: advance begin by half only if the midpoint is < value.
    begin += (*(begin + half - 1) < value) * half;
    len -= half;
  }
  // Conform to std::lower_bound: return end if no element is >= value.
  return (*begin < value) ? end : begin;
}

This implementation tracks the search space with a begin iterator and a len integer, rather than with begin and end pointers/indices. The first lines of the function simply use the provided iterators to derive those variables, and we return early if the range is empty. All the branches of a classic binary search loop body have been removed:

  • Shrinking the search space length without branches is easy; it always shrinks by half of the current length. We can do this safely by subtracting half from the remaining len (which eventually leaves a range of size 1).
  • Calculating the new start of the search space is trickier. It will either stay at the current begin, or move to the halfway point of the current range. So we conditionally advance begin by half, multiplying the comparison result (0 or 1) by half instead of branching.
  • To conform to the std::lower_bound interface, we must return the end iterator if no element in the range is greater than or equal to value. Unfortunately, this requires one final conditional at the end of the function. Thanks to Farhid Mehrabi for pointing this bug out in my initial implementation (it could return an element that is less than the value, breaking the lower-bound contract). Adding a simple unit test would have caught this, so this is a reminder to myself to write them even for small pieces of code; a minimal check of that sort is sketched below.
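
As an illustration, here is a minimal sanity check of the kind that would have caught the bug. It assumes the lower_bound template above is in scope; the container and values are made up for the example.

#include <algorithm>
#include <cassert>
#include <vector>

int main() {
  const std::vector<int> v{1, 3, 5, 7, 9};

  // Qualified with :: so ADL does not pull in std::lower_bound instead.
  assert(::lower_bound(v.begin(), v.end(), 5) == v.begin() + 2);  // exact hit
  assert(::lower_bound(v.begin(), v.end(), 4) == v.begin() + 2);  // between elements
  assert(::lower_bound(v.begin(), v.end(), 0) == v.begin());      // before the range

  // The case that was broken in the initial version: a value larger than
  // every element must yield end(), not an iterator to a smaller element.
  assert(::lower_bound(v.begin(), v.end(), 10) == v.end());
}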

For the most part, this implementation will have superior performance to std::lower_bound. There is a caveat: since the conditional advance of begin will most likely be compiled to a conditional move (CMOV on x86, CSEL on ARM) rather than a proper branch, the CPU cannot speculate ahead and preload the next mid-points, so at larger array sizes the classic branchy approach will prevail. This can potentially be mitigated on some platforms by adding an explicit prefetch instruction.
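
As a sketch of that mitigation: both possible mid-points of the next iteration are already known before the comparison resolves, so they can be prefetched each round. __builtin_prefetch is the GCC/Clang builtin; the lower_bound_prefetch name and the exact prefetch placement are my own assumptions, not something benchmarked in this post.

#include <iterator>

template<typename It>
It lower_bound_prefetch(It begin, It end,
                        const typename std::iterator_traits<It>::value_type& value) {
  auto len = std::distance(begin, end);
  if (len == 0) {
    return end;
  }

  while (len > 1) {
    const auto half = len / 2;
    len -= half;
    if (len > 1) {
      // Both candidate mid-points of the next iteration are known already,
      // so hint the cache before the conditional move resolves.
      __builtin_prefetch(&*(begin + len / 2 - 1));
      __builtin_prefetch(&*(begin + half + len / 2 - 1));
    }
    begin += (*(begin + half - 1) < value) * half;
  }
  return (*begin < value) ? end : begin;
}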

Thanks to this amazing article for outlining this concept using raw arrays.

Benchmarks

Total nanoseconds taken to find 1000 random numbers that are present in the array, measured on a MacBook Air (M1). Code.
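
The linked code is the authoritative benchmark; purely as a sketch of the methodology, a minimal timing harness might look like the following. The time_ns helper, the array sizes, the RNG seed and the value distribution are assumptions of mine, and the lower_bound template from above is assumed to be in scope.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

// Returns the nanoseconds `search` needs to look up every value in `queries`.
template<typename Search>
std::int64_t time_ns(const std::vector<int>& data,
                     const std::vector<int>& queries, Search search) {
  const auto start = std::chrono::steady_clock::now();
  std::size_t hits = 0;  // accumulate results so the calls are not optimized away
  for (int q : queries) {
    hits += (search(data.begin(), data.end(), q) != data.end());
  }
  const auto stop = std::chrono::steady_clock::now();
  volatile auto sink = hits;
  (void)sink;
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

int main() {
  std::mt19937 rng(42);
  for (std::size_t size = 1; size <= (1u << 20); size *= 2) {
    // Sorted array of random values.
    std::vector<int> data(size);
    std::uniform_int_distribution<int> dist(0, 1 << 30);
    for (auto& x : data) x = dist(rng);
    std::sort(data.begin(), data.end());

    // 1000 random queries drawn from the array, so every lookup succeeds.
    std::vector<int> queries(1000);
    std::uniform_int_distribution<std::size_t> pick(0, size - 1);
    for (auto& q : queries) q = data[pick(rng)];

    const auto branchless = time_ns(data, queries,
        [](auto b, auto e, int v) { return ::lower_bound(b, e, v); });
    const auto standard = time_ns(data, queries,
        [](auto b, auto e, int v) { return std::lower_bound(b, e, v); });

    std::cout << "array size = " << size << '\n'
              << "branchlessTime = " << branchless << '\n'
              << "stdTime        = " << standard << '\n';
  }
}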

 array size   branchlessTime    stdTime
          1            18000      18000
          2            18000      30000
          4            19000      39000
          8            19000      48000
         16            20000      59000
         32            21000      69000
         64            21000      78000
        128            22000      91000
        256            24000     103000
        512            28000     113000
       1024            31000     127000
       2048            35000     139000
       4096            38000     146000
       8192            42000     140000
      16384            40000     139000
      32768            45000     149000
      65536            64000     189000
     131072            79000     216000
     262144            91000     233000
     524288           105000     260000
    1048576           195000     351000
    2097152           142000     333000
    4194304           397000     444000
    8388608           931000     961000
   16777216          1159000     996000
   33554432          1327000    1132000
   67108864          1507000    1578000
  134217728          1603000    1710000
  268435456          5921000    5980000
  536870912         23558000   15002000

As you can see, the benefit starts to deteriorate as the array size grows; at the largest sizes the two implementations are roughly even, and std::lower_bound sometimes comes out ahead.