Introduction To Low Latency Programming: Minimize Branching And Jumping

This post originally appears as a chapter in my new book: ‘Introduction To Low Latency Programming’, a short and approachable entry into the subject. Available now for purchase on Amazon.

This chapter will discuss how branching and jumping in our code affect our runtime performance and how we can avoid them in our effort to reduce the latency of our programs. Branching refers to process execution that can go down one of multiple paths; this functionality is provided by ‘check and jump’ instructions. Jumping refers to when the program control pointer is changed to a different location in memory rather than progressing to the next sequential instruction. This can be conditional (as part of branching) or unconditional.

What Does It Cost?

Why do we want to discourage jumping and branching? Firstly, ‘comparing and jumping’ simply adds more instructions. As mentioned throughout the book, we want to minimize instructions overall. Secondly, branching disrupts the smooth flow of instruction prefetching: if execution takes a path the branch predictor did not guess, the instruction pipeline may have to be flushed and refilled, and a new sequence of instructions retrieved.

The branch predictor is an element of CPUs that aims to predict which branch will be taken in our code during runtime. Branch predictors have gotten more and more complex, but the fundamental idea is maintaining branching statistics and using them to predict where a branch will most likely jump to. As mentioned in the previous paragraph, that prediction allows the pre-fetch of upcoming instructions.

More instructions and instruction pipeline invalidation both contribute to the stalling of program execution. And stalling means we are not running the important instructions that directly contribute to fulfilling the business objectives. In an ideal world, our program would run these ‘golden instructions’ uninhibited.

Examples Of Branching And Jumping

The following short list contains some common code constructs which will generate jumping and/or branching instruction constructs:

  • Branching and Jumping: If statements, Chained boolean conditionals, Virtual method calls
  • Jumping: Function calls

Steps To Reduce Branching/Jumping

Branchless Boolean Expressions

Standard boolean expression operations such as && and || have ‘short-circuiting’ as a feature: evaluation stops as soon as the result is known. For &&, subsequent expressions are not evaluated once one evaluates to false; for ||, once one evaluates to true:

const bool andExpression =
  false /*will be evaluated*/ &&
  true /* will not be evaluated*/;
const bool orExpression =
  true /*will be evaluated*/ ||
  false /* will not be evaluated*/;

As expected, this introduces branching to our code and should be considered for removal. What do we use instead? You can use the bitwise AND and OR expressions:

const bool andExpression = false & true; // Both will be evaluated
const bool orExpression = true | false; // Both will be evaluated

When should you switch to this approach? There is a simple rule of thumb: switch from logical to bitwise boolean expressions when the cost of evaluating all terms in the expression does not outweigh the overhead cost of branching.

It still makes sense to use logical boolean expressions if subsequent expressions are expensive to evaluate, such as the following:

const bool result = cheapCheck() && expensiveCheck();

The short-circuiting branch allows us to avoid the expensive check, which the branchless, bitwise version would not. Tip: For the logical flavor, it makes sense to arrange the expressions in a chain from least to most expensive to enable short-circuiting before the largest computation costs would be incurred.

Take a good look at the cost of the expression terms before deciding on using the branching or branchless boolean expressions.

Write Code Which Generates ‘CMOV’-like instructions

CMOV is a built-in instruction on the x86 architecture which does a ‘conditional move (copy)’. While spiritually similar to a branch, it is significantly cheaper than a real ‘check and jump’. Since we are not writing assembly by hand, we won’t be directly emitting CMOV. Rather, we will try to use code constructs that will be turned into CMOV instructions. One of these is the ternary operator. Regardless, a good compiler will use CMOV when it can, even for if statements (when they simply assign a value to a variable).

Ternary operator use:

const int output = useFortyTwo() ? 42 : 24;

Make sure to measure the actual effect of using CMOV instructions with your program and data. Sometimes well-predicted branches can be more performant than CMOV.

Remove The Use Of Virtual Functions

Virtual function calls perform a jump. When it comes time to call the virtual function, the class’s generated virtual lookup table (vtable) is consulted. Based on the values in the table, the correct implementation of the method is ‘jumped to’ in the executable. This incurs the cost of loading the vtable as well as an indirect jump. Ideally, we will avoid virtual function calls in our programs. If you are using pure virtual classes as a form of interface, you can consider using C++20 concepts instead. They will allow you to constrain generic code and create ‘interfaces’ via requires clauses.

Of course, if you genuinely need dynamic dispatch (runtime-selected code execution), there is no alternative to runtime branching/jumping. Other approaches such as sum types (std::variant) or type erasure all perform the same kinds of instructions under the hood. Think about whether you really need dynamic dispatch.

Inline Your Functions

Function inlining is another feature we can leverage to reduce jumps. Besides avoiding the jump to a different location in our code and the costs that come with it, we also don’t have to pay the function call overhead. This overhead includes saving register state, filling registers and/or the stack with the function arguments, adjusting the stack pointer, and saving the return address. Function inlining puts a copy of the called function’s body right at the call site. This can result in some great performance gains, as there is no jump and the instruction stream stays sequential and spatially localized. You can use the GCC always_inline attribute on functions you want to be strictly inlined. For example:

[[gnu::always_inline]] inline
int function(int a) {
  return a + 11232;
}

Even though this function would most likely have been inlined anyway due to its simplicity, this attribute forces the compiler to do so, warns us if it is unable to, and also encodes the inlining objective within the code itself for all programmers to see.

Compiler Hints

If branching is absolutely necessary in your code, you can use compiler hints to provide the compiler with extra information that it cannot infer. Using this information, the compiler will potentially lay out instructions so that the branch you marked as ‘likely’ is the more efficient path from a conditional branching perspective. This can be done using the __builtin_expect function, or from C++20 onwards with the [[likely]] and [[unlikely]] attributes. You can make the __builtin_expect function more ergonomic to use by creating LIKELY and UNLIKELY macros:

#define LIKELY(x) __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

Let’s explore a simple use case and its effects. Take a look at this simple function, and its corresponding assembly:

C++:

void func(int input) {
  if (input == 1) {
    // BRANCH ONE
  } else if (input == 2 ) {
    // BRANCH TWO
  } else {
    // BRANCH THREE
  }
}

Assembly:

func(int): # @func(int)
  movq %rdi, %rax
  leaq 1(%rdi), %rcx
  cmpl $2, %esi
  je .LBB0_3
  cmpl $1, %esi
  jne .LBB0_4
  # BRANCH ONE
  retq
.LBB0_3:
  # BRANCH TWO
  retq
.LBB0_4:
  # BRANCH THREE
  retq

Notice that the cmpl instruction is first executed with ‘2’ as the argument; only if we didn’t jump does it then execute with ‘1’. What if we know that the value in %esi is more likely to be equal to ‘1’ and we want to prioritize that branch? Let’s add a LIKELY annotation to our C++ code:

void func(int input) {
  if (LIKELY(input == 1)) {
    // BRANCH ONE
  } else if (input == 2 ) {
    // BRANCH TWO
  } else {
    // BRANCH THREE
  }
}

Now we can observe the difference in the output assembly:

func(int): # @func(int)
  pushq %rbx
  cmpl $1, %esi
  jne .LBB0_1
  # BRANCH ONE
  retq
.LBB0_1:
  cmpl $2, %esi
  jne .LBB0_4
  # BRANCH TWO
  retq
.LBB0_4:
  # BRANCH THREE
  retq

As you can see, the compiler rearranged the branching instructions. The comparison of %esi with 1 is now the first cmpl + jne, and if equality is determined we don’t jump anywhere else; spatially, we are already at the BRANCH ONE code. The compiler also moved the remaining conditionals after the first jump at .LBB0_1:. It is quite apparent that all remaining cases are deprioritized from an optimization standpoint in comparison to BRANCH ONE.

Now for some semantic rambling. Even though they have been accepted as standard terms, likely and unlikely are not really accurate from a definition perspective. You are not really marking that one conditional branch is more likely than the other to occur. What you are doing is telling the compiler that you want to optimize for this conditional branch. Perhaps OPTIMIZE and UNOPTIMIZE are more accurate terms for our construct. Making your code as readable/understandable/surface-level as possible is always something worth pursuing.

Cheaper Branching

Not all ‘branching’ instructions are created equal. Consider the following function that performs a few compare and jumps:

void processThisString(std::string_view input)
{
  if (input == "production") {
    processProd(input);
  } else if (input == "RC") {
    processRC(input);
  } else if (input == "beta")
    processBeta(input);
  }
}

This is a ‘run-of-the-mill’ if statement. However, we have an opportunity here: the comparison set is very constrained. In fact, the calls to the == operator distill to a memcmp, which will stop after the first character, since all the first characters in the string literals differ. For example, if the input’s value is “beta”, we will perform two calls to == for nothing. Instead, we can rewrite this in a more efficient manner by using the switch construct on that first character:

void processThisString(std::string_view input)
{
  constexpr auto i = 0;
  switch (input[i]) {
    case "production"[i]: processProd(input); break;
    case "RC"[i]: processRC(input); break;
    case "beta"[i]: processBeta(input); break;
  }
}

The compiler will most likely generate a more efficient branching construct using jump tables and binary decision trees, since data of only 1 char width is being compared now. A best-case scenario would be if the range of first character values is very limited, because then a single jump table would be used to update the program control pointer. Of course, this is not cut and dried, because a small set of inputs may cause an if statement to perform better. Always measure! Regardless, this is a good approach to keep in mind for optimization.

Chapter Summary

  • Jumping to distant code locations or branching can incur a heavy cost in low latency segments of the code. This is due to added instructions, instruction pipeline de-optimization and cache eviction.
  • Examples of Branching/Jumping include: if statements, chained conditionals, virtual method calls, and function calls in general.
  • You can avoid branching/jumping by using branchless conditionals, writing code that generates ‘CMOV’ instructions, removing the use of virtual functions, and inlining your functions.
  • If branching is unavoidable, try to use a ‘cheaper’ variation of it.
