Truncating String White-space At Compile Time in C++

One problem that arises when interleaving SQL queries in C++ code is string literal formatting and spacing. Most queries are much more human-digestible in a multi-line format. C++ treats adjacent string literals as one, so the traditional C++ solution is this:

const char* query =
    " SELECT "
        " u.id, "
        " u.user_name, "
        " u.ref_id, "
        " u.postal_code, "
        " u.email, "
        " o.transaction.id "
    " FROM "
        " users u "
    " JOIN "
        " orders o ON o.user_id = u.id "
    " WHERE "
        " u.id=? AND u.active=? ";

A programmer has to be very careful with the white-space before and after the SQL tokens, as its presence or lack thereof could be the difference between a good night’s sleep and a run-time exception triggering an on-call alert late at night. Fun! Overall this code snippet is much more difficult to read due to the syntactical mess of quotation marks around it. To help prevent issues I have even heard this old platitude from a veteran C++ programmer: “A good surgeon washes his hands before and after the surgery”; mandating spaces at the beginning and end of each line. Is a rule of thumb really a best practice? One’s best work is done with as little distraction as possible, and this is something else to worry about.

Modern C++ provides us with one useful addition that can help mitigate this issue: the raw string literal. Leveraging this feature we can write the query naturally; all within one pair of quotations:

const char* query = R"(
    SELECT
        u.id,
        u.user_name,
        u.ref_id,
        u.postal_code,
        u.email,
        o.transaction.id
    FROM
        users u
    JOIN
        orders o ON o.user_id = u.id
    WHERE
        u.id=? AND u.active=?
)";

This works rather well. You can define the exact same query, presented simply and clearly. You don’t have to ‘worry’ about the C++ surrounding the SQL. You can just focus on the SQL. The problem is, all those spaces, tabs and newline characters are also in the final query string. The raw string is 298 characters long versus the original 166 characters. This means you pass a query string that is almost double in length to any query parsing function/module. Now, with the performance of modern hardware this penalty is negligible. However, sometimes this code is run in a hot path, or you might have a sense of guilt for such an obvious efficiency sacrifice. Let’s make things right in the universe…

The first, conceptually simplest, obvious solution is to run a string-slimming function on all queries when program execution starts. This approach is good enough, but all sub-approaches in this branch have disadvantages:

  • If you want the query to be ready before a certain function is called, you will unfortunately have to move the query variable out of its logical code block. This can reduce the clarity of intent and logic locality.
  • If you want to keep the query nested where it’s used, you will have to devise some kind of static initialization function that runs the first time the function is called. A static flag will have to be checked every time the function is called to determine whether the string has been ‘slimmed’ yet.
  • You will need to allocate a new destination buffer for the ‘slimmed’ query.
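
The second option can be sketched roughly as follows. This is only an illustration: slim and activeUserQuery are hypothetical names, and std::string stands in for the newly allocated destination buffer.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical run-time helper: collapse every run of whitespace into one space.
std::string slim(const char* raw) {
    assert(raw != nullptr);
    std::string out;
    bool previousIsWhitespace = false;
    for (const char* p = raw; *p != '\0'; ++p) {
        const bool ws = std::isspace(static_cast<unsigned char>(*p)) != 0;
        if (ws && previousIsWhitespace) {
            continue;                    // drop repeated whitespace
        }
        out.push_back(ws ? ' ' : *p);    // first whitespace of a run becomes a space
        previousIsWhitespace = ws;
    }
    return out;
}

// The query stays next to the code that uses it. The function-local static is
// initialized on the first call; every later call only checks a hidden guard.
const std::string& activeUserQuery() {
    static const std::string query = slim(R"(
        SELECT u.id
        FROM users u
        WHERE u.active=?
    )");
    return query;
}
```

The function-local static keeps the query beside the code that uses it, at the price of a heap allocation on the first call and a guard check on every call after that.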

As suggested by this article’s title we can actually do this string processing at compile time, with no penalty to our code’s logic or clarity.

‘constexpr’ Variables, Functions and Classes

Modern C++ provides the constexpr functionality to provide compile-time code utilities. There are a lot of resources available online detailing this, so I won’t go into detail. Fundamentally, you can mark your functions, classes and variables with the constexpr keyword to signify they could potentially be called or created at compile time. This is not a guarantee, but a contract of sorts. A summary of the rules:

  • A constexpr variable must have a literal type (what this article loosely calls a ‘constexpr type’) and be initialized with a constant expression: a literal, another constexpr variable, or a call to a constexpr function or constructor whose arguments are themselves constant expressions.
  • A constexpr function can only take and return literal types. The key thing to know is that arguments can be accepted as compile-time values, but not passed on as constexpr. Inside the function body a parameter is never a constant expression, even when the caller supplied one, so it cannot be used as a template argument or an array bound.
  • A literal class type can only have literal-type members and must have at least one constexpr constructor (or be an aggregate). All scalar types and arrays of literal types qualify.
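
A small illustration of these three rules, using a made-up Point type and manhattan function (neither appears in this article’s code, they exist only for this sketch):

```cpp
// A literal class type: scalar members plus a constexpr constructor.
struct Point {
    int x;
    int y;
    constexpr Point(int px, int py) : x(px), y(py) {}
};

// A constexpr function: usable at compile time when its arguments are
// constant expressions, and at run time otherwise.
constexpr int manhattan(Point p) {
    // static_assert(p.x >= 0, "");  // would not compile: inside the body,
    //                               // `p` is not a constant expression
    return (p.x < 0 ? -p.x : p.x) + (p.y < 0 ? -p.y : p.y);
}

// constexpr variables: each initializer is a constant expression.
constexpr Point offset(3, -4);
constexpr int offsetLength = manhattan(offset);

static_assert(offsetLength == 7, "computed by the compiler");
```

The commented-out line is the rule that matters most later on: a value that arrives through a function parameter cannot be turned back into a template argument.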

I’m fairly new to constexpr myself, so please do your own research, and reach out if I’m incorrect.

Implementing a Compile Time String

In order to perform our string parsing/creation at compile time, we need to implement a string class that satisfies the constexpr ‘contract’ for types. We must ensure that all methods we want to call after the constexpr variable construction are marked as ‘const’. The implementation is pretty self-explanatory when keeping in mind the restrictions:

#include <cstddef>   // std::size_t
#include <iterator>  // std::begin

namespace compiletime {

template<std::size_t MaxSize = 30>
class string
{
    char m_data[MaxSize] = { 0 };
    std::size_t m_size = 0;
public:
    constexpr string() : m_data{}, m_size(0) {}
    // Copies exactly MaxSize characters, so str must point to at least that
    // many. When MaxSize is the array size of a string literal, the null
    // terminator is copied (and counted in size()) too.
    constexpr string(const char* str) : m_data{}, m_size(0) {
        for(std::size_t i = 0; i < MaxSize; ++i) {
            m_data[m_size++] = str[i];
        }
    }

    constexpr char const* data() const { return m_data; }
    constexpr operator const char*() const { return data(); } // for convenience
    constexpr void push_back(char c) { m_data[m_size++] = c; }
    constexpr char& operator[](std::size_t i) { return m_data[i]; }
    constexpr char const& operator[](std::size_t i) const { return m_data[i]; }
    constexpr std::size_t size() const { return m_size; }
    constexpr const char* begin() const { return std::begin(m_data); }
    constexpr const char* end() const { return std::begin(m_data) + m_size; }
};

}

Implementing a Compile Time Parser

Now that we have a string class that can be used with constexpr variables, we can run constexpr functions that take this string type as arguments. Remember, in constexpr functions all arguments can be accepted as constexpr, but not passed as constexpr. Additionally, we can only use other constexpr functions and types within the function.

First, let’s implement a simple function which checks if a provided character is a whitespace character:

constexpr bool is_whitespace(char c) {
    return
        (c == ' ') ||
        (c == '\t') ||
        (c == '\n') ||
        (c == '\v') ||
        (c == '\f') ||
        (c == '\r');
}

Notice the constexpr keyword as well as the fact that all types used (char) are constexpr-friendly. The actual ‘business logic’ is self-explanatory.
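
Because the function is constexpr, its behaviour can be checked by the compiler itself. A minimal sketch, repeating the function in condensed form so the snippet stands alone:

```cpp
constexpr bool is_whitespace(char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

// Evaluated during compilation: a wrong answer is a build error, not a
// run-time surprise.
static_assert(is_whitespace(' '), "space is whitespace");
static_assert(is_whitespace('\n'), "newline is whitespace");
static_assert(!is_whitespace('u'), "letters are not whitespace");
```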

Moving on and focusing on our goal, the function we want to implement takes one compiletime::string, iterates over it, and collapses consecutive white-space and newlines. This can be done via a simple for-loop, with push_back calls appending the preserved characters to a new compiletime::string instance. A boolean flag allows us to keep track of the parser state. Check it out:

template<std::size_t N>
constexpr auto truncateWhitespace(compiletime::string<N> str) {
    // N (a non-type template parameter) is the maximum size of the result
    compiletime::string<N> result;
    bool previousIsWhitespace = false; // Keep track if the previous character was whitespace
    for(char c : str) {
        if(is_whitespace(c)) {
            // If the last character was whitespace, continue iteration
            if(previousIsWhitespace) {
                continue;
            }
            // First whitespace of a run (new lines included): set flag and
            // emit a single space so that adjacent tokens never merge
            previousIsWhitespace = true;
            c = ' ';
        } else {
            // Not whitespace: Reset flag
            previousIsWhitespace = false;
        }

        result.push_back(c); // Otherwise; add character to new string
    }
    return result;
}

And finally, we can provide a simple overload to allow direct string literal use:

template<std::size_t N>
constexpr auto truncateWhitespace(const char (&str)[N])
{
    compiletime::string<N> tmp(str); // build instance
    return truncateWhitespace(tmp); // run function
}

The Spoils

Now, we can actually use this construct in our code:

// Declared as an array (auto would decay to const char*) so that N can be deduced
constexpr const char query[] = R"(
    SELECT
        u.id,
        u.user_name,
        u.ref_id,
        u.postal_code,
        u.email,
        o.transaction.id
    FROM
        users u
    JOIN
        orders o ON o.user_id = u.id
    WHERE
        u.id=? AND u.active=?
)";

constexpr auto trucatedQuery = truncateWhitespace(query);
std::cout << trucatedQuery;
// output:
// " SELECT u.id, u.user_name, u.ref_id, u.postal_code, u.email, o.transaction.id FROM users u JOIN orders o ON o.user_id = u.id WHERE u.id=? AND u.active=? "

The new query length is 154 characters, which obviously beats both the original raw literal (298) and the old “spaces before and after” mantra with traditional one-line string literals (166). And we can validate that this is actually happening at compile time with a static_assert. If you are using a modern IDE/Editor with C++ integration, a failing assertion of this kind will be highlighted before you even run compilation:

static_assert(trucatedQuery.size() == 154);
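
The size alone does not prove the content is right. As a sketch, the text itself can be compared at compile time with std::string_view (C++17). The class below is a cut-down stand-in for compiletime::string, reduced to what this check needs, and the short query is made up for the example:

```cpp
#include <cstddef>
#include <string_view>

namespace compiletime {
// Cut-down stand-in for the article's class.
template <std::size_t MaxSize>
class string {
    char m_data[MaxSize] = {};
    std::size_t m_size = 0;
public:
    constexpr string() = default;
    constexpr void push_back(char c) { m_data[m_size++] = c; }
    constexpr const char* data() const { return m_data; }
    constexpr std::size_t size() const { return m_size; }
};
}

constexpr bool is_whitespace(char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

template <std::size_t N>
constexpr auto truncateWhitespace(const char (&str)[N]) {
    compiletime::string<N> result;
    bool previousIsWhitespace = false;
    for (std::size_t i = 0; i < N; ++i) {  // the terminator is copied as well
        const bool ws = is_whitespace(str[i]);
        if (ws && previousIsWhitespace) {
            continue;
        }
        result.push_back(ws ? ' ' : str[i]);
        previousIsWhitespace = ws;
    }
    return result;
}

constexpr const char query[] = R"(
    SELECT
        1
    )";
constexpr auto slimmed = truncateWhitespace(query);

static_assert(slimmed.size() == 11, "ten characters plus the terminator");
static_assert(std::string_view(slimmed.data()) == " SELECT 1 ",
              "the content is checked by the compiler as well");
```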

We have succeeded and in the process gained advantages over both previous approaches:

  • Clearer and less cluttered code site: Freedom to use raw string literals to format queries with as much indentation as you’d like.
  • Neater, Compact Logging: New lines scattered in log files make them harder to parse and understand.
  • Less Network Bandwidth Usage: If you are using a database that directly accepts the query string, you will be sending roughly half the bytes over the network.
  • Faster Performance: If you do any local string validation/processing/formatting, you will gain some performance.

For the sake of it, I did a simple benchmark comparing the generated compiletime::string buffer to a const char* with the exact same literal. On my system, iterating over the provided string using a simple pointer and while loop resulted in consistently faster performance averaged over many iterations. I would be greatly concerned if it didn’t, as there are now half as many characters as before! A simple O(N) iteration operation will obviously be faster on fewer elements.

I can see this technique providing excellent payoff for queries with many joins and deeper indentation (maybe in a lambda inside a class member function or something), sent over the network and parsed on a regular, frequent basis. It also just feels great, knowing you’ve reached the holy grail of performance, aka compile time operations.

Limitations

There is one obvious limitation of this approach. Even though the ‘string’ value is shortened, we pack a buffer of the original size into the executable, padded with zeroes at the end of the string. This is because the final string size is needed as a compile time template value to construct the string class. Unfortunately, once we’re inside the truncateWhitespace function, the truncated length is an ordinary value rather than a constant expression, so we can’t use it as the size of the string type. We also can’t make any assumptions about what the final size will be, as there may not be any superfluous white-space (a simple N / 2 estimate could result in compilation failures in some cases). You could solve this by implementing a constexpr function to calculate the final length before doing the actual truncation. Something like this:

constexpr const char originalQuery[] = R"(...)";
constexpr std::size_t slimmedSize = calculateTruncatedWhitespaceSize(originalQuery);
constexpr auto trucatedQuery = truncateWhitespace<slimmedSize>(originalQuery);

I cannot currently think of a way to do this that does not leverage a macro. We can’t create a function to wrap this functionality, since the arguments would not be considered constexpr inside it and therefore could not be used to calculate a constexpr length to be provided as a non-type template parameter to the string class. Please let me know if you can suggest a technique or trick!
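
One macro-free possibility, offered as a sketch rather than a vetted solution, is to pass the source string as a reference template parameter (C++17) instead of a function argument. Template arguments remain constant expressions inside the function, so a first pass can size the result of the second. The name truncateWhitespaceExact and the std::array result type are assumptions of this sketch, and newlines are treated like any other white-space here:

```cpp
#include <array>
#include <cstddef>
#include <string_view>

constexpr bool is_whitespace(char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

// First pass: count the characters that survive, excluding the terminator.
template <std::size_t N>
constexpr std::size_t calculateTruncatedWhitespaceSize(const char (&str)[N]) {
    std::size_t count = 0;
    bool previousIsWhitespace = false;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        const bool ws = is_whitespace(str[i]);
        if (!(ws && previousIsWhitespace)) {
            ++count;
        }
        previousIsWhitespace = ws;
    }
    return count;
}

// Second pass: Src is a template argument, so the size computed from it is a
// constant expression and can be used as an array bound.
template <const auto& Src>
constexpr auto truncateWhitespaceExact() {
    constexpr std::size_t size = calculateTruncatedWhitespaceSize(Src);
    std::array<char, size + 1> result{};  // +1 keeps a null terminator
    std::size_t out = 0;
    bool previousIsWhitespace = false;
    for (std::size_t i = 0; i + 1 < sizeof(Src); ++i) {
        const bool ws = is_whitespace(Src[i]);
        if (!(ws && previousIsWhitespace)) {
            result[out++] = ws ? ' ' : Src[i];
        }
        previousIsWhitespace = ws;
    }
    return result;
}

static constexpr char originalQuery[] = R"(
    SELECT
        u.id
    FROM
        users u
)";
constexpr auto slimmedQuery = truncateWhitespaceExact<originalQuery>();

static_assert(slimmedQuery.size() == sizeof(" SELECT u.id FROM users u "),
              "the buffer holds the slimmed text and its terminator, nothing more");
static_assert(std::string_view(slimmedQuery.data()) == " SELECT u.id FROM users u ",
              "content check");
```

The catch is that the source array must have static storage duration (a namespace-scope or static constexpr array) to be usable as a template argument, so the raw literal can no longer be written directly at the call site.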

Another limitation is using string literals within the SQL itself: white-space inside a quoted SQL value would be collapsed along with everything else. Luckily, this can be mitigated by enhancing the parser to keep track of whether the current point of iteration is within a string literal or not.
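
A rough sketch of such an enhancement, assuming SQL values are delimited by single quotes. SlimBuffer and truncateOutsideQuotes are hypothetical names used only here, and other quoting styles (double-quoted identifiers, comments) are not handled:

```cpp
#include <cstddef>
#include <string_view>

constexpr bool is_whitespace(char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

// Minimal fixed-size result buffer for this sketch.
template <std::size_t N>
struct SlimBuffer {
    char data[N] = {};
    std::size_t size = 0;
};

template <std::size_t N>
constexpr SlimBuffer<N> truncateOutsideQuotes(const char (&str)[N]) {
    SlimBuffer<N> result{};
    bool previousIsWhitespace = false;
    bool insideLiteral = false;
    for (std::size_t i = 0; i + 1 < N; ++i) {  // stop before the terminator
        const char c = str[i];
        if (c == '\'') {
            // An escaped quote ('') toggles twice and leaves the state unchanged.
            insideLiteral = !insideLiteral;
        }
        // Characters inside a literal are never treated as collapsible.
        const bool ws = !insideLiteral && is_whitespace(c);
        if (ws && previousIsWhitespace) {
            continue;
        }
        result.data[result.size++] = ws ? ' ' : c;
        previousIsWhitespace = ws;
    }
    return result;
}

constexpr auto slimmed = truncateOutsideQuotes("SELECT  'a   b'   FROM   t");

static_assert(std::string_view(slimmed.data, slimmed.size) == "SELECT 'a   b' FROM t",
              "spacing inside the quoted value is preserved");
```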

In Closing

Though the performance gains are negligible in most cases, this approach provides real benefits in clarity, logging and efficiency. You can leverage this same technique to do any kind of string pre-processing at compile time as needed.

Link to Complete Code