It’s been a little while since my last high-performance post, but my use of the techniques and features continues! In this post, I want to present a more practical example which I hope will help to illustrate a real-world use case for some of the new .NET Core performance focused API changes.
For a service that I’m currently working on, in a few places we need to generate unique identifiers which are included on some documents we index to Elasticsearch. We’re doing this at scale for about 18 million new documents every day. A recent new requirement is to generate and store four additional identifiers. A design goal here was to limit the increase of the document storage size by reducing the length of the unique identifier strings that will be included in the document.
Benchmarking the Original Code
During the initial proof of concept stage, I’d used some quick and dirty code to get the base64 encoded string…
Convert.ToBase64String(_guid.ToByteArray()).Replace("/", "-").Replace("+", "_").Replace("=", "")
This code is borrowed from this example posted in a popular StackOverflow question about base64 encoding GUIDs.
This code first gets a byte array from the GUID. This array is passed to the Convert.ToBase64String method which returns an encoded string. Since we need our identifiers to be safe for use in URL query strings, this code also replaces any unsafe characters using the Replace method. Finally, it removes any trailing padding characters. When base64 encoding a GUID, we’re guaranteed to get two padding characters ‘==’ at the end of the string.
This code works and it does the job, but it allocates a fair amount in the process. There’s the creation of a byte array and the calls to Replace that we should be wary of. Replace, for example, causes a new string to be allocated when replacing characters because strings are immutable. For general cases, this might be okay, but as we take our service from a proof of concept stage to an alpha/beta state, we know the request volumes will increase. This service needs to handle hundreds of requests per second if we’re to avoid heavy scaling.
Therefore, I set about writing a new approach to base64 encode GUIDs, which should hopefully be faster, but most importantly reduce the number of bytes allocated.
Before starting work, I took the first important step, which was to benchmark the existing code. If you want to learn more about benchmarking, you can read my previous benchmarking blog post.
Here’s the benchmark code:
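The benchmark code isn’t reproduced here, but based on the description it would look something like the following sketch. This assumes the BenchmarkDotNet NuGet package; the class and method names are my own choices, not necessarily those from the original.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// MemoryDiagnoser adds the allocated-bytes column to the results.
[MemoryDiagnoser]
public class GuidEncodingBenchmarks
{
    private static readonly Guid _guid = Guid.NewGuid();

    [Benchmark(Baseline = true)]
    public string ConvertToBase64Replace()
    {
        // The original quick-and-dirty approach.
        return Convert.ToBase64String(_guid.ToByteArray())
            .Replace("/", "-")
            .Replace("+", "_")
            .Replace("=", "");
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<GuidEncodingBenchmarks>();
}
```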
In this case, I directly exercise the original code as a baseline benchmark.
After running the benchmark, the summary results are as follows:
With the benchmark results in hand, my goal was to improve upon the 192 bytes allocated per iteration.
As an aside: I ran the benchmark a few times. Curiously, on some occasions, I was getting different allocation counts.
This initially looked like a problem with how I was exercising the code or running my benchmark. My first conclusion when anything unexpected occurs is that I’ve screwed up somewhere! In this case, I realised that it might be the behaviour of the Replace method. There’s no guarantee that I’ll always have unsafe characters to replace in the base64 encoded string. Looking at the corefx source code on GitHub for the Replace method, I noticed that indeed, if the character being replaced is not found in the string, then the original string is returned, unchanged. This is an optimisation which avoids new string allocations unless there’s a need for them. Mystery solved! 192 B was the smallest allocation size I noted during my benchmarking, with the worst case being 352 B.
Optimising the Code
Let’s take a look at the final version of the code that I ended up with (at least for now). Spoiler alert: this has fewer allocations! We’ll see the results soon. This is the complete code; don’t worry, we’ll step through it line-by-line to understand what it does.
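Reconstructed from the step-by-step walkthrough that follows, the method looks something like this. The method and class names are my own choices for illustration.

```csharp
using System;
using System.Buffers.Text;
using System.Runtime.InteropServices;
using System.Text;

public static class GuidExtensions
{
    public static string ToBase64UrlString(this Guid guid)
    {
        // 16 bytes for the raw GUID, 24 for its base64 representation.
        Span<byte> guidBytes = stackalloc byte[16];
        Span<byte> encodedBytes = stackalloc byte[24];

        // Write the raw bytes of the GUID into the first buffer.
        MemoryMarshal.TryWrite(guidBytes, ref guid);

        // Base64 encode those bytes, as UTF-8 text, into the second buffer.
        Base64.EncodeToUtf8(guidBytes, encodedBytes, out _, out _);

        // Replace any URL-unsafe characters. The final two bytes are always
        // '=' padding, so only the first 22 need checking.
        for (var i = 0; i < 22; i++)
        {
            if (encodedBytes[i] == (byte)'/')
                encodedBytes[i] = (byte)'-';
            else if (encodedBytes[i] == (byte)'+')
                encodedBytes[i] = (byte)'_';
        }

        // Create the final string from the first 22 bytes, trimming the padding.
        return Encoding.UTF8.GetString(encodedBytes.Slice(0, 22));
    }
}
```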
This method is defined as an extension method on the GUID type for ease of use.
The first thing this code does is to stack allocate two byte arrays. This makes use of the Span<T> feature which allows us to safely reference stack memory. Both of my arrays are very small and can safely be stored on the stack. The advantage of this is that there’s no object creation on the heap.
This is not only fast, but it also avoids the need for garbage collection (GC) to clean up these objects. While they would only ever be short-lived, small Gen-0 arrays, when this method is called 18 million times per day, that’s 36 million small objects we can avoid creating!
The next line is:
MemoryMarshal.TryWrite(guidBytes, ref guid);
What this achieves is to get the raw bytes of the GUID. MemoryMarshal is a class which provides methods designed to work with the new Span<T> and Memory<T> types (plus their read-only variants). In this case, I’m using its TryWrite method. This takes a struct, by reference, and attempts to write its bytes into a provided Span<byte>. A GUID is a 128-bit integer which occupies 16 bytes. That’s why I stack allocated a byte array of 16 elements for this method to write those bytes into.
The next line is:
Base64.EncodeToUtf8(guidBytes, encodedBytes, out _, out _);
This line allows us to encode the binary bytes for the Guid into UTF-8 encoded text, represented as base 64. The overload I’m using works with Spans so I can provide my Span<byte> buffers to it. This is good since it means there’s no allocation involved. Typical methods for base 64 encoding data would return the encoded string which creates an allocation. Since I have further processing to do, I want to avoid surfacing an intermediate string at this stage.
The method first accepts a ReadOnlySpan<byte> which is the input data to be encoded. Here, that is the 16 bytes we wrote out from the value of the GUID. The second argument is a Span<byte> where the encoded bytes will be written. Since I know that a GUID will always be encoded to 24 bytes, I can size my stack allocated byte array to the correct capacity. This method also provides two out parameters indicating the bytes consumed and written. I discard these since I’m confident here of the effect of encoding my 16 GUID bytes.
Since my unique ID may need to be transmitted between services over HTTP, potentially in a query string, I want to ensure that it is URL safe. It’s possible that some unsafe characters will appear when base64 encoding a GUID. The next loop in the code is responsible for replacing any unsafe characters.
For reasons that we’ll come to very soon, I only care about the first 22 bytes of the encoded data. I loop through these and individually check if the byte at the current index contains either a forward slash or a plus. In those cases, I replace them with the bytes of safe alternative characters, a dash or underscore respectively.
Finally, we create the encoded string:
var encodedString = Encoding.UTF8.GetString(encodedBytes.Slice(0, 22));
We can use the GetString method on the UTF8 encoding which accepts a ReadOnlySpan<byte> as one of its overloads. The reason I’m slicing my Span<byte> to a length of 22 bytes is that base 64 encoding a GUID will always append two padding characters of ‘=’. In our case, we can trim these off to save an extra couple of bytes in our stored document. The change that I’m making adds four of these new identifiers into each document. Since we store 18 million documents a day, that’s an extra 144 MB per day that we can avoid storing in Elasticsearch. Data storage is cheap, but this has enough value at this scale to be worth the effort.
Benchmarking the Improved Code
We should never assume that the changes we’ve made do what we expect and perform better than the original code, so the final step is to update the benchmarks to compare the original version against this optimised one.
Here is the result of running these benchmarks:
In this case, the mean execution time of the original code is 202 ns, with 272 bytes allocated. As we saw earlier, the allocations vary between benchmark runs, because some GUIDs may not require as many string replacements as others. The 272 B figure is actually a fair number to compare against as it’s the middle value that I saw across all of my benchmarking runs.
The mean execution time for my optimised code is 86 ns, 2.35x quicker. That’s great, although it’s not the main number I was looking to improve. My goal was to reduce unnecessary allocations. Here, the new code is now allocating 72 bytes.
At first glance, that may suggest we have work to do. In reality, we don’t, since we require a string as the output of the method. Consumers expect the string value, so we have to allocate that at some point. In this case, on my machine, the base 64 encoded string here is 72 B. That means that this new encoding method has no allocation overhead. If we compare the improvement against the original method, the original code allocates (in this benchmark result) an additional 200 bytes per call. Multiplied by our expected daily call count of 18 million, that’s 3.6 GB of allocations per day, per identifier. For all four identifiers, that’s 14.4 GB of allocations saved per day!
Allocations are quick, and the references short-lived, so hopefully they would only require a Gen-0 collection to clean them up. Still, that’s pressure on the garbage collector that we can remove and CPU cycles we can use for executing our code rather than performing a collection.
Update – 20-06-2019 – Even Faster Version
Only hours after posting this, Mark Rendle pointed out a further performance improvement. Whilst raw speed wasn’t my main goal, Mark’s suggestion gives an extra boost we shouldn’t ignore.
Here’s the updated code:
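Reconstructed from the description below (names again my own), the updated method looks something like this:

```csharp
using System;
using System.Buffers.Text;
using System.Runtime.InteropServices;

public static class GuidExtensions
{
    public static string ToBase64UrlString(this Guid guid)
    {
        Span<byte> guidBytes = stackalloc byte[16];
        Span<byte> encodedBytes = stackalloc byte[24];

        MemoryMarshal.TryWrite(guidBytes, ref guid);
        Base64.EncodeToUtf8(guidBytes, encodedBytes, out _, out _);

        // A char buffer for the final 22 characters, skipping the '=' padding.
        Span<char> chars = stackalloc char[22];

        for (var i = 0; i < 22; i++)
        {
            switch (encodedBytes[i])
            {
                case (byte)'/':
                    chars[i] = '-';
                    break;
                case (byte)'+':
                    chars[i] = '_';
                    break;
                default:
                    // Base64 output is always ASCII, so a direct cast is safe.
                    chars[i] = (char)encodedBytes[i];
                    break;
            }
        }

        // The string constructor accepts a ReadOnlySpan<char> directly.
        return new string(chars);
    }
}
```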
The main change here is an optimisation which casts the encoded bytes directly to chars. This is safe because, with the URL-unsafe characters replaced, we know every byte represents an ASCII character. To do this, Mark added a Span<char> over a stack allocated char array.
The code now uses a switch statement which for the majority of bytes will cast them to a char and add them to the Span<char> at the appropriate index. In cases where the bytes of an unsafe character are found, we can instead use the safe alternative.
Finally, we can pass the chars to the string constructor.
And how does this perform? Here are my results, comparing the original version with Mark’s quicker option.
All in, this provides a good gain and is 1.4x quicker in my benchmark. It’s no more complicated than my original version so it’s a nice optimisation! Thanks, Mark!
I hope that this worked example proves useful. It builds on my previous posts, which until now have used more straightforward, trivial examples to illustrate the high-performance features. This is a small, but real-world use case of the optimisation techniques, possible thanks to types like Span<T>. We made use of some other classes and methods which helped avoid allocations as they work with Span<T> instances. What’s also clear is that while this code is longer and more verbose than the original code, it’s not too complicated. It can be clearly followed and by encapsulating it as an extension method, it is easily consumed.
Thanks for reading! If you’d like to read more about high-performance .NET and C# code, you can see my full blog post series here.