Here’s the simplest way to efficiently trim a UTF-8 string to the specified number of bytes:
using System.Text;

//Extension methods must live in a static class; the class name here is arbitrary.
public static class StringExtensions
{
    public static string TrimToByteLength(this string input, int byteLength)
    {
        if (string.IsNullOrEmpty(input))
            return input;

        var currentBytes = Encoding.UTF8.GetByteCount(input);
        if (currentBytes <= byteLength)
            return input;

        //Are we dealing with all 1-byte chars? Use Substring(). This cuts the time in half.
        if (currentBytes == input.Length)
            return input.Substring(0, byteLength);

        //Decode only the first byteLength bytes (no need to resize the array first)
        var bytesArray = Encoding.UTF8.GetBytes(input);
        var wordTrimmed = Encoding.UTF8.GetString(bytesArray, 0, byteLength);

        //If a multi-byte sequence was cut apart at the end, the decoder will put a replacement character '�' (U+FFFD)
        //so trim off the potential trailing '�'
        return wordTrimmed.TrimEnd('\uFFFD');
    }
}
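A UTF-8 string can contain a mix of characters that each take between 1 and 4 bytes. When you keep only part of the byte array, you may end up cutting a multi-byte char in half, and the decoder then replaces the broken sequence with the replacement character ( ‘�’ ). This is why the method trims the trailing replacement character. Note that this also removes a ‘�’ that was legitimately present at the end of the trimmed text.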
There are other approaches, such as looping and checking for invalid multi-byte sequences yourself, but that leads to code that is harder to understand and isn’t more efficient (according to performance benchmarks with 1 million character strings). Furthermore, one of the best optimizations you can do is use string.Substring() if you’re only dealing with 1-byte chars. That leads to a 2x speedup.
In this article, I’ll go into more detail about how to deal with multi-byte chars that are cut in half. At the end, I’ll show all of the unit tests used to verify that the TrimToByteLength() method works.
Dealing with a multi-byte char that was cut in half
If you only had to deal with 1-byte chars, trimming the byte array wouldn’t be a problem. In fact, if that were the case, you could just use string.Substring() instead of encoding/decoding.
But UTF-8 encoded characters can take between 1 and 4 bytes. Since you’re trimming based on byte length, you may end up chopping a multi-byte char in half.
For example, let’s say you have the following string with a Japanese character “か”. In UTF-8, this is a multi-byte char with the following three bytes:
11100011 10000001 10001011
Now let’s say you’re trimming this to only 2 bytes. This would leave the first two bytes:
11100011 10000001
This is an invalid sequence, and by default the decoder would replace it with the replacement character ‘�’.
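To see this behavior directly, here is a minimal sketch that encodes a string, keeps only the first few bytes, and decodes the result. The class and method names (TruncationDemo, DecodeFirstBytes) are hypothetical and exist only for this illustration. It assumes the default Encoding.UTF8 instance, which substitutes the replacement character for invalid input instead of throwing.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of TrimToByteLength().
public static class TruncationDemo
{
    // Encodes the text as UTF-8, keeps only the first keepBytes bytes, and decodes them.
    public static string DecodeFirstBytes(string text, int keepBytes)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        int count = Math.Max(0, Math.Min(keepBytes, bytes.Length));

        // Encoding.UTF8 uses a replacement fallback, so a broken trailing
        // sequence shows up as U+FFFD in the decoded string.
        return Encoding.UTF8.GetString(bytes, 0, count);
    }
}
```

Calling TruncationDemo.DecodeFirstBytes("か", 2) should give back a string that ends with '\uFFFD', while passing 3 returns the original character intact.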
Any code that trims a string to a specified byte length has to deal with this problem. You can either detect the invalid multi-byte sequence yourself by walking backward through the byte array and examining bytes, or you can let the decoder do the work for you and simply remove the replacement character at the end. The code shown in this article takes the latter approach, because it’s far simpler to not reinvent the wheel.
How is the invalid multi-byte sequence detected?
UTF-8 was designed to be able to determine which character a byte belongs to using the following scheme:
Sequence type | 1st byte starts with | 2nd byte starts with | 3rd byte starts with | 4th byte starts with |
1-byte char | 0 | | | |
2-byte char | 110 | 10 | | |
3-byte char | 1110 | 10 | 10 | |
4-byte char | 11110 | 10 | 10 | 10 |
The first byte in the sequence tells what kind of sequence this is, which tells you how many continuation bytes to look for. Continuation bytes start with 10.
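The scheme above can be sketched as a small bit-mask check. This is an illustration only, with hypothetical names (Utf8LeadByte, ExpectedSequenceLength), and it looks only at the leading bit pattern; a real decoder also rejects some bytes that match these patterns (for example, overlong encodings).

```csharp
// Hypothetical helper for illustration; it checks bit patterns only.
public static class Utf8LeadByte
{
    // Returns the total length (1-4) of the sequence that this byte would start,
    // or 0 if the byte is a continuation byte or can never start a sequence.
    public static int ExpectedSequenceLength(byte b)
    {
        if ((b & 0b1000_0000) == 0b0000_0000) return 1; // 0xxxxxxx
        if ((b & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx
        if ((b & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx
        if ((b & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx
        return 0; // 10xxxxxx (continuation byte) or 11111xxx (never valid)
    }
}
```

For example, the first byte of “か” is 11100011 (0xE3), so ExpectedSequenceLength(0xE3) returns 3, while its second byte 10000001 (0x81) is a continuation byte and returns 0.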
Let’s go back to the byte array with the Japanese character “か”:
11100011 10000001 10001011
When this is trimmed to 2 bytes:
11100011 10000001
When the decoder goes through this, it sees that the first byte in the sequence starts with 1110, which means it’s dealing with a 3-byte sequence. It expects the next two bytes to be continuation bytes (bytes that start with 10), but it only sees one continuation byte (10000001) before the input ends. Hence this is an invalid byte sequence, and it is replaced with the replacement character ‘�’.
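For comparison, the manual approach mentioned earlier (walking backward through the byte array) can be sketched using the continuation-byte rule. This is not the method used in this article, and the names (Utf8BoundaryTrim, TrimToByteLengthByScanning) are hypothetical. The idea is that if the first dropped byte is a continuation byte, the cut landed inside a character, so the cut point is moved back to the start of that character.

```csharp
using System;
using System.Text;

// Hypothetical alternative for illustration; not the approach used by TrimToByteLength().
public static class Utf8BoundaryTrim
{
    public static string TrimToByteLengthByScanning(string input, int byteLength)
    {
        if (string.IsNullOrEmpty(input))
            return input;

        byte[] bytes = Encoding.UTF8.GetBytes(input);
        if (bytes.Length <= byteLength)
            return input;

        // bytes[cut] is the first byte that would be dropped.
        int cut = Math.Max(byteLength, 0);

        // A continuation byte (10xxxxxx) at the cut point means a character was split,
        // so back up until the cut point sits on the first byte of a character.
        while (cut > 0 && (bytes[cut] & 0b1100_0000) == 0b1000_0000)
            cut--;

        return Encoding.UTF8.GetString(bytes, 0, cut);
    }
}
```

Because this version never decodes a broken sequence, it does not need to trim ‘�’ afterward, which also means a ‘�’ that was genuinely part of the input is preserved. The trade-off is the extra boundary logic that the decoder-based approach avoids.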
More examples of characters and their UTF-8 byte sequences
Here are more examples of characters and their byte sequences.
Character | Unicode | Byte sequence |
a | U+0061 | 01100001 |
Ć | U+0106 | 11000100 10000110 |
ꦀ (Javanese character) | U+A980 | 11101010 10100110 10000000 |
𒀃 (Sumerian cuneiform character) | U+12003 | 11110000 10010010 10000000 10000011 |
Notice the pattern in the byte sequences. The leading bits of the first byte tell you what kind of sequence it is, and the remaining bytes are continuation bytes (which all start with 10).
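If you want to inspect the byte sequence of other characters yourself, a small helper like the following can print the bits. This is a sketch with hypothetical names (Utf8Bits, ToBinary), not something the trimming method depends on.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration.
public static class Utf8Bits
{
    // Returns the UTF-8 bytes of the text as space-separated 8-bit binary strings.
    public static string ToBinary(string text)
    {
        return string.Join(" ", Encoding.UTF8.GetBytes(text)
            .Select(b => Convert.ToString(b, 2).PadLeft(8, '0')));
    }
}
```

For example, Utf8Bits.ToBinary("Ć") returns "11000100 10000110", which matches the table above.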
Unit tests
The TrimToByteLength() method was tested using the following parameterized unit tests. These exercise the main scenarios, including what happens when multi-byte sequences are chopped apart.
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass()]
public class TrimToByteLengthTests
{
    [DataRow(null)]
    [DataRow("")]
    [TestMethod()]
    public void WhenEmptyOrNull_ReturnsAsIs(string input)
    {
        //act
        var actual = input.TrimToByteLength(10);

        //assert
        Assert.AreEqual(input, actual);
    }

    [DataRow("a")] //1 byte
    [DataRow("Ć")] //2 bytes
    [DataRow("ꦀ")] //3 bytes - Javanese
    [DataRow("𒀃")] //4 bytes - Sumerian cuneiform
    [DataRow("a𒀃")] //5 bytes
    [TestMethod()]
    public void WhenSufficientLengthAlready_ReturnsAsIs(string input)
    {
        //act
        var actual = input.TrimToByteLength(byteLength: 5);

        //assert
        Assert.AreEqual(input, actual);
    }

    [DataRow("abc", 1, "a")] //3 bytes, want 1
    [DataRow("abĆ", 2, "ab")] //4 bytes, want 2
    [DataRow("aꦀ", 1, "a")] //4 bytes, want 1
    [DataRow("a𒀃c", 5, "a𒀃")] //6 bytes, want 5
    [DataRow("aĆ𒀃", 3, "aĆ")] //7 bytes, want 3
    [TestMethod()]
    public void WhenStringHasTooManyBytes_ReturnsTrimmedString(string input, int byteLength, string expectedTrimmedString)
    {
        //act
        var actual = input.TrimToByteLength(byteLength);

        //assert
        Assert.AreEqual(expectedTrimmedString, actual);
    }

    [DataRow("Ć", 1, "")] //2 byte char, cut in half
    [DataRow("ꦀ", 2, "")] //3 byte char, cut at 3rd byte
    [DataRow("ꦀ", 1, "")] //3 byte char, cut at 2nd byte
    [DataRow("𒀃", 3, "")] //4 byte char, cut at 4th byte
    [DataRow("𒀃", 2, "")] //4 byte char, cut at 3rd byte
    [DataRow("𒀃", 1, "")] //4 byte char, cut at 2nd byte
    [DataRow("a𒀃", 2, "a")] //1 byte + 4 byte char. Multi-byte cut in half
    [TestMethod()]
    public void WhenMultiByteCharSequenceIsCutInHalf_ItAndReplacementCharAreTrimmedOut(string input, int byteLength, string expectedTrimmedString)
    {
        //act
        var actual = input.TrimToByteLength(byteLength);

        //assert
        Assert.AreEqual(expectedTrimmedString, actual);
    }
}