Calculating String length and width –
Fun with Unicode
Let's calculate string length in Rust¹! How many characters are really in a string, and how much space does that string take up when displayed?
¹: This article may also apply to other languages. This article will only focus on strings with the Rust default UTF-8 encoding. There's an appendix for how it works in Ruby at the end of the article. I'm simplifying this content to keep the article short.
String.len()
The first function you'll probably come across is String.len() / str.len(), or length of string. Given the string "abc" it returns a length of three. All looks good so far.
"abc".len() // => 3
That is, until we take a closer look at the docs for this function. It says the following:
Returns the length of this string, in bytes, not chars or graphemes. In other words, it might not be what a human considers the length of the string.
Source: String.len()
The Rust docs are giving us a warning here that it may not always return the number we'd expect. It will return the string length in bytes, and it sounds like not all characters are counted as one byte.
Let's try something that's not just plain "a" through "z", but something like a character with an accent.
"é".len() // => 2 bytes
We can see here that the result is a larger number than what we consider the string length to be. A lot of characters are made up of multiple bytes. There are only so many characters we can represent with a single eight-bit byte: 256 at most, and ASCII only defines 128 of them. All the characters of all the languages in the world simply don't fit in that range, so UTF-8 uses up to four bytes per character. Let's try another approach.
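To see where the two bytes come from, here is a small sketch that prints the raw UTF-8 bytes of a string. The helper name `utf8_bytes` is made up for this example, and the accented character is written as the escape `\u{e9}` so it is unambiguous which code point is meant.

```rust
/// Hypothetical helper: return the raw UTF-8 bytes of a string.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

fn main() {
    // 'a' is part of ASCII and fits in a single byte.
    println!("{:x?}", utf8_bytes("a")); // [61]

    // 'é' (U+00E9) falls outside ASCII and needs two bytes in UTF-8.
    println!("{:x?}", utf8_bytes("\u{e9}")); // [c3, a9]

    // `len()` simply counts those bytes.
    println!("{}", "\u{e9}".len()); // 2
}
```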
Chars.count()
Rust's str.chars() method returns a "Chars" iterator that walks over the string one character at a time. Counting the items it yields gives a more accurate result.
"abc".chars().count() // => 3 characters
"é".chars().count() // => 1 character
When we ask the str.chars() method to give us a breakdown of what it considers characters, we get a pretty good result. The character with the accent is seen as one character.
"é".chars() // => Chars(['é'])
We've learned that characters can be composed of multiple bytes. Relying on the string byte length for the actual length is not accurate enough.
Will Chars always return an accurate string length? This depends on your use-case. If you're looking for the number of characters in a string, probably yes, but also no. If you're looking for the actual string display width as rendered, the size it takes up on screen, then very much no.
If we look at the documentation again it will give us another warning:
It's important to remember that char represents a Unicode Scalar Value, and might not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.
Source: str.chars()
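A case that makes this warning concrete, even without emoji, is an accented letter written in its decomposed form: a plain "e" followed by a combining acute accent (U+0301). It looks the same as the precomposed "é" (U+00E9), but it consists of two scalar values. The sketch below uses only the standard library; `scalar_values` is a hypothetical helper name.

```rust
/// Hypothetical helper: list the Unicode scalar values in a string as "U+XXXX".
fn scalar_values(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    let precomposed = "\u{e9}"; // é as a single code point
    let decomposed = "e\u{301}"; // e + combining acute accent

    println!("{:?}", scalar_values(precomposed)); // ["U+00E9"]
    println!("{:?}", scalar_values(decomposed)); // ["U+0065", "U+0301"]

    // Both look like one character, but `chars()` disagrees for the second.
    println!("{}", precomposed.chars().count()); // 1
    println!("{}", decomposed.chars().count()); // 2
}
```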
Graphemes
What if the string being checked doesn't only contain numbers and letters, accented or not? Emoji are very popular nowadays and show up in every kind of string. An emoji can take up much more than two bytes, and it can also consist of multiple characters even though it looks like one object.
// Person emoji
"🧑".len() // => 4 bytes
"🧑".chars().count() // => 1 character
// Woman scientist emoji
"👩🔬".len() // => 11 bytes
"👩🔬".chars().count() // => 3 characters
// Family: Man, Man, Girl, Boy emoji
"👨👨👧👦".len() // => 25 bytes
"👨👨👧👦".chars().count() // => 7 characters
The "woman scientist" emoji shown above takes up eleven bytes and three characters. The family takes up 25 bytes and seven characters, even though we only see one item in the string.
If we print the list of characters we'll get a better idea of what is happening.
"👩🔬".chars()
// => Chars(['👩', '\u{200d}', '🔬'])
"👨👨👧👦".chars()
// => Chars([
// '👨',
// '\u{200d}',
// '👨',
// '\u{200d}',
// '👧',
// '\u{200d}',
// '👦'
// ])
An emoji can consist of multiple characters that together construct another emoji and tell computers what to render. In the example above, we first see the "woman" emoji, then what is called a "Zero Width Joiner" character, and finally the microscope emoji. For the family we see the whole list of different genders and ages joined together in the same way.
The Zero Width Joiner character used by these combined emoji is, as the name describes, a Unicode character with zero width (which makes it invisible). This character is used to join together multiple emoji to make a new one.
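The joining can be reproduced by hand. The following sketch builds the two emoji from their parts using only the standard library, with every code point written as an escape so the invisible joiner cannot get lost. `join_emoji` and the `ZWJ` constant are names made up for this example.

```rust
/// The Zero Width Joiner code point.
const ZWJ: char = '\u{200d}';

/// Hypothetical helper: join emoji code points with a ZWJ between each pair.
fn join_emoji(parts: &[char]) -> String {
    let mut joined = String::new();
    for (index, part) in parts.iter().enumerate() {
        if index > 0 {
            joined.push(ZWJ);
        }
        joined.push(*part);
    }
    joined
}

fn main() {
    // Woman (U+1F469) + ZWJ + microscope (U+1F52C)
    let scientist = join_emoji(&['\u{1f469}', '\u{1f52c}']);
    // 4 + 3 + 4 bytes, 3 scalar values
    println!("{} bytes, {} chars", scientist.len(), scientist.chars().count());

    // Man, man, girl, boy
    let family = join_emoji(&['\u{1f468}', '\u{1f468}', '\u{1f467}', '\u{1f466}']);
    // 4 * 4 bytes for the emoji + 3 * 3 bytes for the joiners, 7 scalar values
    println!("{} bytes, {} chars", family.len(), family.chars().count());
}
```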
These additional emoji characters do mess up our string length calculation. Luckily we can use the unicode-segmentation Rust crate to do more accurate character counting. We can ask the crate to split the string into something called graphemes and return them as a list. A grapheme cluster in this context is a group of Unicode code points that together form one "user-perceived character", what we consider a character when we type it.
use unicode_segmentation::UnicodeSegmentation;
"abc".graphemes(true).count() // => 3
"é".graphemes(true).count() // => 1
"🧑".graphemes(true).count() // => 1
"👩🔬".graphemes(true).count() // => 1
"👨👨👧👦".graphemes(true).count() // => 1
(The true argument given to the graphemes function means we want to split the string using Unicode's extended grapheme clusters and not the legacy clusters. Let's use the extended, non-legacy clusters.)
Finally! We have an accurate string length. Right?
Yes. For the scenarios in which you want to know the number of characters in a string, I'd say yes.
But...
Grapheme widths
As you can see in the code example below, emoji aren't one column wide. Most emoji are rendered with a display width of two columns, as wide as two Latin characters like A through Z. For illustration purposes I'll be using a monospaced font so that the Latin characters have a predictable width.
// These two lines should render with the same width
"ab"
"🧑"
While the grapheme count method (shown in the previous section) will give you a correct string length in terms of how many characters you see, it's not always the same as the horizontal space, in columns, that a string takes up.
In my Lintje project (a Git linter) I ran into this string length versus display width problem for the program's rules and output formatting. The terminal output didn't align properly when a string in a Git commit message included an emoji. In the example output below, the ^ marker should be aligned with the last character of the string, but with an emoji in the string it wouldn't be properly aligned.
# Badly aligned output
String with a ❌ emoji
^ End of sentence
# Well aligned output
String with a ✅ emoji
^ End of sentence
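To make the misalignment concrete, here is a minimal sketch of how such a marker line could be padded. This is not Lintje's actual code: `marker_line` is a hypothetical helper, and the display width of four columns is an assumption based on the emoji being rendered two columns wide.

```rust
/// Hypothetical helper: build a line whose `^` sits under the last column
/// of a string that is `display_width` columns wide.
fn marker_line(display_width: usize) -> String {
    // The marker itself occupies the last column, so pad one column less.
    let padding = display_width.saturating_sub(1);
    format!("{}^ End of sentence", " ".repeat(padding))
}

fn main() {
    // "ab" followed by the person emoji (U+1F9D1).
    let text = "ab\u{1f9d1}";

    // Padding by scalar value count puts the marker one column too early
    // when the emoji is drawn two columns wide.
    let char_count = text.chars().count(); // 3
    println!("{}", text);
    println!("{}", marker_line(char_count));

    // Assumption: the terminal renders this string four columns wide.
    let assumed_display_width = 4;
    println!("{}", text);
    println!("{}", marker_line(assumed_display_width));
}
```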
For rules that checked string length, I decided to count the display width of how every character is rendered as the string length. If a string has a max length of 50 characters, because of readability and width constraints, it should not be allowed to jam that string full with 50 emoji. It would take up double the horizontal space of a string with the same length without emoji.
Using the unicode-width Rust crate we can calculate a more accurate string display width.
use unicode_width::UnicodeWidthStr;
UnicodeWidthStr::width("🧑") // => 2
UnicodeWidthStr::width("👩🔬") // => 4
// Consists of 👩, a Zero Width Joiner and 🔬
UnicodeWidthStr::width("👨👨👧👦") // => 8
// Consists of multiple face emoji and Zero Width Joiner characters
That is the sum of the display widths that Unicode defines for every character in each grapheme cluster. As you can see, emoji joined with Zero Width Joiner characters are reported as being as wide as the number of emoji they join, times two.
There are also "emoji" that only return a display width of one column, not two. These one column wide characters are considered to have a "narrow" display width. The two column wide characters are "wide". This difference is something you can see in your terminal and other applications, as a narrow emoji sometimes overlaps with the next character. This will not be visible in the browser you're reading this in.
"❤️".len() // => 6 bytes
"❤️".chars().count() // => 2 characters
"❤️".chars() // => Chars(['❤', '\u{fe0f}'])
UnicodeWidthStr::width("❤️") // => 1 display width
The heart emoji ❤️ is a "Heavy black heart" character ❤ from the dingbat collection combined with what's called a "Variation Selector-16" character. This is another invisible character in grapheme clusters, used to indicate that the preceding character should be displayed as its emoji counterpart.
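The effect of the selector on the byte and character counts can be checked with the standard library alone. In this sketch the heart is written with escapes so the invisible selector is explicit. `requests_emoji_presentation` is a hypothetical helper that only checks whether the selector is present anywhere in the string, nothing more.

```rust
/// Variation Selector-16 asks for the emoji presentation of the preceding character.
const VARIATION_SELECTOR_16: char = '\u{fe0f}';

/// Hypothetical helper: does the string contain a Variation Selector-16?
fn requests_emoji_presentation(s: &str) -> bool {
    s.contains(VARIATION_SELECTOR_16)
}

fn main() {
    let text_heart = "\u{2764}"; // Heavy black heart on its own
    let emoji_heart = "\u{2764}\u{fe0f}"; // Heart + Variation Selector-16

    // The selector adds three bytes and one scalar value, but nothing visible.
    println!("{} bytes, {} chars", text_heart.len(), text_heart.chars().count()); // 3 bytes, 1 chars
    println!("{} bytes, {} chars", emoji_heart.len(), emoji_heart.chars().count()); // 6 bytes, 2 chars

    println!("{}", requests_emoji_presentation(text_heart)); // false
    println!("{}", requests_emoji_presentation(emoji_heart)); // true
}
```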
And the actual width is?
Why is this happening? From what I understand the unicode-width library is following the Unicode reference to the letter, counting only the base character's display width and not the emoji variation. That results in output that I would not have expected. It's not something that can be fixed, unless you change something in Unicode first.
Unfortunately I don't have another function or library to call on to fix this and return what we see as the actual display width.
What I did in my Lintje project was first split up the string into grapheme clusters, and then scan every cluster for Zero Width Joiner characters. If a cluster contains one, the whole cluster counts as a width of "two" instead of the sum of its parts. This is somewhat closer to what we see displayed, but it's not 100% accurate. It's good enough for my purpose right now. There probably are better ways out there to do this, but I fear that it will mean maintaining a list of the display widths of all emoji, or at least of the exceptions with a display width of one. Let me know if you know of a better solution!
// Very simplified example
let string = "👨👨👧👦";
let mut width = 0;
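// Requires the UnicodeSegmentation and UnicodeWidthStr traits to be in scope
for grapheme in string.graphemes(true) {
    if grapheme.contains('\u{200d}') {
        // A cluster joined by a Zero Width Joiner is counted as one wide emoji
        width += 2;
    } else {
        width += UnicodeWidthStr::width(grapheme);
    }
}
width // => 2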
The full implementation with additional checks for other modifier characters can be found in the Lintje GitHub project's utils module.
In conclusion
We've learned that what we humans consider string length is not the byte length. Characters can be multiple characters joined together, like emoji. Not all characters have the same display width. Some characters take up more horizontal space than others and some don't even take up any horizontal space.
So what should you use to calculate string length or display width?
- Do you want the length of a string in bytes? Use String.len().
- Do you want the number of characters in a string? The Chars.count() method is an option, but I suggest the next solution.
- Do you want the visible number of characters in a string? Use str.graphemes(true).count().
- Do you want the somewhat accurate width of the string? Use UnicodeWidthStr::width(str).
- Are you like me and want the string display width calculated more accurately? You'll have to write something yourself, I'm afraid.
Appendix: Ruby
I write a lot of Ruby code, so here's how the above applies to Ruby, as much as I looked into it.
Ruby's String#length returns the number of characters, like Rust's Chars.count(). If you want the length in bytes you need to call String#bytesize.
"abc".length # => 3 characters
"abc".bytesize # => 3 bytes
"é".length # => 1 character
"é".bytesize # => 2 bytes
Calling the length on emoji will return the individual characters as the length. The 👩🔬 emoji is three characters and eleven bytes in Ruby as well.
"👩🔬".length # => 3 characters
"👩🔬".bytesize # => 11 bytes
Do you want grapheme clusters instead? You're in luck: it's built into Ruby with String#grapheme_clusters.
"👩🔬".grapheme_clusters.length # => 1 cluster
To calculate the display width, we can use the unicode-display_width gem. The same multiple counting of emoji in the grapheme cluster still applies here.
require "unicode/display_width"
Unicode::DisplayWidth.of("👩🔬") # => 4
Unicode::DisplayWidth.of("❤️") # => 1