We programmers mostly fly by the seat of our pants when it comes to writing simple stuff. For simple things, we have a strong set of assumptions instead of specific knowledge of how things work. These are assumptions such as knowing that if `b = a + 1`, then `b` is greater than `a`, or that if we `malloc` some buffer, we now have the requested amount of memory we can write on. We don't go and look at the specifications for each and every small thing we do.
We do this because checking everything would slow us down. But if we did check, we'd find out that we're usually wrong in our assumptions. There are numeric overflows, and then `a + 1` might be a lot less than `a`. Sometimes `malloc` will give us a `null` instead of a buffer, and then we're hosed.
Usually, we have to be bitten by these issues before we update our assumptions even a little. And even then, we usually correct them in broad strokes. After having a nasty overflow bug, we might correct our assumption about integers to "`a + 1` is greater than `a` unless there's a chance that `a` is a very big number". And we work based on that instead of having any precise rules in our minds about how overflows work.
Adjusted assumptions are called experience. They make you faster and correct more often. Sometimes, however, we relocate some topic, like the proper handling of `malloc`, entirely from our internal category of 'easy stuff' to our internal category of 'complex stuff'. And then we might actually go and look up how it works.
For beginners, Strings are the archetypal example of 'easy stuff'. Firstly, most of us learned letters and numbers as children, and these feel very familiar to us. Secondly, when learning to program, most of us did lots of programming exercises using Strings, because they are probably the only interesting pre-built data type in most languages; when programming with Strings, we feel quite confident about how they work. Thirdly, we might have a good number of assumptions related to the functioning of some simple character set, like ASCII or ISO-8859-1 — either because we're that old, or because our teachers were that old. Those were character sets of simpler times!
Now, in reality, Strings are very complicated things. Contrast them, for example, with your usual `Int`, found in any language. We know and understand its representation (64 bits, two's complement) [1] and we understand its semantics (behaves like a number, except if too large or too small). For Strings, we used to know the representation (one byte per character; check the ASCII table for which character it is), but we hardly ever know the semantics. Our String could contain our customer's name. It could contain a number, a bit of JSON, or even an SQL statement.
Strings are the ultimate `Any` type, and chances are that if there is no ready-made representation for some item in a program, it will be stored and operated on as a String. Regardless of whether you have dynamic or static types, this throws all type safety to the wind. To compound matters, many of the things we use Strings for are bloody dangerous, like SQL or HTML. And for that reason, SQL injections and cross-site scripting lead the top vulnerability lists year after year.
But at least we understand how Strings work, right? You know, Strings? We know how to concatenate, change case, and so on?
Understanding Strings is a lot harder now than it was around the year 2000. We have been transitioning to Unicode for a few decades already, and it's been several years since I've heard anyone complain that their characters weren't being displayed right [2].
While otherwise being awesome, Unicode effectively wipes out most of our 'useful' assumptions about how Strings really work, but we haven't been very verbal about the situation. Unfortunately, many of us are probably still working with outdated assumptions about how Strings work. And to make matters worse, many of us no longer understand the memory representation of Strings either [3].
Next, let's go through some of my old assumptions that I needed to throw out along with the ISO-8859-1 character set. This is surely not an exhaustive list, but hopefully it is enough to kick (Unicode) Strings out of your mental compartment of 'simple things'.
In the olden days of ASCII, each character could be represented with seven bits, making it easy to size buffers and scan memory. With Unicode, this is a terrible assumption. Let’s walk through one arbitrary example to show why.
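The gap between characters and bytes shows up immediately in any language that exposes encodings. A quick, purely illustrative Python check:

```python
# One character is no longer one byte: 'é' is a single code point,
# but its UTF-8 encoding takes two bytes (other characters take up to four).
s = "\u00e9"  # 'é'

print(len(s))                  # 1 code point
print(len(s.encode("utf-8")))  # 2 bytes
```

So any buffer sized by counting characters is wrong the moment a non-ASCII character arrives.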
At some point, WordPress devs were fighting to stop SQL injections from happening. One example issue they were trying to fix was someone maliciously adding single quotes in the user input and rewriting their database with it. Something like this imaginary example:
```sql
select 1 from accounts where user = '%s' and password = '%s'

-- User supplies "whocares' or true --" as the password:

select 1 from accounts where user = 'Avery' and password = 'whocares' or true -- '
-- And now everyone can log in as Avery!
```
Now, the simplest imaginable way to solve this is to properly encode the single quote in the user input [4]. That is, each single quote `'` must be encoded as `\'`, or backslash-single quote.
PHP devs then wrote the `addslashes` function, and everything was okay for a while. The only problem was that they did the escaping byte by byte and not character by character. The devs were also blind to the problem, as they only worked with single-byte Unicode characters (mostly old ASCII). Then someone figured out that if you fed the system a String like `"뼧 or true -- "`, you'd get the SQL injection again.
To understand why, let's look at how these characters are represented. The single quote `'` is the byte `27`, the backslash `\` is `5c`, and 뼧 is the code point U+BF27, whose encoding contains the byte `27`. What `addslashes` actually did was to replace every byte with value `27` with the bytes `5c 27`. So `"뼧 or true -- "` turned into `"뽜' or true -- "`: the escaping byte `5c` fused with the leading byte of 뼧 into a different character, 뽜 (U+BF5C), leaving the quote behind unescaped — and again, there were SQL injections.
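The fused-byte mechanism can be reproduced in a few lines of Python (the UTF-16 framing follows the article's schematic example; `addslashes` itself is a PHP function):

```python
# 뼧 is U+BF27; in UTF-16 big-endian that is the bytes bf 27,
# and 27 is also the byte value of the single quote.
first_char = "\ubf27".encode("utf-16-be")           # b'\xbf\x27'

# A byte-wise addslashes: replace every 27 byte with 5c 27.
escaped = first_char.replace(b"\x27", b"\x5c\x27")  # b'\xbf\x5c\x27'

# The backslash byte 5c fused with bf into a different character,
# U+BF5C (뽜), leaving the quote byte 27 dangling for the injection.
print(escaped[:2].decode("utf-16-be"))              # 뽜
```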
It is not hard to imagine other similar disasters.
In ASCII, many of the common String-processing operations were invariant with regard to the length of the String. This is not so with Unicode. And though this property is probably relevant only if you're manually allocating buffers or trying to size up graphics, let's look at a few cases where String lengths change unexpectedly.
Firstly, to pick a common String operation as an example, does `length(x) = length(toUpper(x))` hold for Unicode `x`? No, since Unicode has, among other things, ligature characters such as `ﬁ`, which expand two-fold to `FI` when uppercased.
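You can check this in any language with Unicode-aware case mapping; in Python, for instance:

```python
word = "\ufb01le"       # "ﬁle": starts with the single ligature character U+FB01
upper = word.upper()    # "FILE": the ligature expands into two letters

print(len(word), len(upper))  # 3 4
```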
The second example concerns normalization. Since there are multiple code points for the same character, Unicode forces you to normalize so that two users don't, for example, end up with screen names that are distinct but render identically. One would guess that normalization — the process of picking a canonical representation for some sequence of characters — would not affect the number of characters, but it indeed does: the single character `ﷺ` expands 18-fold into `صلى الله عليه وسلم`.
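Python's `unicodedata` module shows the expansion (NFKC is one of the standard normalization forms):

```python
import unicodedata

ligature = "\uFDFA"  # ﷺ, a single code point
normalized = unicodedata.normalize("NFKC", ligature)

print(len(ligature), len(normalized))  # 1 18
```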
So, it is probably better not to assume anything about lengths of Strings after any operation.
We who lived with variants of ASCII tend to make heavy use of upper- and lowercasing operations. Besides these operations now being able to change the length of a String, there are some additional sharp edges. Most importantly, the old assumption that upper- and lowercase letters are in one-to-one correspondence is lost.
With Unicode, changing the case of a String can lose more information than merely what case the characters were in. For example, if you lowercase the Kelvin symbol `K`, you get an ordinary lowercase `k` back, with no way of converting it back. This has, surprisingly, a lot of relevance when doing case-insensitive comparisons, since `toLower('K') == toLower('k')` but `toUpper('K') != toUpper('k')`.
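In Python terms, where the Kelvin sign is U+212A:

```python
kelvin = "\u212A"  # K, the Kelvin symbol — renders just like Latin K

print(kelvin.lower())                 # 'k': an ordinary Latin small k
print(kelvin.lower() == "k".lower())  # True: equal once lowercased...
print(kelvin.upper() == "k".upper())  # False: the Kelvin sign stays itself
```

So a case-insensitive comparison gives different answers depending on which direction you fold.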
One old assumption does still hold: the byte 0x20 represents a space in Unicode. But so do U+2000, U+2001, and U+2002, among many others, including the zero-width no-break space U+FEFF. Whitespace is special. We can't allow screen names like "TheAlex" and "TheAlex" at the same time, because HTML will not show that whitespace and other users wouldn't be able to tell the difference. So we must remove leading and trailing whitespace before processing.

And now, Unicode makes it possible to screw up royally here. All it takes is one spot in the code where someone forgets about the many kinds of whitespace, and we end up with unnormalized data in our database. And things start to fail here and there.
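Even your standard library's idea of 'whitespace' may not match yours. A Python sketch of the trap:

```python
# str.strip() removes everything Python classifies as whitespace,
# which covers U+2000..U+2002 and friends...
print("\u2002TheAlex\u2002".strip() == "TheAlex")  # True

# ...but NOT the zero-width no-break space U+FEFF, which is invisible
# on screen yet survives the trim:
print("\uFEFFTheAlex".strip() == "TheAlex")        # False
```

Any code path that trims with one definition of whitespace while another path uses a different one leaves exactly the unnormalized data described above.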
Unlike ASCII, Unicode has multiple code points for the same character, and multiple characters that look nearly, or completely, identical without being the same character. As a concrete example, paste `"tyрeablе" == "typeable"` into your favourite REPL [5]. False? That is because the р is not a p but the Cyrillic character for the 'er' sound.
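You can unmask the impostor characters with Python's `unicodedata`:

```python
import unicodedata

fake, real = "ty\u0440eabl\u0435", "typeable"

print(fake == real)                # False — they only *render* the same
print(unicodedata.name("\u0440"))  # CYRILLIC SMALL LETTER ER
print(unicodedata.name("\u0435"))  # CYRILLIC SMALL LETTER IE
```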
As to why this is a problem, let’s take this bit of our database schema as an example:
```
"uniq_address" UNIQUE CONSTRAINT, btree (country, city, address)
"uniq_name" UNIQUE CONSTRAINT, btree (name)
```
I would posit that in the Unicode era, these constraints make no sense at all. In their input, a user is free to mimic whatever address or name they want. This allows the user to attempt all kinds of heists by, say, having the same-looking screen name as someone else. Also, things like addresses don't stay digital. Sooner or later, an address is going to be read or printed, and then the difference the database was so keen to notice will be gone. Is there anything analog in your process that could be exploited by pretending to be another user?
This problem certainly preceded Unicode, especially in some character sets like ISO-8859-5, but Unicode makes it much worse and much more widely applicable. Getting down to it, you can hardly assume anything about what the string is going to l̷o̵o̷k̵ ̶l̴i̴k̵e̷.
Quickly, what would happen if I pasted this to my terminal?
```
rm -rf your_home_directory # dlrow olleh ohce
```
I dare you to try it for yourself. You can paste it into any reasonably dumb thing instead of your terminal, if you care about your home directory.
Some languages are not written from left to right, and to accommodate them, Unicode has 'flip the direction of writing' codes. The actual text is the same even though it is displayed from right to left, so your terminal probably would have tried to wipe your files if you had run my example.

Besides letting me mess with my colleagues on Teams, bidirectional writing has been used for quite a few hoaxes, a longtime favourite being flipping long URLs backwards so they look innocent.
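A sketch of why the trick works, using U+202E (the right-to-left override); the exact placement of the override here is illustrative:

```python
# What the shell *executes* is the code-point order, not the display order.
# Everything after U+202E is merely *rendered* right-to-left, so the line
# can look like a harmless echo while actually starting with rm.
line = "rm -rf your_home_directory \u202e# dlrow olleh ohce"

print(line.startswith("rm -rf"))  # True, however innocent it renders on screen
```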
One of the things we happily assumed with ASCII (and its variants) was that decoding was trivial and unlikely to go wrong. Some of my university colleagues can read ASCII fluently from hex dumps! This meant that the only problem when transmitting data as Strings was correctly parsing the contents of the String.

Unicode, with its multi-byte encodings, adds another step: you must first parse the String itself before you can get started on the content.
Now, parsing is one of the problem areas known to cause security issues. One of the key problems is that the same String may get parsed differently in one program than in another. A good contemporary example of this is an HTML sanitizer (a thing that stops XSS) speaking a slightly different dialect of HTML than the browser the user is using. If these two disagree on the interpretation of some String, the sanitizer might judge it to be free of scripts and other malicious items, while the browser interprets things slightly differently and starts executing bits of the input as scripts. [6]
This is exacerbated by Unicode, since not all Unicode parsers agree on all sequences of bytes. Mostly, it is the illegal sequences that get handled differently. For example, `e3 80 22` is an invalid byte sequence in UTF-8: one parser might judge it to be a single illegal character, while a more lax one could interpret it as three characters, the last of them being `"`. To put this into a web context, that trailing double quote could be a problem, since it would allow XSS through attribute values.
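Python's own decoder can play both roles, depending on its error-handling mode:

```python
data = b"\xe3\x80\x22"  # invalid UTF-8, ending in the byte for "

# A strict parser rejects the whole sequence outright:
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False

# A lax parser substitutes the broken prefix and keeps the trailing
# quote — exactly the character an attacker wants past a sanitizer:
lenient = data.decode("utf-8", errors="replace")
print(lenient.endswith('"'))  # True
```

If the sanitizer decodes strictly and the browser decodes laxly (or vice versa), they are no longer looking at the same String.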
As a software engineer, Unicode puts a lot of complexity on my table, much of which I really don't need. The individual gotchas listed above are not so hard to handle by themselves, but the effect their presence has on the whole system can be significant. Now you need to decide what kind of strings you allow in your system; you need to figure out how to properly normalize them, how to eliminate homoglyphs, and how to strip and trim whitespace.
The problem with this is that all such things must happen uniformly. If one bit of your program normalizes a String in a certain way and some other bit does it differently, you have an inconsistency at best or a security issue at worst. And because mistakes happen, you should also try to record precisely what has been done to each String, so you can take that into account when using it.
And, unfortunately, no, you cannot just 'fix your Strings' at every use point. Some String operations are only safe to do once, or you will lose information, or worse. You need to know and track the semantics of your Strings to know which steps you need, and which steps you must not take, in the context you are working in.
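As one illustration of doing this once, uniformly, at the boundary, here is a hypothetical normalizer in Python. The function name and the exact policy (NFC plus trimming) are my own assumptions for the sketch, not a complete recipe:

```python
import unicodedata

def normalize_screen_name(raw: str) -> str:
    # NFC-normalize so composed and decomposed forms compare equal,
    # then trim Unicode whitespace. A sketch only: a real policy would
    # also have to handle zero-width characters and homoglyphs.
    return unicodedata.normalize("NFC", raw).strip()

# 'é' typed as one code point and as 'e' + combining acute now match:
print(normalize_screen_name("Caf\u00e9 ") == normalize_screen_name(" Cafe\u0301"))  # True
```

The point is less the particular policy than that every entry point into the system calls the same function, so the database only ever sees one canonical form.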
Addendum: I'm a bit of a sloppy writer, so I feel I need to reiterate the original point, lest I come off as some ASCII fan.
Some people read this as an argument against Unicode, which it is not. I don't want to go back to ISO-8859-1, because that sucks. Also, I'm ready to deal with a lot of complexity to let people write their names properly. What I tried to argue here is that working with Unicode is necessarily more complex than working with ASCII was, and that I see people going about with lots of beliefs about String processing that belong to the ASCII era and do not work with Unicode.
Some of the examples are low-level, some are dated, but some, like homoglyphs, are ubiquitous. Whether they are relevant for you depends on what kind of work you do and which language you use.
(Also, in the first example, consider UTF-16 and PHP not having null-terminated strings)
[1] Or, we can spend 15 minutes in Wikipedia learning it.↩︎

[2] Printing them is another matter. I hope this will be solved by the 22nd century.↩︎

[3] Admittedly, I don't, really.↩︎

[4] But that is simple in imagination only. Don't.↩︎

[5] repl.it is handy if you have none at hand.↩︎

[6] Using the same channel for control and content must be worth more than the billion-dollar mistake of including null in programming languages!↩︎