Posted on August 17, 2021 by Ville Tirronen

The 7 assumptions about strings you probably have

How Unicode erases most of our assumptions on how Strings actually work

We programmers mostly fly by the seat of our pants when it comes to writing simple stuff. For simple things, we have a strong set of assumptions instead of specific knowledge of how things work. These are assumptions such as knowing that if b = a + 1, then b is greater than a, or that if we malloc some buffer, we now have the requested amount of memory to write to. We don’t go and look at the specifications for each and every small thing we do.

We do this because checking everything would slow us down. But if we did check, we’d find out that we’re usually wrong in our assumptions. There are numeric overflows, and then a + 1 might be a lot less than a. Sometimes malloc will give us a null instead of a buffer and we’re hosed.

We usually have to be bitten by these issues before we update our assumptions even a little bit. And even then, we usually correct them in broad strokes. After having a nasty overflow bug, we might correct our assumptions on integers to “a + 1 is greater than a unless there’s a chance that a is a very big number”. And we work based on that instead of keeping any precise rules for how overflows work in our minds.

Adjusted assumptions are called experience. They make you faster and correct more often. However, we might relocate some stuff, like proper handling of malloc, entirely from our internal category of ‘easy stuff’ to our internal category of ‘complex stuff’. And then we might actually go and look up how it works.

About Strings

For beginners, Strings are the archetypal example of ‘easy stuff’. Firstly, we most likely learned letters and numbers as children and they feel very familiar to us. Secondly, when learning to program, most of us have done lots of programming exercises using Strings, because they are about the only interesting pre-built data type in most languages. We feel quite confident about how Strings work when programming with them. Thirdly, we might have a good number of assumptions related to the functioning of some simple character set, like ASCII or ISO-8859-1, either because we’re that old or because our teachers were that old. Those were the character sets of simpler times!

Univac 1050-II, 1964, first computer using ASCII.

Now, in the real world, Strings are a very complicated thing. Contrast them with, for example, your usual, found-in-any-language Int. We know and understand the representation (64 bits, two’s complement) 1 and we understand its semantics (behaves like a number, except if too large or too small). For Strings, we used to know the representation (one byte per character, check the ASCII table for what character it is), but we almost never know the semantics. Our String could contain our customer’s name. It could contain a number, a bit of JSON or even an SQL statement.

Strings are the ultimate Any-type, and chances are that if there is no ready-made representation for some item in a program, it will be stored and operated on as a String. Regardless of whether you have dynamic or static types, this throws all type safety to the wind. And, to compound the problem, many of the things we use Strings for are bloody dangerous, like SQL or HTML. For that reason, SQL injections and cross-site scripting lead the vulnerability top lists year after year.

But at least we understand how Strings themselves work, right? We know how to concatenate, change case and so on, don’t we?

Unicode

Understanding Strings is a lot harder now than it was around the year 2000. We have been transitioning to Unicode for a few decades now, and it’s already been a few years since I’ve heard anyone complain that their characters aren’t displayed right 2.

While being otherwise awesome, Unicode effectively erases most of our ‘useful’ assumptions on how Strings actually work, but we haven’t been very vocal about that happening. Unfortunately, many of us are probably still working with outdated assumptions on how Strings work. And, to make it worse, many of us no longer understand the memory representation of Strings either 3.

Broken assumptions

Next, let’s go through some of my old assumptions that I needed to throw out along with the ISO-8859-1 character set. This is certainly not an exhaustive list, but hopefully it is enough to kick (Unicode) Strings out of your mental compartment of ‘simple things’.

A character is representable by single byte

In the olden days of ASCII, each character fit in seven bits, making it easy to size buffers and scan memory. With Unicode this is a terrible assumption. Let’s walk through one arbitrary example to show why.

At some point, WordPress devs were fighting to stop SQL injections from happening. One issue they were trying to fix was someone adding unwanted single quotes to the user input and messing up their database with it. Something like this imaginary example:

select 1 from accounts
where user = '%s'
    and password = '%s'

↓↓ (User supplies "whocares' or true -- " as password)

select 1 from accounts
where user = 'Avery'
    and password = 'whocares' or true -- '
-- And now everyone can log in as Avery!

Now, the simplest imaginable way to solve this is to properly encode the single quote in the user input 4. That is, each single quote ' must be encoded as \', or backslash-single-quote.

PHP devs then wrote the addslashes function and everything was well for a while. The only problem was that it did the escaping byte by byte and not character by character. The devs were also blind to the problem, as they mostly worked with single-byte characters (mostly old ASCII). Then someone figured out that if you fed the system a String like "뼧 or true -- " you’d get the SQL injection again.

To understand why, let’s look at how these characters are represented:

code      character
0xbf27    뼧
0xbf5c    뽜
0x27      '
0x5c      \

What addslashes actually did was replace every byte of value 27 with the bytes 5c 27. So "뼧 or true -- " turned into "뽜' or true -- " and, again, there were injections.
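The byte-level mechanics can be sketched in a few lines of Python. This is only an illustration under assumptions: naive_addslashes is a hypothetical stand-in that escapes nothing but the single quote, and the sketch assumes the receiving side decodes the bytes as GBK, a multi-byte encoding in which the bytes bf 5c form one character. (In UTF-8 the byte 27 never occurs inside a multi-byte sequence, so the choice of encoding matters here.)

```python
def naive_addslashes(raw: bytes) -> bytes:
    """Hypothetical byte-wise escaper: put a backslash before every 0x27 byte."""
    return raw.replace(b"'", b"\\'")


# Attacker-supplied bytes: 0xbf followed by a plain single quote (0x27).
attack = b"\xbf' or true -- "

escaped = naive_addslashes(attack)
# The quote now has a backslash (0x5c) in front of it: bf 5c 27 ...
print(escaped.hex(" "))

# Assumption: the consumer reads the bytes as GBK. There, bf 5c is ONE
# character, so the backslash is swallowed and the quote is bare again.
decoded = escaped.decode("gbk")
print(repr(decoded))
```

The escaper did exactly what it was told to do. The damage comes from the two sides disagreeing about where one character ends and the next one begins.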

It is not hard to imagine other similar disasters.
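For contrast, here is a minimal sketch of the usual alternative to escaping by hand: handing the user input to the database as a parameter, so it never becomes part of the SQL text. It uses Python’s built-in sqlite3 module and a throwaway in-memory table shaped like the imaginary accounts example above. The function name can_log_in and the stored password are made up for the illustration, and the plain-text password is there only to mirror the example, not as a recommendation.

```python
import sqlite3

# Throwaway in-memory database mirroring the imaginary example above.
conn = sqlite3.connect(":memory:")
conn.execute("create table accounts (user text, password text)")
conn.execute("insert into accounts values (?, ?)", ("Avery", "s3cret"))


def can_log_in(user: str, password: str) -> bool:
    """Hypothetical login check using placeholders instead of string pasting."""
    row = conn.execute(
        "select 1 from accounts where user = ? and password = ?",
        (user, password),  # values travel separately from the SQL text
    ).fetchone()
    return row is not None


print(can_log_in("Avery", "s3cret"))
print(can_log_in("Avery", "whocares' or true -- "))
```

With placeholders, the quote in the second call is just another character in the password and never gets a chance to end the string literal.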

String lengths are somewhat stable

In ASCII, many of the common String processing operations preserved the length of the String. This is not so with Unicode. And though this property is probably relevant only if you’re manually allocating buffers or trying to size up graphics, let’s look at a few cases where String lengths change unexpectedly.

Firstly, to pick a common String operation as an example, does length(x) = length(toUpper(x)) hold for Unicode x? No, since Unicode has, among other things, ligature characters such as ﬁ (U+FB01), which expands twofold to FI.
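A quick way to see this is Python’s built-in str.upper, which applies Unicode’s full case mappings. The snippet is just a sketch of the claim above; the German word is an extra example of mine, not from the original list.

```python
ligature = "\ufb01"          # the single character "ﬁ" (LATIN SMALL LIGATURE FI)
shouted = ligature.upper()   # becomes the two characters "F" and "I"

print(len(ligature), len(shouted))

# Another well-known case: the German sharp s has no single-letter
# uppercase in the default mapping and becomes "SS".
word = "stra\u00dfe"
print(word.upper(), len(word), len(word.upper()))
```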

The second example concerns normalization. Since there are multiple code points for the same character, Unicode forces you to do normalization so that two users don’t, for example, end up with identical screen names. One would guess that normalization, the process of picking a canonical representation for some set of characters, would not affect the number of characters, but it indeed does: under compatibility normalization (NFKC or NFKD), the single character ﷺ (U+FDFA) expands 18-fold into صلى الله عليه وسلم.
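The expansion is easy to check with Python’s standard unicodedata module. This sketch assumes the character in question is U+FDFA and that the normalization form is one of the compatibility forms (NFKC here); the canonical form NFC leaves it alone.

```python
import unicodedata

sign = "\ufdfa"  # a single code point

compat = unicodedata.normalize("NFKC", sign)     # compatibility normalization
canonical = unicodedata.normalize("NFC", sign)   # canonical normalization

print(len(sign), len(compat), len(canonical))
print(compat)
```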

So, it is probably better not to assume anything about lengths of Strings after any operation.

Upper and lowercase are somehow linked

We who lived with variants of ASCII tend to make a lot of use of upper- and lowercasing operations. Besides them now being able to change the lengths of Strings, there are some additional sharp edges. Most importantly, the old assumption that upper- and lowercase letters are in one-to-one correspondence is lost.

With Unicode, changing the case of a string can lose more information than just what case the characters were in. For example, if you lowercase the Kelvin symbol K (U+212A), you get an ordinary lowercase k back, with no way of converting it back. This has a surprising amount of relevance when doing case-insensitive comparisons, since toLower('K') == toLower('k') but toUpper('K') != toUpper('k'), where the first K is the Kelvin symbol.
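Here is the same comparison as a small Python sketch. The Kelvin symbol is written as an escape so that it cannot be silently replaced by an ordinary K along the way. The last line uses str.casefold, which is what Python offers for caseless matching; mentioning it is my addition, not something the comparison above depends on.

```python
KELVIN = "\u212a"  # KELVIN SIGN, looks like an ordinary capital K

same_when_lowered = KELVIN.lower() == "k".lower()
same_when_uppered = KELVIN.upper() == "k".upper()
print(same_when_lowered, same_when_uppered)

# The information is gone after lowercasing: going back up gives a plain K.
round_trip = KELVIN.lower().upper()
print(round_trip == KELVIN)

# For caseless comparison, casefold() treats both alike.
print(KELVIN.casefold() == "K".casefold())
```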

Reason for calling them upper and lower case letters: Uppercase ones go to the ‘upper case’.

Space is 0x20

This assumption is still true. The byte 0x20 represents space in Unicode. But so do U+2000, U+2001, U+2002 and many others, including the zero width no-break space U+FEFF. Whitespace is special. We can’t allow screen names like “TheAlex” and “TheAlex ” at the same time, because HTML will not show that whitespace and other users couldn’t tell the difference. So we must remove leading and trailing whitespace before processing.

And now, Unicode makes it possible to screw up royally here. All it takes is one spot in the code where someone forgets about the multitude of whitespace characters, and we end up with unnormalized data in our database. And then things start to fail here and there.
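How easy it is to miss one can be sketched with Python’s own idea of whitespace. str.strip and str.isspace know about the Unicode space separators, but invisible format characters such as U+200B and U+FEFF are not whitespace as far as they are concerned. The list of candidates below is a small sample picked for the illustration.

```python
import unicodedata

candidates = ["\u0020", "\u00a0", "\u2000", "\u2002", "\u200b", "\ufeff"]

for ch in candidates:
    # Show the code point, its official name and whether strip() would remove it.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.isspace())

padded = "\u2000TheAlex\u2002"      # EN QUAD in front, EN SPACE behind
sneaky = "TheAlex\u200b"            # trailing ZERO WIDTH SPACE

print(padded.strip() == "TheAlex")
print(sneaky.strip() == "TheAlex")
```

So a plain strip is a good start, but a name with an invisible format character at the end still slips through as a different String.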

Characters look different

Unlike ASCII, Unicode has multiple code points for the same character and multiple characters that look nearly, or completely, identical without being the same character. As a concrete example, paste "tyрeablе" == "typeable" to your favourite REPL 5.

Got False? That is because the p is not a p but the Cyrillic letter for the ‘er’ sound.
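If you want to see which characters are the impostors, Python’s unicodedata module can name them. In this sketch the look-alike string is spelled with escapes, assuming the look-alikes are Cyrillic er (U+0440) and Cyrillic ie (U+0435); the exact characters in a pasted string may differ.

```python
import unicodedata

genuine = "typeable"
spoofed = "ty\u0440eabl\u0435"   # assumed look-alikes: Cyrillic er and ie

print(genuine == spoofed)

# List every character that is not plain ASCII, with its official name.
impostors = [
    (index, unicodedata.name(ch))
    for index, ch in enumerate(spoofed)
    if not ch.isascii()
]
for index, name in impostors:
    print(index, name)
```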

As to why this is a problem, let’s take this bit of our database schema as an example:

"uniq_address" UNIQUE CONSTRAINT, btree (country, city, address)
"uniq_name" UNIQUE CONSTRAINT, btree (name)

I would posit that in the Unicode era, these constraints make no sense at all. This being user input, the user is free to mimic whatever address or name they want. This allows the user to attempt all kinds of heists by, say, having the same screen name as someone else. Also, things like addresses don’t stay digital. Sooner or later, an address is going to be read aloud or printed, and then the difference, which the database was keen to notice, will be gone. Is there anything analog in your process that could be exploited by pretending to be another user?

This problem certainly preceded Unicode, especially in some character sets like ISO-8859-5, but Unicode makes this much worse and more widely applicable. Getting down to it, you can assume almost nothing about how the string is going to l̷o̵o̷k̵ ̶l̴i̴k̵e̷.

Text goes from left to right

Quickly, what would happen if I pasted this into my terminal?

‮rm -rf your_home_directory # dlrow olleh ohce

I dare you to try it yourself. If you care about your home directory, you can paste it into any reasonably dumb thing instead of your terminal.

Some languages are not written from left to right, and to accommodate them, Unicode has these ‘flip the direction of writing’ codes. The actual text is the same even though it is displayed from right to left, so your terminal would probably have tried to wipe your files if you had tried my example.

Urdu script, which is written from right to left

Besides my messing with colleagues on Teams with it, bidirectional writing has been used for quite a few hoaxes, the longtime favourite being flipping long URLs backwards so they look innocent.
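One way to defend against this is to look for the direction-changing codes before displaying or executing anything. The sketch below is a minimal version of that idea using Python’s unicodedata module. The function name find_bidi_controls is made up, and the set of bidirectional classes it checks covers the embedding, override and isolate controls only, not every character that can influence text direction.

```python
import unicodedata

# Bidirectional classes of the explicit embedding, override and isolate codes.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (position, name) for every direction-changing control in text."""
    return [
        (index, unicodedata.name(ch))
        for index, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]


# The example from above, with the invisible first character spelled out.
pasted = "\u202erm -rf your_home_directory # dlrow olleh ohce"

print(find_bidi_controls(pasted))
print(find_bidi_controls("echo hello world"))
```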

Strings have the same decoding

One of the things we happily assumed with ASCII (and variants) was that the decoding was trivial and unlikely to go wrong. Some of my University colleagues can read ASCII fluently from hex dumps! This meant that the only problem when transmitting data as Strings was to correctly parse the contents of the String.

Unicode, with its multibyte encodings, adds another step. You must first decode the bytes into a String before you can get started on the content.

Now, parsing is one of the problem areas that is known to cause security issues. One of the key problems is that the same String may get parsed differently in one program than in another. A good contemporary example of this is having an HTML sanitizer (the thing that stops XSS) speak a slightly different dialect of HTML than the browser that the user is using. If the two disagree on the interpretation of some String, the sanitizer might judge it to be free of scripts and other malicious items, while the browser could interpret things slightly differently and start executing bits of the input as scripts. 6

Now, this is exacerbated by Unicode, since not all Unicode decoders agree on all sequences of bytes. Mostly, it is the illegal sequences that get handled differently. For example, “e3 80 22” is an invalid UTF-8 sequence, and one decoder might judge it to be one illegal character while another could be more lax and interpret it as three: ã, \x80 and ". To put this into a web context, the last of the three could be a problem, since it would allow XSS through attribute values.
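The disagreement is easy to reproduce even inside a single language. The sketch below feeds the same three bytes to three of Python’s built-in decoding options. It only illustrates that different policies give different Strings; it is not a claim about how any particular browser or sanitizer behaves.

```python
raw = bytes.fromhex("e38022")

# Policy 1: strict UTF-8 refuses the input outright.
try:
    raw.decode("utf-8")
    strict_result = "accepted"
except UnicodeDecodeError:
    strict_result = "rejected"

# Policy 2: UTF-8 with replacement characters for the broken part.
lenient = raw.decode("utf-8", errors="replace")

# Policy 3: treat every byte as its own character (ISO-8859-1 style).
bytewise = raw.decode("latin-1")

print(strict_result)
print(repr(lenient))
print(repr(bytewise))
```

Three policies, three different answers, and in the last two the double quote is still there for whoever consumes the result next.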

Concluding thoughts

For me as a software engineer, Unicode puts a lot of complexity on the table, much of which I really wouldn’t need. The individual gotchas listed above are not so hard to handle by themselves, but the effect their presence has on the whole system can be significant. Now you need to decide what kind of strings you allow in your system, and you need to figure out how to properly normalize them, how to eliminate homoglyphs and how to strip and trim whitespace.

The problem with this is that all such things must happen uniformly. If you normalize a String in a certain way in one bit of your program and some other bit does it differently, you have an inconsistency at best, or a security issue at worst. And because mistakes happen, you also have to try to record precisely what has been done to each String, so that you can take it into account when using them.
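One way to get that uniformity is to funnel every String of a given kind through a single function and to mark the result with its own type, so the rest of the program can tell what has already been done. The following is only a sketch under my own assumptions: the names CanonicalName and canonical_screen_name are made up, the policy (NFKC, dropping invisible format characters, trimming, case folding) is one possible choice among many, and it does nothing about homoglyphs.

```python
import unicodedata
from typing import NewType

# Hypothetical marker type: a screen name that went through the one true path.
CanonicalName = NewType("CanonicalName", str)


def canonical_screen_name(raw: str) -> CanonicalName:
    """Apply one fixed normalization policy to a user-supplied screen name."""
    text = unicodedata.normalize("NFKC", raw)
    # Policy assumption: invisible format characters (category Cf, such as
    # U+200B and U+FEFF) are not allowed in screen names at all.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    text = text.strip().casefold()
    # Case folding can undo normalization, so normalize once more at the end.
    return CanonicalName(unicodedata.normalize("NFKC", text))


print(canonical_screen_name("\u2003TheAlex\ufeff"))
print(canonical_screen_name("\u212aelvin") == canonical_screen_name("kelvin"))
```

A uniqueness check would then compare CanonicalName values only, and a type checker can flag the spots where a raw String is used in their place.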

And, unfortunately, no, you cannot just ‘fix your strings’ at every use point. Some string operations are only safe to do once, or you lose information or worse. You need to know and track the semantics of Strings to know what steps you need, and what steps you can’t take, in the context you are working in.


  1. Or we can spend 15 minutes on Wikipedia to learn it.

  2. Printing them is another matter. I hope that it will be solved in the 22nd century.

  3. Admittedly, I don’t, really.

  4. But that is simple in imagination only. Don’t.

  5. repl.it is handy if you have none at hand.

  6. Using the same channel for control and content must be worth more than the billion dollar mistake of including null in programming languages!
