When we look at a document, we see text: letters, numbers, punctuation marks, and other symbols. A computer does not see text this way. It stores and processes everything as binary data, a series of ones and zeros, so the characters that make up text must be represented as numbers before a computer can handle them. Encoding is the process of converting text into that numeric form.
Each character is assigned a number, and in the industry the term for these numbers is “code points.” Languages with a greater number of characters require more code points to denote them. Whether a language counts as easy or tricky to encode depends on the number of “bytes” it takes to represent its full character set. A byte is simply a unit used to measure quantities of computer information, and it’s equal to eight bits. Now that you know some of the basics, let’s take a look at which languages are easy to encode and which ones are a bit trickier.
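To make the idea of "characters as numbers" concrete, here is a minimal Python sketch. The sample string and variable names are illustrative assumptions; `ord` and `format` are standard built-ins.

```python
# Look up the code point (number) behind each character,
# then show that number as the eight bits of one byte.
text = "Hi!"

code_points = [ord(ch) for ch in text]
bits = [format(cp, "08b") for cp in code_points]

for ch, cp, b in zip(text, code_points, bits):
    print(f"{ch!r} -> code point {cp} -> bits {b}")
```

Each of these three characters has a code point below 256, which is why each one fits in a single eight-bit byte.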
Most languages use an alphabet with a limited set of letters, punctuation marks, special characters, and text symbols. For languages like these, one byte is plenty to distinguish between every possible character. One little byte can take 256 different values, which is enough for the alphabets of English, French, Italian, German, and Spanish combined! One byte can also represent the alphabet of Russian, Greek, Turkish, Arabic, or Hebrew, each in its own single-byte encoding. So, the next time you hear someone refer to a “single-byte” language, this is exactly what they’re talking about.
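The single-byte idea can be checked with a short Python sketch. The sample words and the pairing of each word with a particular ISO 8859 encoding are assumptions chosen for illustration; the article itself does not name specific encodings.

```python
# One byte has 8 bits, so it can take 2**8 distinct values.
VALUES_PER_BYTE = 2 ** 8

# Sample text paired with a single-byte encoding that covers its alphabet.
# These pairings are illustrative choices, not taken from the article.
SAMPLES = {
    "latin-1": "Straße, garçon, señor",  # Western European languages
    "iso8859-5": "Привет",               # Cyrillic (Russian)
    "iso8859-7": "αλφα",                 # Greek
    "iso8859-8": "שלום",                 # Hebrew
}


def is_single_byte(text: str, encoding: str) -> bool:
    """Return True if every character of text takes exactly one byte."""
    return len(text.encode(encoding)) == len(text)


print(VALUES_PER_BYTE)
for encoding, sample in SAMPLES.items():
    print(encoding, is_single_byte(sample, encoding))
```

Note that each script needs its own single-byte encoding: 256 values are enough for one alphabet (or several closely related ones), but not for all of them at once.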
After the last section about single-byte languages, I’m sure you can guess what a multi-byte language is. Languages that require two or more bytes to represent all of the characters and punctuation marks in their writing systems are considered multi-byte languages. Strictly speaking, “multi-byte” describes an encoding that combines single-byte characters with characters of two or more bytes, and it’s more common than you might think. The Asian languages Chinese, Japanese, and Korean (CJK) are prime examples of this. Their character sets (all the symbols needed to express the full language) contain a much simpler subset, including ASCII characters and punctuation marks, which calls for only one byte. However, these languages also have a far larger set of ideographic characters of Chinese origin, and there are literally thousands of these characters. As a result, a minimum of two bytes is necessary to represent such a high number of characters.
How Encoding Works
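The association of languages with a type of encoding (single-byte, double-byte, or multi-byte) has been altered by the advent of modern Unicode. Unicode is more than a single encoding: it is a standard that includes, among other things, a universal character set and several encoding forms, such as UTF-8.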
If you want to get technical, the reality is that there is a multitude of encoding standards. Although Unicode may be the most popular and well known, other encoding formats that originate from various standards organizations like ISO, ANSI, and KSC do exist. In fact, some encodings use more than one byte per character, even when dealing with the so-called “single-byte” languages.
For example, a special character in French, such as an accented letter, can take more than one byte when it is encoded in UTF-8 (Unicode Transformation Format, 8-bit). But don’t let this confuse you: French is still categorized as a “single-byte” language, despite the fact that the encoding used in a specific case such as this can be a multi-byte encoding.
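The French example can be reproduced in a few lines of Python. This is a small sketch; the word "café" and the comparison with Latin-1 are assumptions chosen to illustrate the point.

```python
# "\u00e9" is the accented letter e with acute, written as an escape
# so the example does not depend on how this file itself is saved.
word = "caf\u00e9"

latin1_bytes = word.encode("latin-1")  # single-byte encoding
utf8_bytes = word.encode("utf-8")      # variable-width encoding

print(len(word), "characters")
print(len(latin1_bytes), "bytes in Latin-1")
print(len(utf8_bytes), "bytes in UTF-8")
print(utf8_bytes)
```

The word has four characters in both cases. Only the byte count changes, because UTF-8 spends two bytes on the accented letter while Latin-1 spends one.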
For more information about encoding and why it’s so important in today’s technology-driven world of business, check out this FAQ from Lionbridge: What are Double-Byte, Single-Byte, and Multi-Byte Encodings?