In the previous post, we explored the origin of digital characters with ASCII code. While ASCII was great, its limit of 128 characters was far too small to cover all the languages of the world.

Today, let's discuss Unicode, the ambitious standard that goes beyond ASCII and aims to cover all of the world's characters, including Hangul, Kanji, Arabic, and even emojis.


In the early days of the Internet, it was common to encounter screens filled with garbled, unreadable symbols when accessing websites from other countries. This phenomenon is known as 'Mojibake'.

The reason was straightforward. Each country and language used different character encodings. The same binary data would be interpreted by one computer as "this is Hangul" and by another as "this is a Western language".
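
As a rough illustration of how the same bytes can be read two ways, here is a minimal Python sketch. The choice of EUC-KR (a legacy Korean encoding) and Latin-1 (a Western European encoding) is an assumption made for the example; the same effect appears with many other pairs of encodings.

```python
# The same bytes, read under two different encodings.
text = "가"                          # a single Hangul syllable
raw = text.encode("euc_kr")          # stored as 2 bytes in the legacy Korean encoding

as_korean = raw.decode("euc_kr")     # the reading the author intended
as_western = raw.decode("latin-1")   # the same bytes read as Western European text

print(raw.hex())     # b0a1
print(as_korean)     # 가
print(as_western)    # °¡  (Mojibake)
```

Nothing in the bytes themselves says which reading is correct, which is why a mismatch between the sender's and the receiver's encoding produced garbage on screen.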

The savior that emerged to quell this digital Tower of Babel was Unicode.


1. What is Unicode?



The name Unicode is meant to suggest a unique, unified, universal encoding. It is an international standard that assigns a unique number (Code Point) to every character it covers, with the goal of covering every character in the world.

Whereas ASCII code numbers its characters from 0 to 127, Unicode's code space runs from U+0000 to U+10FFFF, which leaves room for more than 1.1 million code points. Currently, over 140,000 characters are registered in Unicode, which includes not only living languages but also ancient scripts such as Egyptian hieroglyphs, musical notation, and the emojis that we use every day.

Code Point

Unicode identifies each character by its code point, written as U+ followed by a hexadecimal number.

  • Latin character 'A': $U+0041$ (same as ASCII code)

  • Hangul '가': $U+AC00$

  • Kanji '日': $U+65E5$

  • Emoji '😀': $U+1F600$

Now, any computer in the world that receives $U+1F600$ can interpret it as the grinning face (😀), regardless of its language settings.
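
The code points listed above can be checked with a short Python sketch. The helper name `code_point` is made up for this example; `ord` and `chr` are Python built-ins that convert between a character and its code point.

```python
def code_point(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "가", "日", "😀"]:
    print(ch, code_point(ch))

# Going the other way: from a code point back to the character.
print(chr(0x1F600))  # 😀
```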

2. Is Unicode the Same as UTF-8?

This is a common point of confusion. “Unicode” and the frequently seen “UTF-8” are not the same. Understanding the relationship between them is the crux of today’s article.

  • Unicode: An abstract 'map' that links characters to numbers.

  • UTF-8 (Encoding): The 'method' of actually storing those numbers as bytes, whether in memory, in a file, or over a network.

If Unicode is a 'catalog' that assigns ID numbers to all items in the world, then UTF-8 is the 'technology' that packages those items into boxes for transport.
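
To make the distinction concrete, the sketch below takes one character, looks up its single code point, and then packages that same code point with three different encodings. The big-endian variants of UTF-16 and UTF-32 are chosen here only so that the bytes are easy to read; that choice is an assumption of the example.

```python
ch = "가"

# The 'catalog' entry: one number, independent of any storage format.
print(f"U+{ord(ch):04X}")           # U+AC00

# Three different 'boxes' for the same number.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(utf8.hex())    # eab080    (3 bytes)
print(utf16.hex())   # ac00      (2 bytes)
print(utf32.hex())   # 0000ac00  (4 bytes)
```

All three byte sequences decode back to the same character, because they all carry the same code point.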

3. Why Did UTF-8 Become the Standard?



There are several ways to store Unicode characters in computers (encoding), such as UTF-16 and UTF-32. However, over 98% of the web currently uses UTF-8.

UTF-8 won out largely thanks to its 'variable-length' design: it uses between 1 and 4 bytes per character, depending on the character's code point.

| Character Type | Example | Unicode Number | UTF-8 Storage Size | Feature |
|---|---|---|---|---|
| Basic Latin | A | $U+0041$ | 1 Byte | 100% compatible with ASCII code |
| Middle Eastern/European | Ω, ¶ | $U+03A9$ (Ω) | 2 Bytes | Latin extended, Greek, etc. |
| CJK (Chinese, Japanese, Korean) | 한, 中, あ | $U+D55C$ (한) | 3 Bytes | Most East Asian characters |
| Emoji/Ancient Languages | 🚀 | $U+1F680$ | 4 Bytes | Characters beyond the Basic Multilingual Plane (above $U+FFFF$) |
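
The storage sizes in the table can be verified with a few lines of Python. The helper name `utf8_len` is hypothetical; it simply encodes one character and counts the resulting bytes.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

for ch in ["A", "Ω", "¶", "한", "中", "あ", "🚀"]:
    print(f"{ch}  U+{ord(ch):04X}  {utf8_len(ch)} byte(s)")
```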

Advantages of UTF-8

  1. ASCII Compatibility: A document containing only ASCII characters, such as plain English text, is byte-for-byte identical in ASCII and UTF-8, so it works with existing ASCII-based systems unchanged.

  2. Efficiency: Frequently used English letters and digits take just 1 byte each, while less common characters take 2 to 4 bytes, which keeps predominantly English text compact.
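
Both advantages can be observed directly. The sketch below is only an illustration: the sample strings and the helper name `encoded_sizes` are assumptions, and the little-endian variants of UTF-16 and UTF-32 are used so that no byte-order mark is added to the byte counts.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte count of the same text under three Unicode encodings."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

english = "Hello, Unicode"
korean = "안녕하세요"

# ASCII compatibility: pure ASCII text produces identical bytes either way.
print(english.encode("ascii") == english.encode("utf-8"))  # True

# Efficiency: compare how many bytes each encoding needs.
print(encoded_sizes(english))
print(encoded_sizes(korean))
```

For the English sample, UTF-8 needs one byte per character while UTF-16 and UTF-32 need two and four. For the Korean sample, UTF-8 needs three bytes per character, so UTF-16 is actually smaller there; UTF-8's size advantage applies mainly to text dominated by ASCII characters.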

4. Cautions for Developers Using Unicode

Unlike the ASCII era, when every character occupied exactly 1 byte, a Unicode environment requires care when calculating the "length" of a string.

  • In terms of memory (bytes, in UTF-8): 'A' is 1 byte, '한' is 3 bytes.

  • In terms of characters: 'A' is 1 character, '한' is also 1 character.

If you cut or store strings by byte count, as in the single-byte era, you risk splitting a multi-byte character (Hangul, Kanji, etc.) in the middle and corrupting the data. Most modern programming languages provide Unicode-aware string types or libraries, so cut strings at character boundaries rather than at arbitrary byte offsets.
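
The difference between the two lengths, and the damage done by a careless cut, can be sketched in Python as follows. The function `truncate_utf8` is a hypothetical helper written for this post, not a standard library function; it is one simple way to fit text into a byte budget without leaving half a character behind.

```python
text = "A한"
print(len(text))                    # 2 characters
print(len(text.encode("utf-8")))    # 4 bytes

# Cutting by byte count slices through the middle of '한'.
naive = text.encode("utf-8")[:2]
try:
    naive.decode("utf-8")
except UnicodeDecodeError as exc:
    print("corrupted:", exc.reason)

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    cut = text.encode("utf-8")[:max_bytes]
    # errors="ignore" discards the incomplete sequence left at the end, if any.
    return cut.decode("utf-8", errors="ignore")

print(truncate_utf8(text, 2))  # A
print(truncate_utf8(text, 4))  # A한
```

The helper drops a character entirely when it does not fit, which is usually preferable to storing a broken byte sequence.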


Summary

  1. Unicode: An international standard that assigns unique numbers (Code Points) to all characters in the world.

  2. Purpose: To unify encoding methods that differed by language and resolve character corruption (Mojibake).

  3. UTF-8: The most widely used method of storing Unicode. It is a variable-length encoding: English letters take 1 byte each, while most Hangul, Chinese, and Japanese characters take 3 bytes.

Unicode serves as the most inclusive infrastructure of the digital age, enabling the exchange of information across the globe without language barriers.


🚀 Upcoming Posts

Now that we understand how to save text (encoding), it's time to learn how to handle data more intelligently. "How are images saved as 0s and 1s?" We will demystify pixels, resolution, and the principles of RGB color in a very easy way.
