Understanding Character Encoding

Ruohang Feng, Pigsty Founder (@Vonng)

Programmers spend their lives with code, both source code and character encodings. Representing text with bits is harder than it looks: character sets, collation, normalization, locale rules, variable-length encodings, BOMs, surrogate pairs, regex compatibility, and even security bugs lurk beneath. Here’s the field guide.

0x01 Basics

Computers only understand binary. Encoding bridges abstract data types and bit patterns. Take the number 42: encoding maps it to 00101010; decoding reverses the mapping. Strings are sequences of characters, so “string encoding” maps abstract characters to bits.
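
A quick sketch in Python (purely illustrative; any language with explicit encode/decode calls would do):

```python
# Encoding maps an abstract value to a bit pattern; decoding reverses it.
n = 42
print(format(n, "08b"))        # '00101010' -- the bit pattern for 42

s = "汉"                        # an abstract character
data = s.encode("utf-8")        # encode: character -> bytes
print(data)                     # b'\xe6\xb1\x89'
print(data.decode("utf-8"))     # decode: bytes -> '汉'
```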

Characters vs. glyphs

A character is an abstract textual atom; a glyph is its visual rendering. Most of the time they’re 1–1, but not always: à can be a precomposed character or the combination of a + grave accent. Conversely, a single character in Arabic or Devanagari might comprise many glyph components. Don’t confuse the abstract entity with its shapes.
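
A small Python illustration of the distinction (the two strings below usually render the same glyph):

```python
precomposed = "\u00e0"      # 'à' as one precomposed code point
combined = "a\u0300"        # 'a' followed by COMBINING GRAVE ACCENT
print(precomposed, combined)            # typically look identical on screen
print(precomposed == combined)          # False -- different code point sequences
print(len(precomposed), len(combined))  # 1 2
```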

Character sets

A character set is the inventory of characters you care about. A coded character set assigns each abstract character a number (a code point). ASCII, GB2312, JIS X 0208, ISO-8859-1 are all different sets tailored to different languages.
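
The same abstract character gets a different number in each coded character set; a hedged Python sketch (codec names as Python spells them, byte values from memory):

```python
ch = "汉"
print(hex(ord(ch)))           # 0x6c49 -- its Unicode code point, U+6C49
print(ch.encode("gb2312"))    # b'\xba\xba' -- GB2312 assigns it a different number
print("A".encode("ascii"))    # b'A' -- ASCII code 0x41
# ISO-8859-1 has no slot for 汉 at all; ch.encode("latin-1") raises UnicodeEncodeError.
```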

0x02 Unicode

Unicode tries to be the superset of all sets: one code point per abstract character. It currently defines ~150k characters. Unicode separates layers:

  1. Abstract characters (the idea of “A,” “é,” “汉”).
  2. Code points (U+0041, U+00E9, U+6C49).
  3. Glyphs (fonts decide how to draw them).

It also catalogs metadata: combining marks, directional controls, normalization forms, case folding, collation rules, emoji properties, etc.
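
Python’s unicodedata module exposes a slice of that catalog (names and categories shown are what I expect the Unicode tables to say):

```python
import unicodedata

for ch in ("A", "é", "汉"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
# U+0041 LATIN CAPITAL LETTER A           Lu
# U+00E9 LATIN SMALL LETTER E WITH ACUTE  Ll
# U+6C49 CJK UNIFIED IDEOGRAPH-6C49       Lo

# Combining marks carry a nonzero combining class:
print(unicodedata.combining("\u0301"))    # 230 (COMBINING ACUTE ACCENT)
```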

0x03 From code points to code units

Assigning numbers isn’t enough; we must turn those numbers into byte sequences. Enter code units—the minimal bit chunks used to store/exchange characters. Typical choices are 8, 16, or 32 bits. Unicode defines three mainstream encodings:

  • UTF-32 – fixed length, one 32-bit code unit per code point. Simple but wasteful.
  • UTF-16 – variable length, 16-bit units. Most characters fit in one unit; supplementary planes use surrogate pairs.
  • UTF-8 – variable length, 8-bit units. ASCII stays single-byte; other characters use 2–4 bytes.

Self-synchronization is critical: if a byte is corrupted or you start decoding mid-stream, the decoder should be able to find the next character boundary and lose at most one character. UTF-8 was designed with that in mind: continuation bytes (10xxxxxx) can never be mistaken for lead bytes. UTF-16 gets the same property from its disjoint high- and low-surrogate ranges.
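
Here is a comparison of the three encodings plus a naive resynchronization scan, sketched in Python (the resync loop is my own illustration, not a standard API):

```python
s = "汉A😀"
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    print(codec, s.encode(codec).hex(" "))
# utf-8     e6 b1 89 41 f0 9f 98 80
# utf-16-be 6c 49 00 41 d8 3d de 00
# utf-32-be 00 00 6c 49 00 00 00 41 00 01 f6 00

# Drop the first byte of the UTF-8 stream, then resynchronize:
data = s.encode("utf-8")[1:]
i = 0
while i < len(data) and (data[i] & 0xC0) == 0x80:
    i += 1                       # skip continuation bytes (10xxxxxx)
print(data[i:].decode("utf-8"))  # 'A😀' -- only one character was lost
```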

UTF-32

Pros: fixed length → O(1) indexing, trivial implementation. Cons: 4× the storage of ASCII text, often 2×–4× other encodings. Useful inside APIs where convenience outweighs memory cost.
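
A minimal sketch of the O(1)-indexing trade-off, assuming Python and an explicit endianness so no BOM is emitted:

```python
s = "汉A😀"
u32 = s.encode("utf-32-le")     # 4 bytes per code point, no BOM with explicit endianness
print(len(u32))                  # 12 bytes for 3 code points

def char_at(buf: bytes, i: int) -> str:
    """O(1) random access: the i-th code point sits at a fixed offset."""
    return buf[4 * i : 4 * i + 4].decode("utf-32-le")

print(char_at(u32, 2))           # '😀'
```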

UTF-16

Uses 16-bit units. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) fit in one unit; supplementary characters (U+10000–U+10FFFF) require a surrogate pair. Endianness matters, hence BOMs (FE FF vs FF FE). Historically popular on Windows and Java. Downsides: variable length, surrogate pain, awkward for pure ASCII.
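
Surrogate pairs and the BOM in Python (byte values are what I expect; worth verifying locally):

```python
s = "汉😀"
print(s.encode("utf-16-be").hex(" "))  # 6c 49 d8 3d de 00 -- 😀 needs a surrogate pair
print(s.encode("utf-16-le").hex(" "))  # 49 6c 3d d8 00 de -- same units, bytes swapped
print(s.encode("utf-16").hex(" "))     # ff fe ... -- bare 'utf-16' prepends a BOM (little-endian here)
```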

UTF-8

Dominates the web. ASCII bytes map to themselves; higher code points use multi-byte sequences with leading bits indicating length (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx). Advantages: backward-compatible with ASCII, compact for Latin text, byte-order agnostic, easy to resync after corruption. The main cost is variable length—s[i] is not the i‑th character unless you scan.
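
The lead byte announces the sequence length, which is easy to inspect in Python (a small sketch, not a full validator):

```python
for ch in ("A", "é", "汉", "😀"):
    b = ch.encode("utf-8")
    print(ch, len(b), format(b[0], "08b"))
# A  1 01000001   -> 0xxxxxxx
# é  2 11000011   -> 110xxxxx
# 汉 3 11100110   -> 1110xxxx
# 😀 4 11110000   -> 11110xxx

s = "汉A"
print(s.encode("utf-8")[:1])   # b'\xe6' -- slicing bytes is not slicing characters
print(s[0])                     # '汉'   -- Python strings index by code point
```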

0x04 Normalization & combining marks

Because multiple code-point sequences can render the same glyph (é vs e + combining acute), Unicode defines normalization forms (NFC, NFD, NFKC, NFKD). Comparing strings safely often means normalizing first. Regexes, sorting, case folding, and database uniqueness constraints must choose a consistent form.
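
A minimal Python sketch using unicodedata.normalize:

```python
import unicodedata

a = "\u00e9"     # 'é' precomposed
b = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT
print(a == b)                                   # False
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))          # True
print(unicodedata.normalize("NFKC", "①"))       # '1' -- compatibility forms also fold lookalikes
```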

0x05 Pitfalls

  • BOMs: UTF-8 doesn’t need one, but some files start with EF BB BF; know how your parser handles it (see the sketch after this list).
  • Grapheme clusters: “emoji with skin tone” is multiple code points. Counting “characters” really means counting grapheme clusters, not code points.
  • Security: visually confusable characters (homoglyph attacks) and overlong encodings can bypass filters if you’re sloppy.
  • APIs: know what your language uses internally (a C/C++ char is a byte, a Java char is a UTF-16 code unit, a Go rune is a code point). Never assume “one byte = one character.”
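
Two of these pitfalls in Python form (the byte string is the standard UTF-8 BOM; the emoji is a thumbs-up plus a skin-tone modifier):

```python
raw = b"\xef\xbb\xbfhello"
print(raw.decode("utf-8"))       # '\ufeffhello' -- the BOM survives as a character
print(raw.decode("utf-8-sig"))   # 'hello'       -- the -sig codec strips it

thumbs = "\U0001F44D\U0001F3FD"  # 👍 + skin-tone modifier: one grapheme cluster
print(len(thumbs))                # 2 -- len() counts code points, not user-perceived characters
```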

0x06 Takeaways

  1. A character is an abstract symbol; glyphs are its shapes.
  2. Unicode assigns code points; UTFs encode them into bytes.
  3. UTF-8 is the lingua franca, UTF-16 survives in Windows/Java land, UTF-32 is a niche convenience.
  4. Always normalize, validate, and respect boundaries when comparing or slicing strings.
  5. Understand your stack’s encoding assumptions—otherwise “你好” will become mojibake at the worst possible time.
