unicode.

paste anything to see its grapheme segmentation, codepoints, utf-8 byte lengths, and all four normalization forms. it reveals the gap between "what i typed" and "what the computer sees".

graphemes: 36 (what you read as 'characters')
chars: 39 (js string .length, utf-16 units)
codepoints: 38
utf-8 bytes: 67
ascii safe: no
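
the counts above can be reproduced with standard javascript built-ins. this is a minimal sketch, not the page's own code: `textStats` is a hypothetical helper name, and "ascii safe" is assumed to mean that every codepoint is U+007F or below. the sample string is written with escapes so an editor cannot silently renormalize it.

```javascript
// Sample input, spelled with escapes to pin down the exact codepoints.
const sample =
  "Pok\u00E9mon \u{1F981}\u200D\u2B1B \u2014 caf\u00E9 \u00B7 " +
  "\uD55C\uAE00 \u00B7 \u4F60\u597D \u00B7 " +
  "\u0627\u0644\u0639\u0631\u0628\u064A\u0629";

// textStats is a hypothetical helper, not taken from the page.
function textStats(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    // Extended grapheme clusters: what a reader perceives as characters.
    graphemes: [...segmenter.segment(text)].length,
    // .length counts UTF-16 code units, so astral codepoints count twice.
    utf16Units: text.length,
    // String iteration walks codepoints, not code units.
    codepoints: [...text].length,
    // TextEncoder always encodes to UTF-8.
    utf8Bytes: new TextEncoder().encode(text).length,
    // Assumption: "ascii safe" means no codepoint above U+007F.
    asciiSafe: /^[\x00-\x7F]*$/.test(text),
  };
}

const stats = textStats(sample);
console.log(stats);
```

on a runtime with current unicode segmentation data this should match the figures above: 36 graphemes, 39 utf-16 units, 38 codepoints, 67 utf-8 bytes, not ascii safe. the three counts differ because the lion emoji is one grapheme built from three codepoints, and its first codepoint needs a surrogate pair in utf-16.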
grapheme  codepoints                block                             utf-8
P         U+0050                    basic latin                       1b
o         U+006F                    basic latin                       1b
k         U+006B                    basic latin                       1b
é         U+00E9                    latin-1 supplement                2b
m         U+006D                    basic latin                       1b
o         U+006F                    basic latin                       1b
n         U+006E                    basic latin                       1b
·         U+0020                    basic latin                       1b
🦁‍⬛        U+1F981 U+200D U+2B1B     supplemental symbols/pictographs  10b (3 codepoints)
·         U+0020                    basic latin                       1b
—         U+2014                    general punctuation               3b
·         U+0020                    basic latin                       1b
c         U+0063                    basic latin                       1b
a         U+0061                    basic latin                       1b
f         U+0066                    basic latin                       1b
é         U+00E9                    latin-1 supplement                2b
·         U+0020                    basic latin                       1b
·         U+00B7                    latin-1 supplement                2b
·         U+0020                    basic latin                       1b
한        U+D55C                    hangul syllables                  3b
글        U+AE00                    hangul syllables                  3b
·         U+0020                    basic latin                       1b
·         U+00B7                    latin-1 supplement                2b
·         U+0020                    basic latin                       1b
你        U+4F60                    cjk unified ideographs            3b
好        U+597D                    cjk unified ideographs            3b
·         U+0020                    basic latin                       1b
·         U+00B7                    latin-1 supplement                2b
·         U+0020                    basic latin                       1b
ا         U+0627                    arabic                            2b
ل         U+0644                    arabic                            2b
ع         U+0639                    arabic                            2b
ر         U+0631                    arabic                            2b
ب         U+0628                    arabic                            2b
ي         U+064A                    arabic                            2b
ة         U+0629                    arabic                            2b
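
a per-grapheme breakdown like the one above can be sketched by segmenting the text and then inspecting each segment on its own. `describeGraphemes` is a hypothetical name, and the sketch leaves out the block-name column because javascript has no built-in lookup for unicode block names.

```javascript
// describeGraphemes is a hypothetical helper, not taken from the page.
function describeGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const encoder = new TextEncoder();
  return [...segmenter.segment(text)].map(({ segment }) => ({
    grapheme: segment,
    // One "U+XXXX" label per codepoint inside this grapheme cluster.
    codepoints: [...segment].map(
      (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    ),
    // UTF-8 length of the whole cluster.
    utf8Bytes: encoder.encode(segment).length,
  }));
}

// A short excerpt of the sample: e-acute, a space, and the lion + ZWJ + square sequence.
const rows = describeGraphemes("\u00E9 \u{1F981}\u200D\u2B1B");
for (const row of rows) {
  console.log(row.codepoints.join(" "), row.utf8Bytes + "b");
}
```

the emoji row shows why a grapheme and a codepoint are not the same thing: the lion (4 bytes), the zero-width joiner (3 bytes) and the black square (3 bytes) form a single cluster of 10 bytes.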
── normalization forms
NFC: canonical composition
Pokémon 🦁‍⬛ — café · 한글 · 你好 · العربية
67 bytes (= input)
NFD: canonical decomposition
Pokémon 🦁‍⬛ — café · 한글 · 你好 · العربية
81 bytes (≠ input)
NFKC: compatibility composition
Pokémon 🦁‍⬛ — café · 한글 · 你好 · العربية
67 bytes (= input)
NFKD: compatibility decomposition
Pokémon 🦁‍⬛ — café · 한글 · 你好 · العربية
81 bytes (≠ input)
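
the byte counts for the four forms can be checked with `String.prototype.normalize`. this is a sketch under the assumption that the input is the sample shown above; `normalizationReport` is a hypothetical name.

```javascript
// Sample input, spelled with escapes so it stays in its original (composed) form.
const input =
  "Pok\u00E9mon \u{1F981}\u200D\u2B1B \u2014 caf\u00E9 \u00B7 " +
  "\uD55C\uAE00 \u00B7 \u4F60\u597D \u00B7 " +
  "\u0627\u0644\u0639\u0631\u0628\u064A\u0629";

// normalizationReport is a hypothetical helper, not taken from the page.
function normalizationReport(text) {
  const encoder = new TextEncoder();
  return ["NFC", "NFD", "NFKC", "NFKD"].map((form) => {
    const normalized = text.normalize(form);
    return {
      form,
      utf8Bytes: encoder.encode(normalized).length,
      // Strict string comparison: identical only if no codepoint changed.
      sameAsInput: normalized === text,
    };
  });
}

const report = normalizationReport(input);
for (const { form, utf8Bytes, sameAsInput } of report) {
  console.log(form, utf8Bytes + " bytes", sameAsInput ? "= input" : "≠ input");
}
```

the 14 extra bytes in the decomposed forms come from two places. each é (U+00E9, 2 bytes) becomes e plus U+0301 (3 bytes), which adds 1 byte twice. each hangul syllable (3 bytes) becomes three jamo of 3 bytes each, which adds 6 bytes twice. nothing in this sample has a compatibility mapping, so NFKC matches NFC and NFKD matches NFD.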