Unicode Lookup
Enter any character to see its Unicode codepoint, category, and UTF-8/UTF-16 encoding. Supports emoji, CJK, and the full Unicode range. Everything runs in your browser — nothing is sent to any server.
What Is Unicode?
Unicode is a universal character encoding standard maintained by the Unicode Consortium. Its goal is to assign a unique number, called a codepoint, to every character used in every writing system in the world. As of Unicode 15.1, the standard covers over 149,000 characters across 161 scripts, including Latin, Greek, Cyrillic, Arabic, Devanagari, CJK (Chinese, Japanese, Korean), and thousands of emoji and symbols.
Before Unicode, there were hundreds of incompatible character encoding systems. ASCII handled English with 128 characters. ISO 8859 added support for European languages. Shift-JIS and EUC-JP handled Japanese. GB2312 and Big5 handled Simplified and Traditional Chinese. Each system used overlapping code ranges for different characters, so text written in one encoding would display as garbled characters (mojibake) in another. Unicode solved this by creating a single, unified numbering system for all characters.
This lookup tool lets you paste any character and instantly see its Unicode codepoint, category, UTF-8 byte encoding, UTF-16 code units, and HTML entity. It supports the entire Unicode range including emoji, supplementary plane characters, and combining marks. Everything runs locally in your browser.
How Unicode Encoding Works
A Unicode codepoint is a number, typically written with a U+ prefix in hexadecimal (e.g., U+0041 for the letter "A", U+4E16 for the Chinese character "world"). But a codepoint is not an encoding. To actually store or transmit a character in a computer, the codepoint must be encoded into bytes using a Unicode Transformation Format (UTF).
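The distinction between a character and its codepoint can be sketched in a few lines of Python. This is only an illustration, not how the tool itself is implemented; `codepoint_label` is a hypothetical helper name.

```python
def codepoint_label(ch):
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    # ord() gives the codepoint as an integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


print(codepoint_label("A"))        # U+0041
print(codepoint_label("\u4e16"))   # U+4E16
print(chr(0x4E16))                 # chr() goes the other way: number to character
```

Note that the label is just a number. It says nothing yet about which bytes end up on disk or on the wire.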
UTF-8 is the most widely used encoding on the web. It uses 1 byte for ASCII characters (U+0000 to U+007F), 2 bytes for Latin-extended and many other scripts (U+0080 to U+07FF), 3 bytes for the rest of the Basic Multilingual Plane, including most CJK characters (U+0800 to U+FFFF), and 4 bytes for supplementary characters like most emoji (U+10000 to U+10FFFF). This variable-width design makes UTF-8 backward-compatible with ASCII and space-efficient for English-heavy text.
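The byte-length ranges above map directly onto a small encoder. The following is a minimal sketch, assuming Python 3.8 or later; `utf8_encode` is a hypothetical name, and in practice you would simply call `str.encode("utf-8")`, which the loop uses as a cross-check.

```python
def utf8_encode(cp):
    """Encode one codepoint to UTF-8 bytes (illustrative, not production code)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# One sample character per length class: A, e-acute, U+4E16, grinning face.
for ch in "A\u00e9\u4e16\U0001F600":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with the built-in encoder
    print(f"U+{ord(ch):04X}", encoded.hex(" "))
```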
UTF-16 uses one 2-byte code unit for characters in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) and two code units (a surrogate pair, 4 bytes in total) for supplementary characters. JavaScript strings are sequences of UTF-16 code units, which is why this tool shows UTF-16 code units alongside UTF-8 bytes. Understanding both encodings is essential when working with text processing in different programming environments.
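The surrogate-pair arithmetic is short enough to write out. This sketch uses Python for illustration; `utf16_code_units` is a hypothetical helper, and its result is compared against Python's built-in big-endian UTF-16 encoder.

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units for one codepoint as a list of integers."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0xFFFF:
        return [cp]                      # BMP: a single code unit
    offset = cp - 0x10000                # 20 bits remain to be split
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]


units = utf16_code_units(0x1F600)
print([f"{u:04X}" for u in units])       # ['D83D', 'DE00']

# Cross-check against the built-in encoder (big-endian, no byte order mark).
expected = "\U0001F600".encode("utf-16-be")
assert b"".join(u.to_bytes(2, "big") for u in units) == expected
```

The length of the returned list is what JavaScript reports as the string length for that character.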
Common Use Cases
- Debugging encoding issues: When text appears garbled (e.g., "Ã©" instead of "é"), looking up the codepoints and their UTF-8 byte sequences can identify whether data was double-encoded or decoded with the wrong charset.
- HTML entity reference: When you need to insert a special character in HTML and want to use a numeric entity (e.g., &#8364; for the euro sign), this tool gives you the decimal codepoint to construct the entity.
- Cross-platform text verification: Characters that look identical on screen might have different codepoints (e.g., Latin "A" U+0041 vs Cyrillic "А" U+0410). This tool reveals the actual codepoint, helping catch homoglyph issues in security contexts.
- Internationalization (i18n) development: Developers building multilingual applications need to understand how characters from different scripts are encoded, how many bytes they consume, and what category they belong to.
- Emoji analysis: Modern emoji can be composed of multiple codepoints (e.g., family emoji are sequences of individual person emoji joined by zero-width joiners, U+200D). This tool breaks them down codepoint by codepoint.
- Database column sizing: When designing database schemas for multilingual data, knowing that most CJK characters require 3 bytes in UTF-8 while basic Latin characters require 1 byte helps you estimate storage needs accurately.
- Regular expression writing: Building regex patterns that match specific Unicode categories or ranges requires knowledge of codepoint values and category names that this tool provides.
- Typography and font development: Font designers need to know which Unicode blocks their fonts should cover and what codepoints each glyph maps to.
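Several of the use cases above (mojibake debugging, homoglyph checks, emoji breakdown) come down to listing the codepoints in a string. The sketch below does this with Python's standard `unicodedata` module; `describe` is a hypothetical helper, and the Latin-1 misdecoding is just one assumed way that garbling can arise.

```python
import unicodedata


def describe(text):
    """Return (codepoint label, general category, name) for each codepoint."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Mojibake: the UTF-8 bytes of U+00E9 decoded as Latin-1 become two characters.
garbled = "\u00e9".encode("utf-8").decode("latin-1")
print(describe(garbled))

# Homoglyphs: Latin A and Cyrillic A look alike but are different codepoints.
print(describe("A\u0410"))

# Family emoji: man, woman, girl joined by zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
for row in describe(family):
    print(row)
```

Seeing U+00C3 followed by U+00A9 where a single U+00E9 was expected is a typical sign that UTF-8 bytes were decoded with a single-byte charset.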
Tips and Best Practices
- Paste, do not type, for exotic characters. If you need to look up an emoji or CJK character, copy it from the source where you found it and paste it into the input. This avoids keyboard input method issues.
- Check for invisible characters. If your text behaves unexpectedly (e.g., strings that look identical but do not compare as equal), paste both versions into this tool. You may find zero-width joiners (U+200D), byte order marks (U+FEFF), or other invisible characters.
- Use the UTF-8 column for storage calculations. Count the number of hex bytes shown in the UTF-8 column to know exactly how many bytes a character consumes on disk when stored as UTF-8.
- Understand surrogate pairs for JavaScript. Characters above U+FFFF (like most emoji) are stored as two UTF-16 code units in JavaScript. This means "😀".length returns 2 for a single emoji character. The UTF-16 column in this tool shows both code units.
- Use HTML entities for maximum compatibility. In HTML documents where you are unsure about the character encoding declaration, using numeric HTML entities (e.g., &#233; for é, e with acute accent) lets the character render correctly whatever charset the document declares.
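Two of the tips above can be checked programmatically as well. This is a minimal sketch using Python's `unicodedata` module; `find_invisible` and `numeric_entity` are hypothetical helpers, and treating the general category Cf (format characters) as "invisible" is a simplifying assumption that does not cover every invisible character.

```python
import unicodedata


def find_invisible(text):
    """Return (index, codepoint label) for each format-category (Cf) character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def numeric_entity(ch):
    """Build a decimal numeric HTML entity for a single character."""
    return f"&#{ord(ch)};"


clean = "abc"
dirty = "\ufeffa\u200dbc"             # leading byte order mark plus a hidden joiner
print(clean == dirty)                  # False, although both may look like "abc"
print(find_invisible(dirty))           # [(0, 'U+FEFF'), (2, 'U+200D')]

print(numeric_entity("\u20ac"))        # &#8364;
print(numeric_entity("\u00e9"))        # &#233;
```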
Unicode Lookup vs Alternatives
unicode.org charts: The official Unicode character charts are comprehensive and authoritative but require navigating PDF files or a complex search interface. This tool gives you instant results by simply pasting a character.
fileformat.info and compart.com: These websites provide detailed Unicode character pages with full property listings. They are excellent reference sites but require searching by codepoint or name. This tool is faster when you already have the character and just need its encoding details.
Programming language utilities: Python's unicodedata module, JavaScript's codePointAt(), and similar language functions provide Unicode information programmatically. This tool is more convenient for quick lookups when you do not have a REPL open, and it shows multiple encoding formats side by side in a single view.