Unicode Block Reference: Find Characters by Range and Script

Unicode.live Editorial
2026-06-14

A practical Unicode block reference for finding characters by range, script, and common use during debugging and text handling.

Unicode blocks are one of the quickest ways to identify where a character belongs, narrow down a script, debug rendering issues, and build practical validation rules without guessing. This reference explains how Unicode blocks work, how they differ from scripts and categories, and how to use block ranges as a reliable first-pass lookup when you need to find characters by range, script, or common usage.

Overview

This guide is designed as a durable Unicode block reference for developers, QA teams, technical writers, and anyone working with multilingual text. If you have ever inspected a character and wondered whether it came from Latin Extended-A, Arabic, Cyrillic, CJK Symbols and Punctuation, or an emoji-related range, this page gives you a practical way to reason about it.

At a high level, a Unicode block is a named range of code points. Unicode assigns code points such as U+0041 for A or U+03A9 for Ω, and many of those code points are grouped into blocks for organizational purposes. Blocks help you answer questions like:

  • Which range contains this character?
  • What neighboring characters are likely to be related?
  • Is this text mainly Latin, Greek, Cyrillic, Arabic, Devanagari, Han, or something else?
  • Which ranges should I inspect when debugging fonts, search, sorting, slug generation, or input filtering?

For quick orientation, here are some commonly referenced blocks and ranges:

  • Basic Latin: U+0000–U+007F
  • Latin-1 Supplement: U+0080–U+00FF
  • Latin Extended-A: U+0100–U+017F
  • Greek and Coptic: U+0370–U+03FF
  • Cyrillic: U+0400–U+04FF
  • Hebrew: U+0590–U+05FF
  • Arabic: U+0600–U+06FF
  • Devanagari: U+0900–U+097F
  • Thai: U+0E00–U+0E7F
  • Hiragana: U+3040–U+309F
  • Katakana: U+30A0–U+30FF
  • CJK Unified Ideographs: U+4E00–U+9FFF
  • Private Use Area: U+E000–U+F8FF
  • Alphabetic Presentation Forms: U+FB00–U+FB4F
  • Halfwidth and Fullwidth Forms: U+FF00–U+FFEF
  • Emoticons: U+1F600–U+1F64F

Those examples are useful entry points, but they are only part of the picture. A script may span multiple blocks, and a block may contain punctuation, marks, symbols, or compatibility characters that do not behave the way developers expect. The rest of this article focuses on those edge cases, because that is where most real-world debugging happens.

Core concepts

This section covers the key ideas you need before relying on any list of Unicode blocks or range reference in production work.

1. A block is a range, not a language model

A Unicode block is simply a contiguous interval of code points with a label. It is mainly an organizational tool. That means a block name often gives you a strong clue, but not a complete semantic guarantee.

For example, the Arabic block contains characters associated with Arabic writing, but Arabic text in practice may also involve Arabic Supplement, Arabic Extended ranges, combining marks, digits, punctuation, and bidirectional behavior beyond a single block. Likewise, Japanese text may mix Hiragana, Katakana, CJK Unified Ideographs, ASCII punctuation, and fullwidth forms.

Use blocks to narrow down possibilities, not to make simplistic assumptions about language.

2. Block is not the same as script

This is one of the most important distinctions in Unicode work. A block is a code point range. A script is a writing system property such as Latin, Greek, Cyrillic, Arabic, Han, or Common. One script can appear in multiple blocks, and one block may contain characters used across multiple scripts or shared punctuation.

Examples:

  • Latin script appears across Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, and beyond.
  • Han script extends far beyond one block and includes multiple ideograph extensions.
  • Common script characters such as punctuation, spaces, and digits may appear alongside text from many languages.

If your task is input validation or language-aware processing, script data is often more meaningful than block data. If your task is lookup, debugging, or visual inspection, blocks are often the fastest first step.

3. Range notation matters

Unicode ranges are usually written in hexadecimal code point form, such as U+0041 or U+1F642. A block range like U+3040–U+309F is inclusive: every code point from the first endpoint to the last belongs to that block, whether or not a character has been assigned to it.

Three practical details matter here:

  • Hexadecimal is standard, so convert decimal values before comparing.
  • Ranges include unassigned positions, not just active characters.
  • Supplementary planes include code points above U+FFFF, so make sure your tooling handles them correctly.

This is especially relevant in JavaScript and regex work, where code units and code points are easy to confuse.
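As a minimal sketch of the points above, the following Python snippet parses and formats hexadecimal code points and compares a code point count with a UTF-16 code unit count. The helper names are illustrative, not part of any standard API.

```python
def parse_code_point(text: str) -> int:
    """Parse 'U+1F642', '0x1F642', or '1F642' as a hexadecimal code point."""
    cleaned = text.strip().upper()
    if cleaned.startswith("U+") or cleaned.startswith("0X"):
        cleaned = cleaned[2:]
    return int(cleaned, 16)


def format_code_point(cp: int) -> str:
    """Format an integer as U+XXXX, padded to at least four hex digits."""
    return f"U+{cp:04X}"


def utf16_code_units(s: str) -> int:
    """Count UTF-16 code units, which is what JavaScript's String.length reports."""
    return len(s.encode("utf-16-le")) // 2


face = chr(parse_code_point("U+1F642"))    # a supplementary-plane character
print(format_code_point(ord("A")))         # U+0041
print(len(face), utf16_code_units(face))   # 1 2
```

Python strings are indexed by code point, so the supplementary-plane character counts as one item there but as two code units (a surrogate pair) in UTF-16 based environments.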

4. Characters are not always one visible symbol

Many developers look up a visible symbol and expect one code point. Unicode often does not work that way. What appears as one user-perceived character may be a sequence:

  • a base letter plus combining mark
  • an emoji ZWJ sequence, where zero width joiners combine several emoji into one glyph
  • a flag made from regional indicator symbols
  • a family emoji assembled from several characters

In those cases, a Unicode range reference can tell you where each code point comes from, but the visible result depends on the entire sequence. This is why block lookup is useful for inspection but insufficient for grapheme handling.
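To make the sequences concrete, here is a small sketch that lists the code points behind a few strings that usually render as a single symbol. It uses only Python's standard unicodedata module; the sample strings and the describe helper are chosen for illustration.

```python
import unicodedata


def describe(s: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in s."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in s]


samples = {
    "e + combining acute": "e\u0301",
    "flag from regional indicators F and R": "\U0001F1EB\U0001F1F7",
    "woman + ZWJ + personal computer": "\U0001F469\u200D\U0001F4BB",
}

for label, text in samples.items():
    print(f"{label}: {len(text)} code points")
    for entry in describe(text):
        print("  " + entry)
```

Each sample is one user-perceived character on most platforms, yet each reports two or three code points, and those code points come from different blocks.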

For deeper normalization and input comparison work, see How to Normalize and Compare User Input Across Languages.

5. Compatibility forms can mislead visual inspection

Some characters look familiar but come from compatibility or presentation ranges rather than the expected script block. Common trouble spots include:

  • Fullwidth forms that resemble ASCII but are different code points
  • Ligatures in Alphabetic Presentation Forms
  • Mathematical alphanumeric symbols that look like stylized Latin letters
  • Confusable characters from different scripts that appear nearly identical

If a string looks normal but fails matching, sorting, or validation checks, inspect the exact code points and their blocks. For phishing, moderation, or identifier reviews, also see Unicode Confusables Checker Guide for Developers.
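One way to see the difference between compatibility forms and true confusables is to apply NFKC normalization, sketched below with Python's standard unicodedata module. The compat_fold name is an illustrative helper. Whether NFKC is appropriate for your data is a separate policy decision, because it discards distinctions that some applications need.

```python
import unicodedata


def compat_fold(s: str) -> str:
    """Apply NFKC, which maps compatibility characters to plain equivalents."""
    return unicodedata.normalize("NFKC", s)


fullwidth = "\uFF21\uFF22\uFF23"   # FULLWIDTH LATIN CAPITAL LETTER A, B, C
ligature = "\uFB01"                # LATIN SMALL LIGATURE FI
math_bold = "\U0001D400"           # MATHEMATICAL BOLD CAPITAL A
cyrillic_a = "\u0430"              # CYRILLIC SMALL LETTER A, resembles Latin "a"

print(compat_fold(fullwidth) == "ABC")   # True
print(compat_fold(ligature) == "fi")     # True
print(compat_fold(math_bold) == "A")     # True
print(compat_fold(cyrillic_a) == "a")    # False: confusables are not compatibility forms
```

The last line is the important one: normalization folds width variants, ligatures, and styled letters, but it does not map a Cyrillic letter to its Latin lookalike, so confusable detection needs its own check.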

6. Blocks are useful for CSS, fonts, and rendering triage

When text renders as tofu boxes, mismatched glyphs, or inconsistent widths, block lookup gives you a fast diagnostic path. If missing characters all come from one range, the issue is often tied to font coverage, fallback order, or shaping support rather than your application logic.

This is especially common with:

  • extended Latin letters used in names
  • Arabic shaping and marks
  • CJK punctuation and fullwidth forms
  • emoji presentation differences across platforms

Block awareness will not solve rendering alone, but it helps you isolate the affected ranges before you change fonts, CSS, or fallback behavior.

7. A practical block table to keep in mind

The following simplified table is useful for routine debugging and quick identification:

| Block | Range | Common use |
|---|---|---|
| Basic Latin | U+0000–U+007F | ASCII letters, digits, punctuation, control characters |
| Latin-1 Supplement | U+0080–U+00FF | Western European letters and symbols |
| Latin Extended-A | U+0100–U+017F | Central and Eastern European diacritics |
| Greek and Coptic | U+0370–U+03FF | Greek letters and related signs |
| Cyrillic | U+0400–U+04FF | Slavic and related writing systems |
| Hebrew | U+0590–U+05FF | Hebrew letters, marks, punctuation |
| Arabic | U+0600–U+06FF | Arabic letters, digits, marks |
| Devanagari | U+0900–U+097F | Indic writing for Hindi and others |
| Hiragana | U+3040–U+309F | Japanese syllabary |
| Katakana | U+30A0–U+30FF | Japanese syllabary and phonetic forms |
| CJK Unified Ideographs | U+4E00–U+9FFF | Han ideographs used across East Asian text |
| Hangul Syllables | U+AC00–U+D7AF | Korean syllables |
| Private Use Area | U+E000–U+F8FF | Application-specific assignments |
| Halfwidth and Fullwidth Forms | U+FF00–U+FFEF | Compatibility width variants |
| Emoticons | U+1F600–U+1F64F | Common emoji faces and gestures |

Treat this table as a shortcut, not a full catalog. It is enough to help you find a Unicode block quickly and decide what to inspect next.
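If you want the table in executable form, a sorted list of ranges plus a binary search is usually enough. The sketch below copies only the rows listed above, so it returns None for anything outside them; a real tool would load the complete block list from the Unicode Character Database instead. The names BLOCKS and block_of are illustrative.

```python
from bisect import bisect_right
from typing import Optional

# (start, end, name) rows copied from the table above, sorted by start.
# This is a partial list, not the full Unicode block catalog.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0900, 0x097F, "Devanagari"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
    (0xE000, 0xF8FF, "Private Use Area"),
    (0xFF00, 0xFFEF, "Halfwidth and Fullwidth Forms"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
_STARTS = [start for start, _, _ in BLOCKS]


def block_of(ch: str) -> Optional[str]:
    """Return the block name for a single character, or None if not in BLOCKS."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1   # last block whose start is <= cp
    if i >= 0:
        _, end, name = BLOCKS[i]
        if cp <= end:
            return name
    return None


print(block_of("A"))            # Basic Latin
print(block_of("\u03A9"))       # Greek and Coptic
print(block_of("\U0001F642"))   # Emoticons
print(block_of("\u2603"))       # None (not covered by this partial table)
```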

If you are using a character block table regularly, these related Unicode terms are worth keeping separate.

Code point

A code point is the abstract numeric value assigned by Unicode, written like U+0065. Blocks are built from code point ranges.

Character

In plain discussion, character often means a visible symbol. In implementation work, that can be misleading because one visible symbol may involve multiple code points.

Grapheme cluster

A grapheme cluster is a user-perceived character. This matters for cursor movement, deletion, string length, and emoji handling.

Script

A script is the writing system property associated with a character. Script analysis is often better than block analysis for language-aware validation.

General category

This property groups characters by type, such as uppercase letter, lowercase letter, decimal number, punctuation, symbol, or combining mark. It is often more useful than block membership for parser logic.
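As a quick illustration, Python's standard unicodedata.category returns the two-letter general category for a character. The categories helper below is an illustrative wrapper.

```python
import unicodedata


def categories(s: str) -> list[tuple[str, str]]:
    """Return (code point, general category) pairs for each character in s."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in s]


# Uppercase, lowercase, ASCII digit, Devanagari digit, hyphen, euro sign, combining acute
for code_point, category in categories("Aa7\u0967-\u20AC\u0301"):
    print(code_point, category)
```

The ASCII digit 7 and DEVANAGARI DIGIT ONE (U+0967) live in different blocks but share the category Nd, which is why category checks tend to generalize better than block checks in parser logic.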

Normalization

Normalization converts canonically equivalent sequences into a consistent form such as NFC or NFD. Two strings can look identical while using different code point sequences. For that reason, block lookup should be paired with normalization in search, matching, and deduplication tasks.
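A short sketch of that pairing, using Python's standard unicodedata.normalize: two spellings of "é" compare unequal as raw strings and equal once both are normalized to NFC. The same_text helper is illustrative.

```python
import unicodedata

composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)            # False
print(same_text(composed, decomposed))   # True
```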

Confusable characters

These are characters from different scripts or ranges that look the same or nearly the same. This matters in usernames, domains, code review, and security-sensitive displays.

Private Use Area

These ranges are intentionally left for private agreements between systems. If a character lands there, you cannot assume a standard meaning without application context.

Combining mark

A combining mark modifies a base character. Looking only at the base letter's block may cause you to miss why sorting, rendering, or equality checks behave unexpectedly.

If you need to inspect escaped representations while debugging these distinctions, see How to Convert Text to Unicode Escape Sequences.

Practical use cases

This is where a reference for Unicode blocks and script ranges becomes most useful: real workflows where you need a quick answer and a repeatable method.

Debugging unexpected input

Suppose a registration form accepts a username that later breaks a downstream process. A block lookup can reveal that the value contains fullwidth Latin letters, combining marks, or mixed-script confusables rather than plain ASCII. That insight helps you decide whether to normalize, reject, or flag the input for review.
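A first-pass check for that scenario might look like the sketch below. It flags fullwidth forms through the East Asian Width property, combining marks through the general category, and any other non-ASCII character. The function name and labels are illustrative, and the three checks are an assumption about what a hypothetical ASCII-only downstream system would care about.

```python
import unicodedata


def flag_suspicious(s: str) -> list[tuple[str, str]]:
    """Return (code point, reason) pairs for characters worth a closer look."""
    findings = []
    for ch in s:
        code_point = f"U+{ord(ch):04X}"
        if unicodedata.east_asian_width(ch) == "F":
            findings.append((code_point, "fullwidth form"))
        elif unicodedata.category(ch).startswith("M"):
            findings.append((code_point, "combining mark"))
        elif ord(ch) > 0x7F:
            findings.append((code_point, "non-ASCII"))
    return findings


print(flag_suspicious("admin"))         # []
print(flag_suspicious("\uFF41dmin"))    # [('U+FF41', 'fullwidth form')]
print(flag_suspicious("p\u0430ypal"))   # [('U+0430', 'non-ASCII')]
```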

Related reading: How to Validate Unicode in JSON APIs and Web Forms.

Building character allowlists with caution

Some systems use allowlists for identifiers, SKU codes, slugs, or search filters. Block-based rules can be a useful starting point, such as permitting Basic Latin plus a small set of Latin extensions. But block-only rules often become too broad or too narrow. A better pattern is:

  1. Start with the target user need.
  2. List scripts and categories you want to support.
  3. Use blocks as a review aid, not the only rule source.
  4. Test with real names, addresses, and multilingual content.
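The pattern above can be sketched as a category-based allowlist instead of a block-based one. The allowed categories and the extra punctuation set below are assumptions for a hypothetical name field, not a recommended policy; adjust them to the user need from step 1.

```python
import unicodedata

# Assumed policy for a hypothetical name field: letters, marks, decimal digits.
ALLOWED_CATEGORY_PREFIXES = ("L", "M", "Nd")
# Assumed extra punctuation commonly found in names.
ALLOWED_EXTRA = set(" -'.")


def disallowed_chars(value: str) -> list[str]:
    """Return the code points in value that the allowlist would reject."""
    normalized = unicodedata.normalize("NFC", value)
    rejected = []
    for ch in normalized:
        if ch in ALLOWED_EXTRA:
            continue
        if unicodedata.category(ch).startswith(ALLOWED_CATEGORY_PREFIXES):
            continue
        rejected.append(f"U+{ord(ch):04X}")
    return rejected


print(disallowed_chars("Zo\u00EB O'Neil-Smith"))        # []
print(disallowed_chars("\u0905\u092E\u093F\u0924"))     # [] (a Devanagari name)
print(disallowed_chars("a\u200Bb"))                     # ['U+200B']
```

A Basic Latin plus Latin Extended-A block rule would have rejected the Devanagari name and, depending on how it was written, possibly accepted unwanted symbols. The category rule accepts letters from any script while still catching the zero width space.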

For URL-related decisions, see Slug Generation for Multilingual URLs: Unicode vs ASCII.

Investigating mojibake and broken encoding

When text displays as nonsense, the visible garbage often lands in recognizable blocks that hint at what happened during decoding. A string that should contain UTF-8 text may display as Latin-1-derived artifacts instead. Seeing which ranges the broken characters fall into gives you clues about the incorrect interpretation step.
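The classic case can be reproduced in a few lines. This sketch encodes "café" as UTF-8, decodes the bytes as Latin-1 by mistake, and shows that the resulting artifacts fall in the Latin-1 Supplement range. Real incidents often involve Windows-1252 or several rounds of mis-decoding, so treat the round trip at the end as a best case.

```python
def code_points(s: str) -> list[str]:
    """Return the U+XXXX form of each code point in s."""
    return [f"U+{ord(ch):04X}" for ch in s]


original = "caf\u00E9"                                  # "café"
garbled = original.encode("utf-8").decode("latin-1")    # the wrong decoding step

print(code_points(original))   # ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
print(code_points(garbled))    # ['U+0063', 'U+0061', 'U+0066', 'U+00C3', 'U+00A9']

# Every non-ASCII artifact sits in Latin-1 Supplement (U+0080-U+00FF).
artifacts = [ch for ch in garbled if ord(ch) > 0x7F]
print(all(0x80 <= ord(ch) <= 0xFF for ch in artifacts))   # True

# Reversing the mistaken step recovers the text in this simple case.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)    # True
```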

For a fuller workflow, see How to Detect Mojibake and Fix Broken Text Encoding.

Auditing whitespace and invisible characters

Not every debugging problem involves letters. Unicode includes multiple spaces, joiners, non-breaking characters, and formatting controls. These may belong to ranges that are easy to miss in editors and logs. A block and code point inspection routine is often the fastest way to confirm whether a bug comes from ordinary spaces or hidden Unicode characters.
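A simple inspection routine along those lines is sketched below. It reports every character whose general category is a separator (Zs, Zl, Zp) or a format control (Cf), skipping the ordinary space. The function name is illustrative, and the category list is an assumption about what counts as "hidden"; ASCII control characters such as tab and newline are deliberately left out.

```python
import unicodedata

HIDDEN_CATEGORIES = {"Zs", "Zl", "Zp", "Cf"}


def hidden_characters(s: str) -> list[tuple[int, str, str, str]]:
    """Return (index, code point, category, name) for non-obvious spaces and controls."""
    findings = []
    for index, ch in enumerate(s):
        if ch == " ":
            continue  # ordinary U+0020 space
        category = unicodedata.category(ch)
        if category in HIDDEN_CATEGORIES:
            name = unicodedata.name(ch, "<unnamed>")
            findings.append((index, f"U+{ord(ch):04X}", category, name))
    return findings


print(hidden_characters("a b"))        # []
print(hidden_characters("a\u00A0b"))   # [(1, 'U+00A0', 'Zs', 'NO-BREAK SPACE')]
print(hidden_characters("a\u200Bb"))   # [(1, 'U+200B', 'Cf', 'ZERO WIDTH SPACE')]
```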

See Unicode Whitespace Characters List and Testing Guide.

Reviewing text for search and deduplication

If two records fail to match, the difference may come from composed versus decomposed characters, script mixing, presentation forms, or punctuation from different ranges. A Unicode range reference helps you inspect the suspect characters before you adjust your normalization and comparison pipeline.

See How to Normalize and Compare User Input Across Languages.

Checking font coverage and layout regressions

When a UI starts clipping, substituting, or misaligning text after a font change, block-aware inspection can show whether the missing glyphs come from one specific range. This is especially useful in multilingual interfaces, dashboards, CMS previews, and browser-based developer tools where the same component may render data from many locales.

Working with transliteration and slug pipelines

When you convert names or titles into ASCII-friendly slugs, knowing the source blocks helps you choose the right transliteration rules and identify where custom mappings are required. Latin extensions, Greek, Cyrillic, and Indic scripts all present different tradeoffs.
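The tradeoffs show up quickly in a deliberately naive slug function. The sketch below decomposes with NFD, drops nonspacing marks, discards whatever is still non-ASCII, and hyphenates the rest. It is not a transliteration library, and naive_slug is an illustrative name.

```python
import re
import unicodedata


def naive_slug(title: str) -> str:
    """Strip diacritics via NFD, drop remaining non-ASCII, and hyphenate."""
    decomposed = unicodedata.normalize("NFD", title)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    ascii_only = stripped.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


print(naive_slug("Cr\u00E8me Br\u00FBl\u00E9e"))               # creme-brulee
print(naive_slug("\u0141\u00F3d\u017A"))                        # odz (the L WITH STROKE is lost)
print(repr(naive_slug("\u041C\u043E\u0441\u043A\u0432\u0430"))) # '' (Cyrillic needs real transliteration)
```

Letters from Latin-1 Supplement decompose cleanly, LATIN CAPITAL LETTER L WITH STROKE from Latin Extended-A has no decomposition and disappears, and Cyrillic input yields an empty slug. Knowing the source block tells you in advance which of these outcomes to expect and where a custom mapping is needed.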

See Best Libraries for Unicode Transliteration and Slugification.

Doing fast lookup before deeper analysis

In many workflows, blocks are the first layer of diagnosis:

  • copy a suspicious character
  • inspect its code point
  • find the containing block
  • check whether the surrounding string mixes blocks unexpectedly
  • then move on to script, category, normalization, or rendering analysis

That sequence is fast, teachable, and useful across QA, frontend development, API validation, and SEO-focused content tooling.

If you want a broader survey of useful lookup utilities, see Best Unicode Characters and Emoji Lookup Tools.

When to revisit

Use this section as a maintenance checklist. Unicode block references are evergreen, but your understanding and implementation rules should be revisited whenever your text surface area changes.

Revisit when your product adds new languages or markets

If you expand beyond a narrow Latin-only workflow, your assumptions about blocks, scripts, widths, line breaking, and normalization can stop being sufficient very quickly. Review your accepted ranges, fallback fonts, search rules, and QA examples.

Revisit when validation rules become user-visible

If people begin reporting rejected names, broken pasted content, or inconsistent search results, block lookup is often the first diagnostic step. Revisit your reference before hardening any new restrictions.

Revisit when font stacks or rendering behavior change

A new design system, webfont, or browser baseline can expose previously hidden gaps in character support. When text rendering changes, review the affected characters by block and verify whether the issue is glyph coverage, shaping, width, or normalization.

Revisit when you encounter mixed-script or security-sensitive text

Usernames, domains, promo codes, and internal identifiers deserve extra inspection. Block lookup alone is not a security control, but it is a practical way to spot unusual ranges and mixed-script input before escalating to confusable or policy checks.
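Python's standard library does not expose the Unicode Script property, so the sketch below uses a rough stand-in: the first word of each letter's character name (LATIN, CYRILLIC, GREEK, and so on). This is a heuristic and an assumption of this example, not real script data; it mislabels some characters, such as fullwidth letters whose names begin with FULLWIDTH. For production checks, use a library that provides the actual Script property.

```python
import unicodedata


def script_hints(s: str) -> set[str]:
    """Collect the first word of each letter's Unicode name as a rough script hint."""
    hints = set()
    for ch in s:
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                hints.add(name.split(" ")[0])
    return hints


def looks_mixed(s: str) -> bool:
    """Return True when letters in s appear to come from more than one script."""
    return len(script_hints(s)) > 1


print(sorted(script_hints("paypal")))         # ['LATIN']
print(sorted(script_hints("p\u0430ypal")))    # ['CYRILLIC', 'LATIN']
print(looks_mixed("p\u0430ypal"))             # True
```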

Revisit when examples in your docs become stale

Reference content remains useful only if it reflects the kinds of strings your team actually handles. Refresh your examples with current application data, not just textbook samples.

A practical workflow to keep

  1. Capture the exact character or string.
  2. Convert it to code points.
  3. Identify the containing block for each code point.
  4. Check for mixed blocks, combining marks, width variants, and invisible characters.
  5. Then evaluate script, category, normalization, and rendering context.
  6. Document the finding so the next debugging pass is faster.

If you maintain release processes, pairing this reference with a QA checklist is worthwhile. A good starting point is How to Build a Unicode Text QA Checklist for Web Releases.

The main takeaway is simple: use Unicode blocks as a fast map, not the whole territory. They are excellent for lookup, triage, and orientation. For production decisions, combine them with script analysis, normalization, category data, and real-world test strings. That approach gives you a reference you can return to whenever text behaves in ways your tooling did not initially make obvious.

