Encode and Decode URLs With Unicode Characters

A practical guide to encoding and decoding URLs with Unicode text, percent-encoding, path handling, and common multilingual URL mistakes.

URLs seem simple until a page title, filename, query value, or slug includes Japanese, Arabic, accented Latin letters, emoji, or other non-ASCII characters. At that point, many developers run into broken links, double-encoding bugs, unreadable logs, and mismatched routes across browsers, frameworks, and APIs. This guide explains how URL encoding works when Unicode is involved, where percent-encoding belongs, how paths differ from query strings, and which mistakes are most likely to create production issues. The goal is practical: help you encode and decode multilingual URLs confidently without damaging meaning or interoperability.

Overview

This section gives you a working mental model. If you understand one thing, make it this: a URL is not just “text.” Different parts of a URL follow different rules, and non-ASCII characters are typically represented through UTF-8 bytes and percent-encoding when serialized.

When developers say “URL encoding,” they usually mean percent-encoding. In percent-encoding, a byte is written as a percent sign followed by two hexadecimal digits. A space may appear as %20 in general URL encoding, though in form-style query strings it may also be represented as +. That distinction matters because not every encoder handles query strings, paths, and form bodies the same way.

Non-ASCII URL handling becomes clearer if you separate three ideas:

Unicode text: the human-readable string, such as café, 東京, or مرحبا.
UTF-8 bytes: the byte representation of that text.
Percent-encoded URL form: those bytes written into a URL-safe representation, such as caf%C3%A9.

For example, the character é is not ASCII. In UTF-8 it becomes the two bytes 0xC3 and 0xA9, and each byte is percent-encoded in the URL as %C3 and %A9. That is why café often appears as caf%C3%A9 in logs, address bars, or server-side routing.
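The three layers above can be made concrete with a small sketch. It assumes a JavaScript runtime that provides TextEncoder (modern browsers and Node.js do). percentEncodeBytes is a hypothetical helper written for illustration, not a standard API, and it is slightly stricter than the built-in encodeURIComponent(), which leaves a few punctuation characters such as ! and * unescaped.

```javascript
// Layer 1: Unicode text (a JavaScript string).
const text = "café";

// Layer 2: UTF-8 bytes. TextEncoder always produces UTF-8.
const bytes = new TextEncoder().encode(text);

// Layer 3: percent-encoded form.
// percentEncodeBytes is a hypothetical helper: it keeps only the
// unreserved ASCII characters (letters, digits, "-", ".", "_", "~")
// and writes every other byte as "%" plus two uppercase hex digits.
function percentEncodeBytes(byteArray) {
  let out = "";
  for (const b of byteArray) {
    const ch = String.fromCharCode(b);
    if (/[A-Za-z0-9\-._~]/.test(ch)) {
      out += ch;
    } else {
      out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

const encoded = percentEncodeBytes(bytes);

console.log(Array.from(bytes)); // [ 99, 97, 102, 195, 169 ]
console.log(encoded);           // caf%C3%A9
```

For this input the result matches what encodeURIComponent("café") returns.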

It also helps to remember that a URL has components:

Scheme: https
Host: example.com
Path: /products/café
Query: ?q=東京
Fragment: #section

Each component has its own encoding expectations. A safe way to work is to treat components separately instead of encoding the entire URL as one opaque string.
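One way to see the components separately is to let the standard URL class parse the example above. This is only a sketch: the URL string is assembled from the sample components listed earlier, and the comments show the serialized form each property returns.

```javascript
// Parse a URL that contains Unicode in the path and query.
const parsed = new URL("https://example.com/products/café?q=東京#section");

console.log(parsed.protocol); // "https:"
console.log(parsed.hostname); // "example.com"
console.log(parsed.pathname); // "/products/caf%C3%A9"
console.log(parsed.search);   // "?q=%E6%9D%B1%E4%BA%AC"
console.log(parsed.hash);     // "#section"

// Reading the values back as text, one component at a time.
const lastSegment = decodeURIComponent(parsed.pathname.split("/").pop());
const queryValue = parsed.searchParams.get("q"); // already decoded

console.log(lastSegment); // "café"
console.log(queryValue);  // "東京"
```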

If your workflow regularly touches slugs, multilingual input, or browser-based developer tools, it is also worth comparing utilities that expose component-level behavior rather than only a single “encode/decode” button. For broader evaluation criteria, see How to Compare Browser-Based Unicode Tools for Daily Dev Work.

Core framework

This section gives you a repeatable method for handling international URL characters correctly.

1. Start with the URL component, not the full string

The most common source of bugs is using one encoder on an entire URL. For example, if you encode https://example.com/search?q=café as one blob, you may accidentally encode characters like :, /, ?, and = that are meaningful separators.

A better pattern is:

Build the base URL structure.
Encode each dynamic component in the right context.
Join them using a URL-aware API when possible.

That means path segments should be handled as path segments, and query values should be handled as query values.
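The three steps above can be sketched as one small function. buildUrl is a hypothetical helper name, and the base URL, segments, and parameters are made-up sample values. It assumes every entry in segments is a single path segment of raw, not yet encoded text.

```javascript
// buildUrl is a hypothetical helper, not a standard API.
function buildUrl(base, segments, params) {
  // Step 1: build the base structure with a URL-aware API.
  const url = new URL(base);

  // Step 2: encode each dynamic path segment in path context.
  const basePath = url.pathname.replace(/\/+$/, ""); // drop trailing slashes
  const encodedSegments = segments.map((s) => encodeURIComponent(s));
  url.pathname = [basePath, ...encodedSegments].join("/");

  // Step 3: let searchParams encode query names and values.
  for (const [name, value] of Object.entries(params)) {
    url.searchParams.set(name, value);
  }
  return url.href;
}

const href = buildUrl(
  "https://example.com/files",
  ["docs", "naïve approach"],
  { q: "smørrebrød", lang: "da" }
);

console.log(href);
// https://example.com/files/docs/na%C3%AFve%20approach?q=sm%C3%B8rrebr%C3%B8d&lang=da
```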

2. Understand what should be encoded

Not all characters in a URL are treated the same way. Some characters are reserved because they define structure. Others are unreserved and can appear directly in many contexts. Non-ASCII characters are usually serialized as percent-encoded UTF-8 bytes for transport and parsing.

As practical guidance:

Encode user-provided path segments before inserting them into the path.
Encode query parameter names and values separately.
Do not decode a value unless you know which component it came from, and decode it only once.

This matters because /files/a%2Fb and /files/a/b are not equivalent. In one case the slash is data; in the other it is a path separator.
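The difference shows up as soon as a path is split into segments. In this sketch, pathSegments is a hypothetical helper. It splits on the raw pathname first and decodes each segment afterwards, so an encoded slash stays inside its segment.

```javascript
// pathSegments is a hypothetical helper: split first, decode second.
function pathSegments(pathname) {
  return pathname
    .split("/")
    .filter((segment) => segment.length > 0)
    .map((segment) => decodeURIComponent(segment));
}

const slashAsData = pathSegments("/files/a%2Fb");
const slashAsSeparator = pathSegments("/files/a/b");

console.log(slashAsData);      // [ "files", "a/b" ]
console.log(slashAsSeparator); // [ "files", "a", "b" ]
```

Decoding the whole path before splitting would collapse both inputs into the same three segments, which is exactly the loss of information described above.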

3. UTF-8 is the practical default

When dealing with Unicode URLs on the modern web, UTF-8 is the practical baseline. If you see strange output after decoding percent-encoded sequences, the underlying issue is often one of these:

The bytes were not UTF-8 in the first place.
The string was decoded using the wrong character encoding.
The value was encoded twice and decoded once, or the reverse.
The original text already contained mojibake before URL handling began.

If the decoded result looks corrupted, the bug may be upstream in your text pipeline rather than in the URL layer. For that troubleshooting path, see How to Detect Mojibake and Fix Broken Text Encoding.
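Two of the failure modes listed above are easy to reproduce. This sketch uses two hypothetical helpers: tryDecode wraps decodeURIComponent(), which throws a URIError when the percent-encoded bytes are not valid UTF-8, and decodeAsLatin1 deliberately misreads each byte as one Latin-1 character to show what mojibake looks like.

```javascript
// tryDecode is a hypothetical helper: report failure instead of throwing.
function tryDecode(component) {
  try {
    return { ok: true, value: decodeURIComponent(component) };
  } catch (err) {
    if (err instanceof URIError) {
      return { ok: false, value: component };
    }
    throw err;
  }
}

// decodeAsLatin1 is a hypothetical helper for illustration only:
// it treats every percent-encoded byte as a single character.
function decodeAsLatin1(component) {
  return component.replace(/%([0-9A-Fa-f]{2})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

const utf8Result = tryDecode("caf%C3%A9");    // valid UTF-8
const latin1Input = tryDecode("caf%E9");      // é as a single Latin-1 byte
const mojibake = decodeAsLatin1("caf%C3%A9"); // UTF-8 read with the wrong encoding

console.log(utf8Result);  // { ok: true, value: "café" }
console.log(latin1Input); // { ok: false, value: "caf%E9" }
console.log(mojibake);    // "cafÃ©"
```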

4. Treat paths and query strings differently

This distinction is where many production issues begin.

Path segments identify hierarchical location. A path like /docs/naïve approach should not be constructed with raw spaces or unescaped special characters. Each segment should be encoded so that separators remain separators and segment content remains segment content.

Query strings represent parameterized data. A query like ?q=smørrebrød&lang=da contains keys and values separated by = and joined by &. Keys and values should be encoded independently. If a value itself contains an ampersand and you fail to encode it, the parser may split one value into multiple parameters.

In browser JavaScript, this often means preferring URL-aware APIs such as URL and URLSearchParams rather than assembling strings manually.
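The ampersand problem can be shown in a few lines. The search value here is a made-up example. The first half concatenates it raw, the second half lets URLSearchParams do the encoding.

```javascript
const searchValue = "smørrebrød & øl";

// Manual concatenation: the raw "&" is read as a parameter separator.
const naiveQuery = "?q=" + searchValue + "&lang=da";
const naiveParams = new URLSearchParams(naiveQuery);

console.log(naiveParams.get("q"));             // "smørrebrød " (value was cut)
console.log(Array.from(naiveParams.keys()));   // [ "q", " øl", "lang" ]

// URL-aware construction: names and values are encoded independently.
const safeParams = new URLSearchParams();
safeParams.set("q", searchValue);
safeParams.set("lang", "da");

const serialized = safeParams.toString();
const roundTripped = new URLSearchParams(serialized).get("q");

console.log(serialized.includes("%26")); // true, the "&" is now data
console.log(roundTripped === searchValue); // true
```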

5. Normalize only when your application requires it

Unicode introduces another subtle issue: visually identical text can have different underlying code point sequences. For example, some accented characters can be represented in composed or decomposed forms. If your app performs route matching, caching, deduplication, or security checks, normalization may matter before URL serialization.

However, normalization is an application decision, not a universal URL rule. Do not normalize indiscriminately if exact user input must be preserved. If comparison behavior matters in your system, review How to Normalize and Compare User Input Across Languages.
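A short sketch shows why this matters for URLs. It compares the composed and decomposed spellings of café and uses String.prototype.normalize() with NFC as one possible choice. Whether NFC is right for a given application is an assumption to verify, not a rule.

```javascript
const composed = "caf\u00E9";    // é as one code point (U+00E9)
const decomposed = "cafe\u0301"; // e followed by combining acute (U+0301)

// The two strings render the same but serialize differently.
console.log(composed === decomposed);        // false
console.log(encodeURIComponent(composed));   // caf%C3%A9
console.log(encodeURIComponent(decomposed)); // cafe%CC%81

// Normalizing both sides before comparison or encoding makes them agree.
const a = encodeURIComponent(composed.normalize("NFC"));
const b = encodeURIComponent(decomposed.normalize("NFC"));
console.log(a === b); // true
```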

6. Internationalized hosts are a separate concern

Non-ASCII hostnames are related but not identical to percent-encoding in paths and queries. Internationalized domain names use their own representation rules: each non-ASCII label is converted to an ASCII form that starts with xn-- (Punycode) rather than being percent-encoded. In practice, keep hostname handling separate from path and query handling. If your issue is the domain itself rather than the path after it, do not assume your normal path encoder is the right solution.
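The split between host and path handling is visible in the URL class. This sketch assumes a runtime whose URL implementation performs IDNA processing, as current browsers and standard Node.js builds do. münchen.example is a made-up hostname used only for illustration.

```javascript
const idnUrl = new URL("https://münchen.example/café");

// The host is converted to its ASCII (Punycode) form...
console.log(idnUrl.hostname); // "xn--mnchen-3ya.example"

// ...while the path is percent-encoded as UTF-8 bytes.
console.log(idnUrl.pathname); // "/caf%C3%A9"

// A path encoder is the wrong tool for a hostname:
console.log(encodeURIComponent("münchen.example")); // "m%C3%BCnchen.example"
```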

Practical examples

This section turns the framework into concrete patterns you can reuse.

Encoding a multilingual path segment

Suppose a page slug comes from a title: 東京ガイド. If you place that directly into a URL path, different systems may display it differently, but the serialized transport form will generally percent-encode the UTF-8 bytes.

Good pattern:

Base path: /articles/
Dynamic segment: 東京ガイド
Encode the segment before appending it

This preserves the path structure while safely representing the Unicode content.
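As a minimal sketch of that pattern, using the base path and slug from the example above and encodeURIComponent() as the segment encoder:

```javascript
const basePath = "/articles/";
const slug = "東京ガイド";

// Encode only the dynamic segment, then append it to the fixed base path.
const articlePath = basePath + encodeURIComponent(slug);

console.log(articlePath);
// /articles/%E6%9D%B1%E4%BA%AC%E3%82%AC%E3%82%A4%E3%83%89

// Reversing it: take the last segment and decode it once.
const decodedSlug = decodeURIComponent(articlePath.split("/").pop());
console.log(decodedSlug); // 東京ガイド
```

Each character in this slug takes three UTF-8 bytes, which is why five characters become fifteen percent-encoded triplets.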

If you are deciding whether to keep native-script slugs or convert them to ASCII approximations, that is partly a product and SEO question, not just an encoding question. A useful companion read is Slug Generation for Multilingual URLs: Unicode vs ASCII.

Encoding a query parameter with non-ASCII text

Imagine a search endpoint with the query value crème brûlée. If you concatenate the raw string into the URL manually, spaces and accented characters may produce inconsistent results across systems. Instead, encode the query value in query context.

Good pattern:

Parameter name: q
Parameter value: crème brûlée
Serialize via a query-aware encoder or API

This avoids accidental splitting, broken spaces, or invalid output.
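A sketch of that pattern with the URL class follows. The endpoint https://example.com/search is a placeholder, not a real API.

```javascript
const searchUrl = new URL("https://example.com/search");

// searchParams.set() encodes the name and value in query context.
searchUrl.searchParams.set("q", "crème brûlée");

console.log(searchUrl.href);
// https://example.com/search?q=cr%C3%A8me+br%C3%BBl%C3%A9e

// Reading the value back returns the original text.
console.log(searchUrl.searchParams.get("q")); // crème brûlée
```

URLSearchParams uses form-style serialization, so the space is written as + rather than %20.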

Preserving slashes inside data

Suppose a filename or identifier contains a slash as data, such as Q1/2026. If you inject it into a path without proper encoding, the slash becomes a separator and changes routing. The safe version encodes the segment so the slash is treated as content rather than structure.

This is one reason path parameters should always be encoded before joining them into the route.
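A small sketch makes the routing difference visible. The /reports/ prefix is a made-up route used only for illustration.

```javascript
const reportId = "Q1/2026";

// Unsafe: the slash inside the identifier becomes a path separator.
const unsafePath = "/reports/" + reportId;

// Safe: the identifier is encoded as one segment before joining.
const safePath = "/reports/" + encodeURIComponent(reportId);

console.log(unsafePath);                   // /reports/Q1/2026
console.log(unsafePath.split("/").length); // 4 (an extra segment appeared)
console.log(safePath);                     // /reports/Q1%2F2026
console.log(safePath.split("/").length);   // 3
```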

Decoding percent-encoded Unicode safely

If you receive a percent-encoded component like caf%C3%A9, decode it once in the correct component context. A result of café is expected. If a second decode throws an error or changes the value unexpectedly, you likely have a double-decoding bug.

A simple rule: if you are not certain whether data is already decoded, inspect the raw input first. Repeated decoding is a common source of corruption.
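The effect of a second decode is easiest to see when the text itself contains a percent sign. The input below is a made-up value.

```javascript
// Raw input as it arrived on the wire.
const rawComponent = "100%25%20caf%C3%A9";

// Decode once: this is the intended text.
const decodedOnce = decodeURIComponent(rawComponent);
console.log(decodedOnce); // "100% café"

// Decode again: the literal "%" is no longer followed by two hex digits,
// so decodeURIComponent() throws a URIError.
let secondPassFailed = false;
try {
  decodeURIComponent(decodedOnce);
} catch (err) {
  secondPassFailed = err instanceof URIError;
}
console.log(secondPassFailed); // true
```

Keeping rawComponent and decodedOnce in separate variables, as here, makes it harder to pass the wrong one to a decoder later.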

Using JavaScript without overcomplicating it

In browser JavaScript, a practical split is:

Use encodeURIComponent() for dynamic components such as path segments or query values.
Use decodeURIComponent() when reversing those encoded components.
Use URL and URLSearchParams to assemble full URLs instead of string concatenation.

For example, building a search URL is usually safer with a URL object and searchParams.set() than with manual string building. That approach also makes debugging easier because you can inspect each component separately.

Handling user input before encoding

If your URL components come from forms, APIs, or pasted content, test for more than obvious accented characters. Hidden whitespace, lookalike characters, and mixed normalization forms can all affect matching and routing.

Related references on unicode.live can help with those cases. For example, if your workflow needs transliteration or ASCII-only slugs before URL encoding, review Best Libraries for Unicode Transliteration and Slugification.

Common mistakes

This section highlights the failures that waste the most debugging time.

Encoding the whole URL at once

This is the classic mistake. It often turns separators into data and breaks parsing. Encode components, not the entire URL string.

Using the wrong function for the context

Some APIs are designed for a component, others for full URIs, and others for form submissions. Mixing them can leave reserved characters untouched when they should be encoded, or encode too much when they should not. If you are working with query parameters, use query-aware APIs. If you are encoding one dynamic path segment, use a component-level encoder.
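In JavaScript, the clearest instance of this is encodeURI() versus encodeURIComponent(). The sketch below runs the same made-up value through both.

```javascript
const dynamicValue = "a/b?c=d&e";

// encodeURI() is meant for a full URI, so it leaves separators alone.
console.log(encodeURI(dynamicValue));          // a/b?c=d&e

// encodeURIComponent() is meant for one component, so it escapes them.
console.log(encodeURIComponent(dynamicValue)); // a%2Fb%3Fc%3Dd%26e

// Both encode non-ASCII text and spaces the same way.
const fullUrl = "https://example.com/docs/naïve approach";
console.log(encodeURI(fullUrl));
// https://example.com/docs/na%C3%AFve%20approach
```

Using encodeURI() on a query value would leave & and = untouched, which is the "reserved characters left alone" failure described above.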

Confusing %20 with +

Space handling differs by context. In many generic URL contexts, a space is percent-encoded as %20. In form-style query serialization, a space may appear as +. Problems appear when one side expects one form and the other side decodes using different rules. Be explicit about whether you are handling a URL component or form-urlencoded data.
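The mismatch can be reproduced by decoding the same form-style string with two different decoders. The query string below is a sample value.

```javascript
const formStyle = "q=cr%C3%A8me+br%C3%BBl%C3%A9e";

// A form-aware parser turns "+" back into a space.
const viaParams = new URLSearchParams(formStyle).get("q");
console.log(viaParams); // "crème brûlée"

// decodeURIComponent() follows generic rules and leaves "+" as a plus sign.
const viaComponent = decodeURIComponent(formStyle.slice(2));
console.log(viaComponent); // "crème+brûlée"

// In form-style output, a literal plus sign must itself be escaped.
const literalPlus = new URLSearchParams({ q: "1+1" }).toString();
console.log(literalPlus); // "q=1%2B1"
```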

Double-encoding

If é becomes %C3%A9, and then the percent signs themselves are encoded again, you get %25C3%25A9. This usually happens when an already encoded value is passed through an encoder a second time. The result looks valid, but a single decode yields the literal text %C3%A9 instead of é, so it only round-trips if the receiver decodes twice as well.

One practical defense is to keep raw values and serialized values in separate variables and avoid reusing one in both roles.
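A sketch of that defense follows. The variable names are illustrative: one holds the raw text, one holds the serialized form, and only the raw one is ever passed to the encoder.

```javascript
const rawValue = "é";                             // text, never encoded
const encodedValue = encodeURIComponent(rawValue); // serialized form

console.log(encodedValue); // %C3%A9

// The bug: passing the serialized value through the encoder again.
const doubleEncoded = encodeURIComponent(encodedValue);
console.log(doubleEncoded); // %25C3%25A9

// One decode of the double-encoded value does not return the text.
console.log(decodeURIComponent(doubleEncoded)); // %C3%A9
console.log(decodeURIComponent(encodedValue));  // é
```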

Double-decoding

The inverse problem also appears in routers, middleware, reverse proxies, and custom parsers. A component that should be decoded once gets decoded twice, potentially turning protected characters into active separators. This can create logic bugs and, in some systems, security concerns.
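The sketch below shows how a second decode can turn protected data into an active separator. The input is a made-up value of the kind a validation layer might see after one decode.

```javascript
// Value as received: the slash was encoded, then the "%" was encoded too.
const wireValue = "..%252Fsecret";

// First decode: still one segment, no slash present.
const afterFirst = decodeURIComponent(wireValue);
console.log(afterFirst);               // "..%2Fsecret"
console.log(afterFirst.includes("/")); // false

// Second decode (for example in a later middleware): a real slash appears.
const afterSecond = decodeURIComponent(afterFirst);
console.log(afterSecond);               // "../secret"
console.log(afterSecond.includes("/")); // true
```

A check that ran after the first decode would have approved a value that later becomes a path traversal, which is why each component should be decoded in exactly one place.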

Assuming visible characters tell the whole story

Two strings may look the same but differ in underlying code points. Trailing spaces may be non-breaking spaces rather than regular spaces. Some punctuation and slashes may be visually similar but not identical. If a route “looks right” and still fails, inspect code points instead of trusting appearance alone. The Unicode Block Reference: Find Characters by Range and Script can help when you need to identify unfamiliar characters.
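A quick way to inspect code points is a small dump function. codePoints is a hypothetical helper name, and the two sample strings differ only in the kind of space they contain.

```javascript
// codePoints is a hypothetical helper: list each code point as U+XXXX.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const regularSpace = "a b";
const nonBreakingSpace = "a\u00A0b";

console.log(codePoints(regularSpace));     // [ "U+0061", "U+0020", "U+0062" ]
console.log(codePoints(nonBreakingSpace)); // [ "U+0061", "U+00A0", "U+0062" ]

// The difference is also visible in the encoded form.
console.log(encodeURIComponent(regularSpace));     // a%20b
console.log(encodeURIComponent(nonBreakingSpace)); // a%C2%A0b
```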

Skipping tests for multilingual and RTL input

A URL system that works for English may still fail for Arabic, Hebrew, Hindi, Thai, Chinese, or mixed-script input. Right-to-left text can also make copied URLs or logs harder to inspect visually. Include multilingual examples in your test cases, especially if titles, usernames, file paths, or search terms come from users.

If you maintain release processes around text correctness, build URL checks into your QA routine. A solid starting point is How to Build a Unicode Text QA Checklist for Web Releases.

When to revisit

This topic is worth revisiting whenever your URL inputs, frameworks, or tooling change. The underlying principles stay fairly stable, but the edge cases that matter in practice often shift with your application.

Review your approach when:

You add multilingual slugs, search, or user-generated paths.
You migrate to a new router, framework, CDN, or reverse proxy.
You switch from manual string building to URL-aware APIs, or the reverse.
You see unexpected percent sequences, broken links, or unreadable logs.
You begin preserving native-script identifiers rather than ASCII-only slugs.
You add validation, normalization, or security checks around user input.

A useful maintenance habit is to keep a small regression set of URL cases that cover:

Accented Latin text like café
CJK text like 東京
RTL text like مرحبا
Spaces and punctuation
Slash characters used as data
Emoji or supplementary-plane characters if your app allows them
Composed and decomposed Unicode forms where matching matters
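Such a regression set can start as a plain array and a loop. This is a minimal sketch with illustrative sample values. It checks only that each value survives an encode and decode round trip in component context and in query context, and a real suite would also exercise your router and proxy.

```javascript
const regressionCases = [
  "café",            // accented Latin
  "東京",            // CJK
  "مرحبا",           // RTL
  "a b, c & d = e?", // spaces and punctuation
  "Q1/2026",         // slash used as data
  "😀",              // supplementary-plane character
  "cafe\u0301",      // decomposed form of café
];

const failures = [];

for (const original of regressionCases) {
  // Component context: output must be printable ASCII and reversible.
  const encodedCase = encodeURIComponent(original);
  if (!/^[\x21-\x7E]+$/.test(encodedCase)) {
    failures.push("non-ASCII output for " + original);
  }
  if (decodeURIComponent(encodedCase) !== original) {
    failures.push("component round trip failed for " + original);
  }

  // Query context: serialize and parse with URLSearchParams.
  const params = new URLSearchParams();
  params.set("v", original);
  if (new URLSearchParams(params.toString()).get("v") !== original) {
    failures.push("query round trip failed for " + original);
  }
}

console.log(failures.length === 0 ? "all cases round-trip" : failures);
```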

For day-to-day work, the practical checklist is simple:

Decide which URL component you are handling.
Encode only that component.
Use UTF-8 based, URL-aware tooling where possible.
Decode once, in the right place.
Test with real multilingual examples, not only ASCII.
Inspect suspicious values at the code point or byte level when behavior is unclear.

If you return to this guide later, that checklist is the main thing to reuse. Most URL bugs with non-ASCII characters come from losing track of context: path vs query, raw text vs serialized form, display vs transport. Keep those boundaries clear, and percent-encoding becomes a predictable tool rather than a source of surprises.
