How to Detect Mojibake and Fix Broken Text Encoding

A practical checklist for spotting mojibake, tracing encoding mismatches, and repairing broken text safely across web and data workflows.

Mojibake is what happens when text bytes are decoded with the wrong character encoding, and the result is usually obvious only after the damage has already spread through a page, database, feed, export, or API response. This guide gives you a reusable checklist for detecting broken text encoding, recognizing common corruption patterns, and applying safe recovery steps across common web stacks. The goal is practical: help you decide whether the text can be repaired, where the encoding mismatch happened, and how to stop the same issue from returning.

Overview

If you need to fix mojibake, start with one rule: bytes are not the same as text. Text becomes broken when a system reads a sequence of bytes using the wrong encoding assumptions. The bytes may still be intact, but the decode step is wrong. In other cases, the text has already been corrupted and re-saved, which makes recovery harder.

Classic examples include:

cafÃ© instead of café
â€” instead of an em dash —
ðŸ˜‚ instead of an emoji
FranÃ§ois instead of François

These patterns often point to UTF-8 bytes being decoded as Windows-1252 or ISO-8859-1. Not every case is that simple, but many everyday repair jobs begin there.
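The mismatch behind these examples can be reproduced in a few lines. The sketch below encodes correct text as UTF-8 and then decodes the same bytes as Windows-1252. The helper name `misdecode` is an illustrative assumption, not part of any library.

```python
def misdecode(text: str, wrong_encoding: str = "cp1252") -> str:
    """Simulate mojibake: encode as UTF-8, then decode with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


for original in ["café", "François", "—", "😂"]:
    # Each line shows the intended text next to its garbled form.
    print(repr(original), "->", repr(misdecode(original)))
```

Running this prints the same garbled forms listed above, which is a quick way to confirm that a pattern you see in production matches this particular mismatch.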

A reliable workflow usually follows this order:

Identify whether the problem is display-only or stored corruption.
Capture the original bytes if possible.
Find the first system that interpreted those bytes incorrectly.
Test a reversible repair on a sample.
Apply the fix in one controlled place, not across the whole stack at once.
Add checks so the issue does not reappear.

Before changing any data, keep a copy of the raw source. If the text has been passed through multiple tools, editors, exports, or database layers, the repair path depends on whether you still have the original byte stream.

It also helps to separate mojibake from nearby Unicode problems. Broken text encoding is not the same as:

Missing font glyphs
Bidirectional display issues
Zero-width characters affecting layout
Normalization differences between visually similar strings

If the characters exist but render strangely because of script direction or invisible control characters, you may need adjacent debugging steps. For related cases, unicode.live also covers RTL and LTR debugging, zero-width character removal, and UTF-8 vs UTF-16 vs UTF-32.

Checklist by scenario

Use this section as a quick diagnosis map. Start with the scenario that matches where the garbled text first appears.

1. The browser page shows broken characters, but the source data may still be correct

What you get: a fast checklist for front-end decoding mismatches.

Check the HTTP Content-Type header and its charset declaration.
Check the document-level encoding declaration, such as <meta charset="utf-8">.
Make sure the declaration appears early in the document.
Confirm the actual file bytes are saved in the declared encoding.
Open the network response and inspect raw response bytes if your tooling allows it.
Compare what the API returned with what the browser rendered.

If the page says UTF-8 but the file was actually saved in another encoding, the declaration alone will not help. Likewise, if the bytes are UTF-8 but an old proxy or template layer declares Latin-1, you will see classic UTF-8 mojibake.

Common clue: punctuation becomes sequences like â€œ, â€, or â€“. That often means curly quotes or dashes were encoded as UTF-8 and decoded as a legacy Western encoding.
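One way to compare the declared charset with the actual bytes is sketched below. It assumes you have already captured the `Content-Type` header value and the raw response body. The function names `declared_charset` and `decodes_cleanly` are hypothetical helpers built on the Python standard library.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decodes_cleanly(body: bytes, charset: str) -> bool:
    """Return True if the bytes are valid in the given encoding."""
    try:
        body.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):
        return False


header = "text/html; charset=ISO-8859-1"   # example header value
body = "café".encode("utf-8")              # stand-in for captured response bytes

print(declared_charset(header))            # iso-8859-1
print(decodes_cleanly(body, "utf-8"))      # True
print(body.decode(declared_charset(header)))  # cafÃ©
```

A clean decode is only weak evidence for single-byte encodings: ISO-8859-1 accepts every byte value, so it never fails. A clean strict UTF-8 decode of text containing non-ASCII characters is a much stronger signal.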

2. The database already contains garbled text

What you get: a safe approach before attempting bulk repair.

Export a small sample in a way that preserves original bytes as much as possible.
Check the database character set, collation, table settings, connection settings, and client settings.
Verify the application driver is sending and receiving text in the expected encoding.
Determine whether the bad text was stored already corrupted or only displayed incorrectly on read.
Look for patterns repeated across rows from the same import window or migration.
Test repair on a copy, never in place.

This distinction matters: if café became cafÃ© before insertion, the database may be faithfully storing broken text. If the database stores correct UTF-8 but a reporting script decodes it incorrectly on output, the fix belongs in the export layer instead.

Many teams lose time by changing table settings after the fact and expecting already-broken rows to heal automatically. Encoding configuration can prevent future corruption, but it does not usually repair damaged text already saved.
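To tell stored corruption from a read-side problem, it can help to classify the raw bytes of a few sample values. The sketch below is a heuristic, assuming the column is meant to hold UTF-8 and that the suspected wrong decode was Windows-1252. The function name `classify_stored_bytes` and its labels are illustrative.

```python
def classify_stored_bytes(stored: bytes) -> str:
    """Heuristically label raw column bytes that are expected to be UTF-8."""
    try:
        text = stored.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    try:
        # If undoing a cp1252 decode yields valid UTF-8, the stored text
        # was probably mis-decoded before it was written.
        candidate = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "valid UTF-8"
    return "possibly double-encoded" if candidate != text else "valid UTF-8"


samples = {
    "correct": "café".encode("utf-8"),        # 63 61 66 c3 a9
    "corrupted": "cafÃ©".encode("utf-8"),     # 63 61 66 c3 83 c2 a9
    "legacy": "café".encode("latin-1"),       # 63 61 66 e9
}
for label, raw in samples.items():
    print(label, raw.hex(" "), "->", classify_stored_bytes(raw))
```

Because legitimate text can occasionally contain sequences such as `Ã©`, treat "possibly double-encoded" as a prompt to inspect the rows, not as proof.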

3. A CSV, spreadsheet export, or text file looks fine in one app and broken in another

What you get: a file-oriented checklist for common handoff problems.

Check whether the file includes a BOM and whether the target app expects one.
Open the same file in a code editor that lets you inspect and switch encodings.
Confirm whether the exporter wrote UTF-8, UTF-16, or a legacy encoding.
Check line-ending conversions and any intermediate scripts that may re-save the file.
If a spreadsheet is involved, test import options explicitly instead of relying on autodetection.

Spreadsheet software is a frequent source of confusion because import and save behavior may differ by workflow. A file that is valid UTF-8 can still look broken if the receiving app guesses a different code page.
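Checking for a BOM needs only the first few bytes of the file. The sketch below uses the BOM constants from Python's `codecs` module. The function name `sniff_bom` is hypothetical, and the returned codec names are chosen so that decoding with them consumes the BOM.

```python
import codecs
from typing import Optional

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM starts with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name implied by a leading BOM, or None if there is none."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"name,city\r\n"))  # utf-8-sig
print(sniff_bom("name,city".encode("utf-16")))        # utf-16
print(sniff_bom(b"name,city\r\n"))                    # None
```

For a file on disk, reading the first four bytes in binary mode is enough input for this check. A result of `None` does not identify the encoding; it only tells you that autodetection in the receiving app has no BOM to rely on.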

4. API payloads or webhooks contain broken text

What you get: a checklist for services and integrations.

Inspect the raw HTTP payload, not just a prettified client view.
Check response headers and request headers for charset details.
Verify your application framework does not decode and re-encode payloads unexpectedly.
Confirm logs are not masking the issue by escaping or truncating characters.
Test the same payload through a minimal script to isolate framework behavior.

If the sender emits valid UTF-8 but your middleware treats bytes as Latin-1 before JSON parsing or logging, the corruption can appear downstream in places that were never at fault. Keep one known-good raw payload for comparison.
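A minimal script like the following can isolate framework behavior by decoding a captured payload explicitly. The payload here is a stand-in for a real webhook body, and `parse_payload` is a hypothetical helper.

```python
import json


def parse_payload(raw: bytes, charset: str = "utf-8") -> dict:
    """Decode raw payload bytes strictly with the given charset, then parse JSON."""
    return json.loads(raw.decode(charset, errors="strict"))


# Stand-in for a known-good raw payload captured from the sender.
raw = '{"name": "François"}'.encode("utf-8")

print(parse_payload(raw)["name"])             # François
print(parse_payload(raw, "latin-1")["name"])  # FranÃ§ois
```

If the minimal script produces correct text from the raw bytes while your application does not, the mismatch is somewhere in the middleware or client configuration, not in the sender.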

5. Emoji or non-Latin scripts are especially broken

What you get: a way to narrow cases where ASCII looked fine but everything else failed.

Check whether the pipeline was only tested with plain English text.
Inspect code points and byte sequences for the affected characters.
Confirm every layer supports the full character range required by your data.
Check for lossy conversions to legacy encodings that cannot represent the characters.
Look for replacement characters such as �, which may indicate data loss rather than a reversible decode mistake.

ASCII often survives bad pipelines, which can create a false sense of safety. Emoji, accented Latin text, Arabic, Cyrillic, CJK, and combining marks expose weak assumptions quickly. Tools that let you inspect code points can help you tell whether you have the intended characters or only a visual approximation. See How to Inspect and Convert Unicode Code Points Online and Best Unicode Characters and Emoji Lookup Tools.
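If you prefer a local check to an online tool, a short script can list the code point, Unicode name, and UTF-8 bytes for each character. This is a sketch using Python's `unicodedata` module; the function name `describe` is an assumption.

```python
import unicodedata


def describe(text: str) -> list:
    """Return (code point, Unicode name, UTF-8 bytes in hex) for each character."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),
            ch.encode("utf-8").hex(" "),
        ))
    return rows


for row in describe("é😂"):
    print(row)
# ('U+00E9', 'LATIN SMALL LETTER E WITH ACUTE', 'c3 a9')
# ('U+1F602', 'FACE WITH TEARS OF JOY', 'f0 9f 98 82')
```

Running the same function on a garbled string shows immediately whether you hold one intended character or several unrelated ones.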

6. You need to repair text, not just diagnose it

What you get: a conservative repair workflow.

Collect several known-bad examples and, if possible, the intended correct text.
Group examples by pattern. Do not assume one repair rule fits all rows.
Test whether the text can be restored by re-encoding the garbled string as one encoding and decoding as another.
Validate repaired output on accented text, punctuation, emoji, and non-Latin scripts.
Confirm the repair is reversible on sample data before running it widely.
Apply the fix only after isolating the root cause, or the problem may reoccur immediately.

A common reversible case is text that was originally UTF-8 but later interpreted as Windows-1252 or ISO-8859-1. If that happened only once and no further damage followed, repair may be straightforward. If the text was decoded wrongly, saved, exported, re-imported, and transformed again, the corruption may be layered. In layered cases, repair often requires several careful passes, and some information may be unrecoverable.
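The single-pass case can be sketched as follows. This is a conservative illustration, not a general-purpose fixer: it assumes the damage was UTF-8 decoded as Windows-1252, and it returns `None` whenever that assumption does not hold cleanly. The function name `try_repair` is hypothetical.

```python
from typing import Optional


def try_repair(garbled: str, wrong: str = "cp1252", right: str = "utf-8") -> Optional[str]:
    """Return repaired text, or None if this repair path does not apply cleanly."""
    try:
        # Undo the wrong decode, then decode the recovered bytes correctly.
        repaired = garbled.encode(wrong).decode(right)
        # Reversibility check: reproducing the damage must return the input.
        if repaired.encode(right).decode(wrong) != garbled:
            return None
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None
    return repaired


print(try_repair("cafÃ©"))   # café
print(try_repair("café"))    # None: already-correct text is left alone

# A layered case: the same damage applied twice needs two passes.
layered = "café".encode("utf-8").decode("cp1252").encode("utf-8").decode("cp1252")
print(try_repair(try_repair(layered)))  # café
```

Plain ASCII input comes back unchanged, so a caller should treat an identical result as "nothing to fix". Python's `cp1252` codec also leaves five byte values undefined, so garbled strings produced by tools that map those bytes differently may return `None` here and need separate handling.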

When evaluating a candidate fix, use a sample set with:

Accented Latin text like café and naïve
Smart punctuation and dashes
Emoji
At least one right-to-left script if your product supports it
Combining marks and precomposed forms where relevant

What to double-check

This section is the sanity check list that catches many avoidable mistakes before you edit settings or run a repair script.

Trace the first bad hop

Do not start where the bug is most visible. Start where the text was last known to be correct. Then follow the path forward: source file, form submission, API, queue, database, ORM, template, response, browser, export. The first bad hop matters more than the final symptom.

Compare bytes, code points, and rendered text

Developers often inspect only the rendered text. Instead, check three levels when possible:

Bytes: what was actually stored or transmitted
Code points: what characters the system believes it has
Rendered output: what the user sees on screen

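If the bytes and code points are right but the render is wrong, you may have a font or layout issue rather than mojibake. If the bytes are right but the code points are wrong, a decode step failed somewhere between storage and display. If the bytes themselves are already wrong, the corruption happened upstream of where you captured them.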

Normalization is separate from encoding

Two strings can look identical but have different internal forms, such as precomposed and combining sequences. That affects comparison, searching, and deduplication, but it is not the same as broken text encoding. Treat normalization and mojibake as different problems, even if both appear in multilingual pipelines. If you need escape-level inspection, How to Convert Text to Unicode Escape Sequences and the HTML Unicode Escapes Reference for Developers can help.
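The difference is easy to see in code. The sketch below compares a precomposed and a combining form of the same word using Python's `unicodedata.normalize`; the helper name `same_after_nfc` is an assumption.

```python
import unicodedata

precomposed = "caf\u00e9"   # é as a single code point
combining = "cafe\u0301"    # e followed by COMBINING ACUTE ACCENT


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == combining)                # False
print(same_after_nfc(precomposed, combining))  # True
```

Neither string is mojibake: both decode correctly and render the same. Running an encoding repair on either one would be a mistake.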

Legacy defaults still appear in modern stacks

Even when your app is designed around UTF-8, older defaults can leak in through database clients, shell environments, import tools, editors, email systems, and export utilities. A single non-UTF-8 step can produce garbled text that looks like a platform-wide failure.

Replacement characters usually mean some data is gone

If you see �, be cautious. The Unicode replacement character often means invalid byte sequences were encountered and substituted during decoding. That can indicate a repairable decode mismatch, but it can also mean original bytes were discarded. Recovery may be partial at best.
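The loss can be demonstrated directly. In this sketch, Latin-1 bytes are decoded as UTF-8 with the `replace` error handler, which is one common way replacement characters get into stored text.

```python
# Two different accented characters, stored as Latin-1 bytes.
lossy_cafe = "café".encode("latin-1").decode("utf-8", errors="replace")
lossy_naive = "naïve".encode("latin-1").decode("utf-8", errors="replace")

print(lossy_cafe)    # caf\ufffd rendered as caf + replacement character
print(lossy_naive)   # na + replacement character + ve

# Both é and ï became the same U+FFFD, so the string alone
# no longer says which character was there.
print(lossy_cafe.count("\ufffd"), lossy_naive.count("\ufffd"))  # 1 1
```

Once the substituted string has been saved, only the original bytes or another copy of the source can restore the missing characters.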

Script direction and invisible controls can hide the real issue

If the text appears scrambled only in mixed-direction content, check bidirectional controls and display order before concluding you have an encoding problem. Script detection can also clarify whether characters are from the expected writing system. Related reading: Unicode Script Detection Methods Compared.
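A quick way to rule invisible controls in or out is to list every format character in the string. The sketch below relies on the Unicode general category `Cf`, which covers zero-width characters and bidirectional controls among others. The function name `find_format_chars` is hypothetical.

```python
import unicodedata


def find_format_chars(text: str) -> list:
    """Return (index, code point, name) for each format (Cf) character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


print(find_format_chars("ab\u200bc"))
# [(2, 'U+200B', 'ZERO WIDTH SPACE')]
print(find_format_chars("\u202eabc"))
# [(0, 'U+202E', 'RIGHT-TO-LEFT OVERRIDE')]
```

An empty result does not prove the text is fine, but a non-empty one on a string that "looks scrambled" points away from mojibake and toward display order or invisible characters.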

Common mistakes

These are the traps that make broken text encoding harder to repair.

Changing every layer at once. If you update database settings, app config, and export scripts together, you may not know which change fixed or worsened the issue.
Assuming UTF-8 everywhere because the codebase says so. One import script, shell locale, or editor save action can still break the chain.
Repairing already-correct text. Some strings only look odd because of display or font issues. Running a mojibake fix on them can create real corruption.
Ignoring intermediate storage. Queues, caches, logs, message brokers, and ETL jobs can be the actual source of corruption.
Confusing HTML escaping with encoding. Literal entities such as &mdash; or &amp; showing up in output point to escaping rules, not necessarily to character decoding.
Running bulk updates without byte-level backups. A bad repair pass can turn recoverable text into permanent loss.
Stopping at display fixes. If the root cause remains in the ingest path, the same mojibake will return with the next batch, import, or deploy.

A useful rule is simple: if you cannot explain exactly which bytes were misread and where, do not bulk-fix production data yet.
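To illustrate the escaping-versus-encoding distinction from the list above, the sketch below shows that HTML entities are resolved by unescaping, while an encoding round trip leaves them untouched. The sample string is invented for the example.

```python
import html

escaped = "Caf&eacute; &mdash; Fish &amp; Chips"

# Entities are plain ASCII, so re-encoding changes nothing.
print(escaped.encode("cp1252").decode("utf-8") == escaped)  # True

# Unescaping is the step that actually resolves them.
print(html.unescape(escaped))  # Café — Fish & Chips
```

If a string contains visible entities, check for double escaping or a missing unescape step before touching any charset settings.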

When to revisit

Encoding bugs often return when workflows change. Use this section as an action list for future reviews.

Revisit your mojibake detection and prevention checklist when:

You migrate databases, drivers, frameworks, or hosting environments.
You add a new import or export path such as CSV, spreadsheet, feed, or webhook processing.
You expand language support, add emoji-heavy user input, or launch in new regions.
You change CMS, templating, static-site generation, or proxy layers.
You add search indexing, ETL, analytics pipelines, or log processing that touch text.
You notice support tickets mentioning weird symbols, question-mark diamonds, or copied text behaving differently across apps.
You are doing a scheduled workflow review before a major release or planning cycle.

For a practical maintenance routine, keep a small multilingual text fixture in your test workflow. Include accented words, smart punctuation, emoji, right-to-left text, and a few edge cases with combining marks. Pass that fixture through every critical path: form submission, API, database write and read, export, search indexing, and rendering. If the fixture changes unexpectedly at any point, investigate before the issue reaches user content.
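Such a fixture check might look like the sketch below. The fixture contents and the function name `check_round_trip` are illustrative assumptions, and `pipeline` stands for whatever callable wraps one of your critical paths.

```python
FIXTURE = [
    "café naïve",             # accented Latin text
    "\u201cquoted\u201d \u2014 dash",  # smart punctuation and an em dash
    "\U0001F602",             # emoji
    "\u05e9\u05dc\u05d5\u05dd",  # right-to-left text (Hebrew)
    "e\u0301 vs \u00e9",      # combining and precomposed forms
]


def check_round_trip(pipeline, samples=FIXTURE) -> list:
    """Return (input, output) pairs for samples the pipeline changed."""
    failures = []
    for sample in samples:
        result = pipeline(sample)
        if result != sample:
            failures.append((sample, result))
    return failures


# Two toy pipelines: one correct, one with a Latin-1 decode in the middle.
good = lambda s: s.encode("utf-8").decode("utf-8")
bad = lambda s: s.encode("utf-8").decode("latin-1")

print(len(check_round_trip(good)))  # 0
print(len(check_round_trip(bad)))   # 5
```

In a real test suite, each critical path would get its own `pipeline` wrapper, and any non-empty result would fail the build.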

Also maintain a short runbook:

Where to capture raw bytes
Which headers and settings to inspect first
Which repair patterns are known to be safe in your environment
Who owns the database, application, and integration layers
How to validate repaired text before rollout

Mojibake is rarely solved by a single setting in isolation. It is solved by tracing text carefully across boundaries and treating encoding assumptions as part of the system contract. If you return to this checklist whenever tools, imports, or language coverage change, you will catch most broken text encoding issues early and repair the rest with far less guesswork.

For broader Unicode maintenance, it is also worth reviewing Unicode Version History and Adoption Tracker when platform support changes, especially if your application depends on current emoji behavior or newer script support.


Unicode.live Editorial
