Web Development, Security, Privacy, Vue

Building an HTML Encode Tool

How we built a deterministic, standards-compliant HTML encoding utility that runs entirely in the browser — no servers, no tracking, no compromises.

HTML encoding is one of those fundamentals that every web developer learns early but rarely thinks about deeply. When we sat down to build the HTML Encode tool for CodeCultivation, what looked like a straightforward textarea-in, entities-out problem turned out to be a precise engineering exercise — entity maps, Unicode-safe iteration, double-encode prevention, and a strict security mode all needed careful thought. Here is how we did it.

Why HTML Encoding Matters

When user-supplied content is inserted into an HTML document without encoding, browsers may interpret it as markup rather than text. This is the root cause of Cross-Site Scripting (XSS) — one of the most prevalent web security vulnerabilities.

Consider a comment field that accepts <script>alert("XSS")</script>. If stored and rendered verbatim, every visitor who views that comment executes the script. Encoding the string to &lt;script&gt;alert(&quot;XSS&quot;)&lt;/script&gt; transforms it from executable markup into inert display text.

The OWASP XSS Prevention Cheat Sheet makes encoding its first rule: "HTML encode all untrusted data before placing it into HTML body content." Our tool makes that transformation immediate and verifiable.

The Five Critical Characters

Only five characters carry special meaning in HTML contexts and must always be encoded:

Character   Named entity   Decimal   Hexadecimal
&           &amp;          &#38;     &#x26;
<           &lt;           &#60;     &#x3C;
>           &gt;           &#62;     &#x3E;
"           &quot;         &#34;     &#x22;
'           &#39;          &#39;     &#x27;

One subtlety: the apostrophe ' had no named entity in HTML 4.01. The &apos; reference comes from XML and XHTML and only entered HTML with the HTML5 named character references, so older parsers may not recognise it. Using &#39;, the decimal numeric form, in named mode remains the safest and most widely-used convention.
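For concreteness, here is a plausible sketch of the three lookup tables the tool selects between (the names NAMED_MAP, DECIMAL_MAP, and HEX_MAP match those used by encodeChar later in this post; the exact literals in the tool may differ):

```typescript
// Lookup tables for the five critical characters.
// In named mode the apostrophe falls back to &#39; because
// &apos; was not recognised before HTML5.
const NAMED_MAP: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

const DECIMAL_MAP: Record<string, string> = {
  "&": "&#38;",
  "<": "&#60;",
  ">": "&#62;",
  '"': "&#34;",
  "'": "&#39;",
};

const HEX_MAP: Record<string, string> = {
  "&": "&#x26;",
  "<": "&#x3C;",
  ">": "&#x3E;",
  '"': "&#x22;",
  "'": "&#x27;",
};
```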

Named vs Decimal vs Hexadecimal — When Each Is Right

All three formats are universally understood by every HTML parser, so the choice is mostly about context and readability:

  • Named (&lt;, &amp;) — the most human-readable format and the conventional choice for hand-authored HTML. Developers scanning a document can immediately recognise what character is represented.
  • Decimal numeric (&#60;, &#38;) — useful when you need a format that works in any XML-family document without relying on a named entity dictionary. Also easier to verify by hand than hex.
  • Hexadecimal numeric (&#x3C;, &#x26;) — common in generated code and security tooling. The hex values map directly to Unicode code points, which can make debugging character set issues more intuitive.

Our tool lets you switch between all three with immediate re-encoding so you can see the difference on your actual content.

The Entity Preservation Problem

A naive encoding pass that replaces every & with &amp; immediately breaks content that already contains valid HTML entities. Paste &amp;lt; into such a tool and you get &amp;amp;lt; — broken double-encoded output.

The correct approach: split the input on a pattern that captures existing entities, then only encode the non-entity segments.

const ENTITY_PATTERN = /(&[a-zA-Z0-9#x]+;)/g;

function encodeHtml(text: string): string {
  if (!text) return "";
  if (forceReEncode.value) {
    return [...text].map(encodeChar).join("");
  }
  return text
    .split(ENTITY_PATTERN)
    .map((part, i) => (i % 2 === 1 ? part : [...part].map(encodeChar).join("")))
    .join("");
}

The key is the capture group in the regex. When String.prototype.split() receives a regex with a capture group, it includes the matched segments in the result array. Odd-indexed items are entity matches — pass them through unchanged. Even-indexed items are non-entity text — encode them character by character.
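A quick illustration of that split behaviour, using the same ENTITY_PATTERN regex:

```typescript
const ENTITY_PATTERN = /(&[a-zA-Z0-9#x]+;)/g;

// With a capture group, split() keeps the matched entities in the result.
const parts = "a &amp; b".split(ENTITY_PATTERN);
// parts is ["a ", "&amp;", " b"]:
// even indices (0, 2) are plain text to encode,
// the odd index (1) is an existing entity to pass through.
```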

This rule correctly handles:

  • &amp; → preserved as &amp; (no double-encode)
  • &lt (no semicolon) → &amp;lt (the & is encoded since it is not a valid entity)
  • & foo; (space in name) → &amp; foo; (space disqualifies it from the pattern)
  • &#60; and &#x3C; → preserved (numeric entities also match)

The Force Re-Encode toggle deliberately bypasses this logic for cases where you want to encode an entire string including its entities — useful when preparing content for display inside a code block.
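These behaviours can be checked with a self-contained sketch: the Vue refs are replaced by a plain parameter, and NAMED_MAP is assumed to cover just the five critical characters.

```typescript
const ENTITY_PATTERN = /(&[a-zA-Z0-9#x]+;)/g;
const NAMED_MAP: Record<string, string> = {
  "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;",
};

// Encode one character, or pass it through unchanged.
const encodeChar = (ch: string): string => NAMED_MAP[ch] ?? ch;

function encodeHtml(text: string, forceReEncode = false): string {
  if (!text) return "";
  if (forceReEncode) return [...text].map(encodeChar).join("");
  return text
    .split(ENTITY_PATTERN)
    .map((part, i) => (i % 2 === 1 ? part : [...part].map(encodeChar).join("")))
    .join("");
}

// Existing entities survive; malformed ones get their & encoded.
encodeHtml("&amp;");       // "&amp;"    (no double-encode)
encodeHtml("&lt");         // "&amp;lt"  (no semicolon, so not an entity)
encodeHtml("& foo;");      // "&amp; foo;"
encodeHtml("&#x3C;");      // "&#x3C;"   (numeric entities preserved)
encodeHtml("&amp;", true); // "&amp;amp;" (force re-encode bypasses preservation)
```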

Strict Security Mode

The five critical characters cover the vast majority of injection vectors. However, OWASP's guidance for attribute and certain script contexts goes further, recommending that all non-alphanumeric characters be encoded. Our Strict Mode implements this.

function encodeChar(char: string): string {
  const map =
    entityFormat.value === "named"
      ? NAMED_MAP
      : entityFormat.value === "decimal"
        ? DECIMAL_MAP
        : HEX_MAP;

  if (map[char]) return map[char];

  if (strictMode.value && /[^a-zA-Z0-9]/.test(char)) {
    const cp = char.codePointAt(0)!;
    return `&#${cp};`;
  }

  return char;
}

In strict mode, characters like / (→ &#47;), = (→ &#61;), and the backtick (→ &#96;) are all encoded as decimal numeric entities. We chose decimal over hexadecimal for the strict-mode extras because decimal is easier to verify manually — a human checking the output can more quickly confirm &#47; is / than &#x2F;.

The five critical characters continue to use whichever format is currently selected; strict mode only affects the additional characters that have no named entity.
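A self-contained sketch of the strict-mode behaviour, with the Vue refs replaced by a plain boolean and the named format assumed for the five critical characters:

```typescript
const NAMED_MAP: Record<string, string> = {
  "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;",
};

// Critical characters use the selected map; in strict mode,
// other non-alphanumerics become decimal numeric entities.
function encodeCharStrict(ch: string, strict: boolean): string {
  if (NAMED_MAP[ch]) return NAMED_MAP[ch];
  if (strict && /[^a-zA-Z0-9]/.test(ch)) return `&#${ch.codePointAt(0)!};`;
  return ch;
}

encodeCharStrict("/", true);  // "&#47;"
encodeCharStrict("=", true);  // "&#61;"
encodeCharStrict("`", true);  // "&#96;"
encodeCharStrict("a", true);  // "a" (alphanumerics always pass through)
encodeCharStrict("/", false); // "/" (outside strict mode)
```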

Unicode and Surrogate Pairs

JavaScript strings are UTF-16. Most characters fit in a single 16-bit code unit, but characters outside the Basic Multilingual Plane — emoji, historic scripts, many mathematical symbols — are represented as surrogate pairs: two code units that together encode a single Unicode code point.

If you iterate a string with a numeric index or charAt(), you see individual code units, which means emoji and other supplementary characters get split in half and corrupted. The fix is to use the spread iterator [...text], which correctly walks the string by Unicode code point:

// Correctly handles emoji and supplementary characters
return [...text].map(encodeChar).join("");

// In encodeChar, use codePointAt(0) rather than charCodeAt(0)
// codePointAt() returns the full code point even for surrogate pairs
const cp = char.codePointAt(0)!;
return `&#${cp};`;

For example, 😀 is U+1F600 — a single code point represented as two UTF-16 code units. [...text] gives you the whole emoji as a single element; codePointAt(0) gives you 128512, the correct decimal code point.
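The difference is easy to observe directly in a short sketch:

```typescript
const emoji = "😀"; // U+1F600: one code point, two UTF-16 code units

const units = emoji.length;       // 2, .length counts code units
const high = emoji.charCodeAt(0); // 55357 (0xD83D), just the high surrogate
const cp = emoji.codePointAt(0)!; // 128512 (0x1F600), the full code point

// Spread iterates by code point, so the emoji stays whole:
const chars = [..."a😀b"]; // ["a", "😀", "b"]

// Numeric entity built from the full code point:
const entity = `&#${cp};`; // "&#128512;"
```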

Privacy-First Processing

The tool performs all encoding directly in the browser. There is no API call, no telemetry, no server-side component. The encoding functions are pure TypeScript — they take a string in and return a string out.

This design choice matters when the content being encoded is sensitive: API keys, authentication tokens, private HTML fragments, internal configuration snippets. With client-side-only processing, that data never leaves the user's machine.

The constraint also means the tool works offline once the page has loaded, adds zero server cost per operation, and scales to any number of simultaneous users without infrastructure changes.

You can verify this directly: open browser developer tools, switch to the Network tab, paste content into the input field, and confirm that no network requests are made.

Key Takeaways

  • Encode & first (or in a single pass) — encoding < before & would re-encode the & in the resulting &lt;.
  • The capture-group split pattern is the elegant solution to the double-encode problem.
  • Named entities are most readable; decimal is most universal; hex maps directly to code points.
  • [...text] and codePointAt() are the correct tools for Unicode-safe string processing in JavaScript.
  • Client-side-only architecture is not a limitation — for encoding utilities, it is the right design.

Code Cultivation • © 2026