
Fewer Characters Don't Equal Fewer Tokens

What I tried

I wrote a small harness that feeds famous paragraphs into GPT-4 three different ways:

  • Mandarin translation
  • Emoji or mixed-symbol "compression"
  • Caveman English gist

The script sits here: llm_tokenization_test.
It loops through every paragraph, sends the text with a custom system prompt, grabs the model output, then counts tokens with tiktoken.

for system in systems:
    messages = [system, {"role": "user", "content": paragraph}]
    response = completion(messages=messages, model="gpt-4")
    converted = response.choices[0].message.content  # assuming an OpenAI-style response object
    count = len(enc.encode(converted))               # token count of the rewritten text
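
For context, here is the kind of setup the loop assumes. This is a minimal sketch, not the repo's exact code: litellm's completion is just one OpenAI-compatible choice, and the system prompts shown are placeholders for whatever the harness actually uses.

from litellm import completion  # assumption: any OpenAI-compatible chat wrapper works the same way
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # cl100k_base, used only for counting tokens

# Placeholder prompts; the repo defines its own wording.
systems = [
    {"role": "system", "content": "Translate the user's paragraph into Mandarin."},
    {"role": "system", "content": "Compress the user's paragraph into emoji and symbols."},
    {"role": "system", "content": "Rewrite the user's paragraph as terse caveman English."},
]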

What I expected

Fewer visible characters should mean fewer tokens. Chinese packs a word into one or two glyphs. An emoji can hold a whole concept inside a single icon. So both should beat the caveman rewrite.

What actually happened

Style tested       Average token count vs the English original
Caveman English    65 percent
Mandarin           110 percent
Emoji mix          125 percent

Caveman won by a wide margin. Mandarin and emoji both cost more tokens than the original English.

Why characters are not tokens

1. GPT uses Byte Pair Encoding (BPE)

The tokenizer was trained mostly on English text. Common Latin fragments like tion or polit merge into single tokens, while rare Unicode symbols never get those merges: a Chinese character usually costs one or two tokens, and many emoji expand into two or more.
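
You can see the asymmetry directly with tiktoken. A quick probe of my own (the sample strings are illustrative, not from the test set):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
for sample in ["information", "politics", "政", "治", "📜", "🔥"]:
    ids = enc.encode(sample)
    print(f"{sample!r}: {len(ids)} token(s) -> {ids}")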

2. Low-frequency symbols lose compression

Emoji and CJK byte sequences show up far less often in the tokenizer's training corpus than common English fragments, so the BPE trainer never learned longer merges for them and they stay expensive.
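
tiktoken can also show the raw bytes behind each token, which makes the missing merges visible. Another small probe of my own:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
ids = enc.encode("🧬")  # a single, fairly rare emoji
print(ids)                                              # token ids it splits into
print([enc.decode_single_token_bytes(t) for t in ids])  # UTF-8 byte fragment behind each token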

3. Emoji often trigger extra text

When GPT rewrites a sentence with emoji, it tends to add spaces, modifiers, or explanatory words so the meaning is not lost. Each addition introduces extra tokens.

4. Mandarin keeps the grammar words

A faithful translation still needs 的, 是, 了, and so on. Each of those small grammar characters costs its own token, while the caveman rewrite simply deletes their English counterparts.
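
A quick way to check the cost of those particles (my own snippet, not part of the harness):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
for particle in ["的", "是", "了", "在", "和"]:
    print(particle, enc.encode(particle))  # each particle encodes to at least one token of its own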

5. Caveman drops information-light words

The caveman prompt removes articles, auxiliary verbs, relative clauses and fancy adjectives. What remains are high-value nouns and verbs that the tokenizer already packs efficiently.
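
Here is a made-up before-and-after of my own that you can count yourself; the sentences are not from the test set:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
full = ("The committee has decided that it will not be able to approve "
        "the proposal until further analysis has been completed.")
caveman = "committee no approve proposal until more analysis done"
print("full English:", len(enc.encode(full)))
print("caveman:     ", len(enc.encode(caveman)))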

Other tricks I measured

Strategy                                    Effect
ASCII abbreviations like gov intl dept      saved 10 to 20 percent
Removing quotation marks and line breaks    small win
Pure symbol encoding                        almost always larger because of token overhead
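
All of these are easy to reproduce. A small helper of my own for measuring any rewriting trick on your own text:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def compare(label, before, after):
    # Print token counts for a text before and after applying a trick.
    print(f"{label}: {len(enc.encode(before))} -> {len(enc.encode(after))} tokens")

compare("abbreviations", "government international department", "gov intl dept")
compare("strip quotes", '"Hello there," she said.', "Hello there, she said.")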

Takeaways

When you pay by the token, you want fewer subwords, not fewer characters. The safest way to cut cost is to rewrite in blunt English and strip the fluff. Chinese and emoji are visually dense but tokenizer-unfriendly.

Try the repo, swap in your own text, and watch the counter roll.

git clone https://github.com/patrickmaub/llm_tokenization_test
cd llm_tokenization_test
python run_test.py

Thanks for reading!

Written by Patrick Mauboussin on June 23, 2023