
Fewer Characters Don't Equal Fewer Tokens

What I tried

I wrote a small harness that feeds famous paragraphs into GPT-4 three different ways:

  • Mandarin translation
  • Emoji or mixed-symbol "compression"
  • Caveman English gist

The script sits here: llm_tokenization_test.
It loops through every paragraph, sends the text with a custom system prompt, grabs the model output, then counts tokens with tiktoken.

for system in systems:
    messages = [system, {"role": "user", "content": paragraph}]
    response = completion(messages=messages, model="gpt-4")
    converted = response.choices[0].message.content  # assuming an OpenAI-style response object
    count = len(enc.encode(converted))               # token count of the rewritten text
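
For context, here is the kind of setup the loop assumes. This is a minimal sketch, not the repo's exact code: litellm's completion is just one OpenAI-compatible choice, and the system prompts shown are placeholders for whatever the harness actually uses.

from litellm import completion  # assumption: any OpenAI-compatible chat wrapper works the same way
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # cl100k_base, used only for counting tokens

# Placeholder prompts; the repo defines its own wording.
systems = [
    {"role": "system", "content": "Translate the user's paragraph into Mandarin."},
    {"role": "system", "content": "Compress the user's paragraph into emoji and symbols."},
    {"role": "system", "content": "Rewrite the user's paragraph as terse caveman English."},
]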

What I expected

Fewer visible characters should mean fewer tokens. Chinese packs a word into one or two glyphs. An emoji can hold a whole concept inside a single icon. So both should beat the caveman rewrite.

What actually happened

Style tested       Average token count vs the English original
Caveman English    65 percent
Mandarin           110 percent
Emoji mix          125 percent

Caveman won by a wide margin. Mandarin and emoji both cost more tokens than the original English.

Why characters are not tokens

1. GPT uses Byte Pair Encoding (BPE)

The tokenizer was trained mostly on English text. Common Latin fragments like tion or polit merge into single tokens, while rare Unicode symbols never get those merges: a Chinese character usually costs one or two tokens, and many emoji expand into two or more.
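
You can see the asymmetry directly with tiktoken. A quick probe of my own (the sample strings are illustrative, not from the test set):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
for sample in ["information", "politics", "政", "治", "📜", "🔥"]:
    ids = enc.encode(sample)
    print(f"{sample!r}: {len(ids)} token(s) -> {ids}")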

2. Low-frequency symbols lose compression

Emoji and CJK byte sequences show up far less often in the tokenizer's training corpus than common English fragments, so the BPE trainer never learned longer merges for them and they stay expensive.
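
tiktoken can also show the raw bytes behind each token, which makes the missing merges visible. Another small probe of my own:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
ids = enc.encode("🧬")  # a single, fairly rare emoji
print(ids)                                              # token ids it splits into
print([enc.decode_single_token_bytes(t) for t in ids])  # UTF-8 byte fragment behind each token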

3. Emoji often trigger extra text

When GPT rewrites a sentence with emoji, it tends to add spaces, modifiers, or explanatory words so the meaning is not lost. Each addition introduces extra tokens.

4. Mandarin keeps the grammar words

A faithful translation still needs 的, 是, 了, and so on. Each of those small grammar characters costs its own token, while the caveman rewrite simply deletes their English counterparts.
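
A quick way to check the cost of those particles (my own snippet, not part of the harness):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
for particle in ["的", "是", "了", "在", "和"]:
    print(particle, enc.encode(particle))  # each particle encodes to at least one token of its own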

5. Caveman drops information-light words

The caveman prompt removes articles, auxiliary verbs, relative clauses and fancy adjectives. What remains are high-value nouns and verbs that the tokenizer already packs efficiently.
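
Here is a made-up before-and-after of my own that you can count yourself; the sentences are not from the test set:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
full = ("The committee has decided that it will not be able to approve "
        "the proposal until further analysis has been completed.")
caveman = "committee no approve proposal until more analysis done"
print("full English:", len(enc.encode(full)))
print("caveman:     ", len(enc.encode(caveman)))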

Other tricks I measured

Strategy                                    Effect
ASCII abbreviations like gov intl dept      saved 10 to 20 percent
Removing quotation marks and line breaks    small win
Pure symbol encoding                        almost always larger because of token overhead
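
All of these are easy to reproduce. A small helper of my own for measuring any rewriting trick on your own text:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def compare(label, before, after):
    # Print token counts for a text before and after applying a trick.
    print(f"{label}: {len(enc.encode(before))} -> {len(enc.encode(after))} tokens")

compare("abbreviations", "government international department", "gov intl dept")
compare("strip quotes", '"Hello there," she said.', "Hello there, she said.")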

Takeaways

When you pay by the token, you want fewer subwords, not fewer characters. The safest way to cut cost is to rewrite in blunt English and strip the fluff. Chinese and emoji are visually dense but tokenizer-unfriendly.

Try the repo, swap in your own text, and watch the counter roll.

git clone https://github.com/patrickmaub/llm_tokenization_test
cd llm_tokenization_test
python run_test.py

Thanks for reading!

Written by Patrick Mauboussin on June 23, 2023