Tokens vs Bytes in AI: What LLMs Actually See When You Type

Published
3 min read

You type "你好 Hello" into GPT-5. That's 8 characters, counting the space. But the model processes it as 2 tokens, and your bill is based on those tokens, not the characters.

Meanwhile, your computer stores that same text as 12 bytes.

So what's the difference between bytes, characters, and tokens? Why does AI use tokens instead of raw bytes? And why does the same sentence cost more in Chinese than in English?

What Is a Byte?

A byte is the smallest addressable unit of data your computer stores. One byte is 8 bits, which can hold a number from 0 to 255.

| Character | UTF-8 Bytes | Byte Count |
|---|---|---|
| H | 72 | 1 |
| 你 | 228, 189, 160 | 3 |
| 🚀 | 240, 159, 154, 128 | 4 |

  • English letters (ASCII): 1 byte each
  • Most CJK characters: 3 bytes each
  • Most emoji: 4 bytes each
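
The byte values above can be reproduced with Python's built-in string encoding. This is a minimal sketch; the helper name `utf8_bytes` is made up for illustration.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 encoding of text as a list of byte values (0-255)."""
    return list(text.encode("utf-8"))

# One ASCII letter, one CJK character, one emoji
for ch in ["H", "你", "🚀"]:
    values = utf8_bytes(ch)
    print(f"{ch!r}: {values} ({len(values)} byte(s))")
```

The same call works on whole strings, so `len(text.encode("utf-8"))` gives the stored size of any text.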

Four Levels: Bytes → Characters → Words → Tokens

| Level | "Hello, World" | Count | Description |
|---|---|---|---|
| Bytes | 48 65 6c 6c 6f... | 12 | Raw storage |
| Characters | H e l l o , W o r l d | 12 | Human-readable |
| Words | Hello, World | 2 | Space-separated |
| Tokens | Hello , World | 3 | AI processes these |
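
The first three levels can be counted with plain Python, as a rough sketch. The token level is left out here because it depends on which tokenizer the model uses.

```python
text = "Hello, World"

raw = text.encode("utf-8")   # bytes: what gets stored on disk
byte_count = len(raw)
char_count = len(text)       # characters: Unicode code points
words = text.split()         # words: split on whitespace

print(raw.hex(" "))          # space-separated hex dump of the bytes
print(f"{byte_count} bytes, {char_count} characters, {len(words)} words")
```

For pure ASCII text like this, the byte and character counts match. They diverge as soon as the text contains CJK characters or emoji.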

Tokens are not bytes, not characters, and not words. They're sub-word units balancing vocabulary size with sequence length.

Why Not Bytes or Words?

Bytes: sequences get too long. Transformer self-attention scales as O(n²) in sequence length, so a sequence that is 3-4x longer needs roughly 9-16x more attention compute.
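
The 9-16x figure is just the square of the length increase. A small sketch, assuming attention cost is the only term that matters (real models also have costs that grow linearly with length):

```python
def attention_cost_ratio(length_ratio: float) -> float:
    """Relative cost of O(n^2) self-attention when the sequence grows by length_ratio."""
    return length_ratio ** 2

for ratio in (3, 4):
    print(f"{ratio}x longer sequence -> {attention_cost_ratio(ratio):.0f}x attention compute")
```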

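Words: the vocabulary explodes to millions of entries, and any word outside it can't be processed (the out-of-vocabulary, or OOV, problem).

Tokens: the sweet spot. Common words stay whole, while rarer words split into reusable sub-word pieces.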

"unbelievable" → ["un", "bel", "ievable"]  (3 tokens)
"Hello"        → ["Hello"]                  (1 token)

How BPE Tokenization Works

  1. Split into characters
  2. Find most frequent adjacent pair
  3. Merge into new token
  4. Repeat thousands of times
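
The four steps above can be sketched as a toy trainer. This is a simplified illustration, not the implementation any production tokenizer uses: it works on characters of a tiny made-up word list, whereas real tokenizers typically operate on bytes and much larger corpora. The function name `train_bpe` and the sample words are assumptions for the example.

```python
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from a toy list of words."""
    # Step 1: split every word into single-character symbols.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):  # Step 4: repeat
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        # Step 3: merge the most frequent pair into one new symbol everywhere.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

merges = train_bpe(["low", "low", "lower", "lowest", "newest", "newest"], num_merges=3)
print(merges)
```

On this toy input the first merges build up "lo" and then "low", which shows how frequent character runs turn into single tokens.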
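
To see the result on real text, inspect the tokens with OpenAI's tiktoken library:

import tiktoken

# Load the o200k_base encoding
enc = tiktoken.get_encoding("o200k_base")
text = "你好 Hello"
tokens = enc.encode(text)  # list of integer token IDs
# Decode each ID separately to show the text piece it stands for
print(f"Tokens ({len(tokens)}): {[enc.decode([t]) for t in tokens]}")
# Output: Tokens (2): ['你好', ' Hello'], so 12 bytes become 2 tokens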

Different Models, Different Costs

| Text | GPT-4 tokens | GPT-5 tokens | Savings |
|---|---|---|---|
| Chinese sentence | 15 | 9 | 40% |
| Japanese sentence | 12 | 10 | 17% |

For the same content, Chinese costs ~50% more tokens than English. Japanese is the most token-efficient CJK language.
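
The Savings column is the relative drop in token count between the two models. A quick check of the arithmetic, using the counts from the table (the helper name `token_savings` is made up for illustration):

```python
def token_savings(old_tokens: int, new_tokens: int) -> float:
    """Fraction of tokens saved when going from old_tokens to new_tokens."""
    return (old_tokens - new_tokens) / old_tokens

rows = {"Chinese sentence": (15, 9), "Japanese sentence": (12, 10)}
for label, (old, new) in rows.items():
    print(f"{label}: {token_savings(old, new):.0%} fewer tokens")
```

Because billing is per token, the same percentage applies to the input cost of that text, assuming the per-token price is unchanged.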

Compare models easily

from openai import OpenAI

# Point the standard OpenAI client at Crazyrouter's OpenAI-compatible endpoint
client = OpenAI(
    api_key="your-crazyrouter-key",  # placeholder: replace with your own key
    base_url="https://crazyrouter.com/v1"
)

# Send the same prompt to each model and compare the reported token usage
for model in ["gpt-5", "deepseek-v3.2", "claude-sonnet-4"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Explain tokenization"}],
        max_tokens=100
    )
    print(f"{model}: {resp.usage.prompt_tokens} in / {resp.usage.completion_tokens} out")

With Crazyrouter, one API key gives you access to 627+ models.

Quick Reference

  • 1 English token ≈ 4 characters ≈ 0.75 words
  • 1,000 tokens ≈ 750 English words
  • Chinese costs ~50% more tokens than English
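
These rules of thumb can be turned into a rough estimator. This is a sketch for English text only, and the function names are made up for illustration. For exact counts, run the model's actual tokenizer.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text
WORDS_PER_TOKEN = 0.75   # rule of thumb for English text

def estimate_tokens_from_words(word_count: int) -> int:
    """Approximate token count for English text with word_count words."""
    return math.ceil(word_count / WORDS_PER_TOKEN)

def estimate_tokens_from_chars(char_count: int) -> int:
    """Approximate token count for English text with char_count characters."""
    return math.ceil(char_count / CHARS_PER_TOKEN)

print(estimate_tokens_from_words(750))   # about 1,000 tokens
print(estimate_tokens_from_chars(400))   # about 100 tokens
```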

Full version: Crazyrouter Blog
