Tokens vs Bytes in AI: What LLMs Actually See When You Type

You type "你好 Hello" into GPT-5. That's 8 characters: two Chinese characters, a space, and five Latin letters. But the model processes it as 2 tokens, and your bill is based on those tokens, not the characters.

Meanwhile, your computer stores that same text as 12 bytes in UTF-8.

So what's the difference between bytes, characters, and tokens? Why does AI use tokens instead of raw bytes? And why does the same sentence cost more in Chinese than in English?

What Is a Byte?

A byte is the smallest addressable unit of data your computer stores (a bit is smaller, but memory and files are read and written in bytes). One byte = 8 bits = a number from 0 to 255.

Character	UTF-8 Bytes (decimal)	Byte Count
`H`	72	1
`你`	228, 189, 160	3
`🚀`	240, 159, 154, 128	4

English (ASCII) letters: 1 byte each
Common CJK characters: 3 bytes each
Most emoji: 4 bytes each (multi-part emoji sequences use more)
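
The byte values in the table can be checked with Python's built-in `str.encode`. This is a minimal sketch; `utf8_bytes` is an illustrative helper name, not something from a library.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 encoding of text as a list of byte values (0-255)."""
    return list(text.encode("utf-8"))


for ch in ["H", "你", "🚀"]:
    values = utf8_bytes(ch)
    print(f"{ch!r}: {values} ({len(values)} bytes)")
```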

Four Levels: Bytes → Characters → Words → Tokens

Level	"Hello, World"	Count	Description
Bytes	`48 65 6c 6c 6f...`	12	Raw storage
Characters	`H e l l o , W o r l d`	12	Human-readable (the space counts as a character)
Words	`Hello, World`	2	Whitespace-separated
Tokens	`Hello` `,` `World`	3	AI processes these

Tokens are not bytes, not characters, and not words. They're sub-word units balancing vocabulary size with sequence length.
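
The first three levels can be counted with the standard library alone. The sketch below uses a hypothetical helper, `count_levels`. It leaves out the token count on purpose, because that number depends on which tokenizer you load.

```python
def count_levels(text: str) -> dict[str, int]:
    """Count UTF-8 bytes, Unicode characters, and whitespace-separated words."""
    return {
        "bytes": len(text.encode("utf-8")),
        "characters": len(text),        # Python counts Unicode code points
        "words": len(text.split()),     # split on any whitespace
    }


print(count_levels("Hello, World"))  # {'bytes': 12, 'characters': 12, 'words': 2}
print(count_levels("你好 Hello"))     # {'bytes': 12, 'characters': 8, 'words': 2}
```

The two strings occupy the same number of bytes but differ in character count, which is why neither unit predicts token count on its own.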

Why Not Bytes or Words?

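Bytes: sequences get too long. Transformer self-attention scales as O(n²) in the sequence length n, so a byte sequence 3-4x longer than the equivalent token sequence needs 9-16x more attention compute.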
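
The arithmetic behind that range is just squaring the length ratio. Here is a tiny sketch, assuming attention cost grows exactly with n² and ignoring every other part of the model; `attention_cost_ratio` is an illustrative name.

```python
def attention_cost_ratio(length_ratio: float) -> float:
    """Relative self-attention cost when the sequence is length_ratio times longer.

    Assumes cost is proportional to n**2, so the ratio is simply squared.
    """
    return length_ratio ** 2


for ratio in (3, 4):
    print(f"{ratio}x longer sequence -> {attention_cost_ratio(ratio):.0f}x attention compute")
```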

Words: the vocabulary explodes to millions of entries, and any word missing from it cannot be represented at all (the out-of-vocabulary, or OOV, problem).

Tokens: the sweet spot.

"unbelievable" → ["un", "bel", "ievable"]  (3 tokens)
"Hello"        → ["Hello"]                  (1 token)

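To see how a fixed sub-word vocabulary can still cover a word it never stored whole, here is a toy segmenter. It uses greedy longest-match lookup, which is simpler than what production BPE tokenizers do, and `TOY_VOCAB` is a made-up vocabulary chosen only to reproduce the split shown above. Real tokenizers will split these words differently.

```python
def greedy_segment(word: str, vocab: set[str]) -> list[str]:
    """Split word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character so no input is lost.
            pieces.append(word[start])
            start += 1
    return pieces


TOY_VOCAB = {"un", "bel", "ievable", "Hello"}  # hypothetical vocabulary

print(greedy_segment("unbelievable", TOY_VOCAB))  # ['un', 'bel', 'ievable']
print(greedy_segment("Hello", TOY_VOCAB))         # ['Hello']
```

The single-character fallback is what removes the OOV problem: every string can be segmented, it just costs more pieces when the vocabulary has nothing better.
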
How BPE Tokenization Works

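1. Split the training text into characters (or bytes)
2. Count adjacent pairs and find the most frequent one
3. Merge that pair into a new token
4. Repeat thousands of times, until the vocabulary reaches its target size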
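
The steps above can be sketched in a few lines of Python. This is a toy trainer under simplifying assumptions: it works on a short list of words, starts from characters rather than bytes, and skips the pre-tokenization that production tokenizers apply. The function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most common adjacent pair across all sequences, or None."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair):
    """Replace every occurrence of pair in seq with the concatenated token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    sequences = [list(word) for word in words]      # split into characters
    merges = []
    for _ in range(num_merges):                     # repeat
        pair = most_frequent_pair(sequences)        # find most frequent pair
        if pair is None:                            # nothing left to merge
            break
        sequences = [merge_pair(seq, pair) for seq in sequences]  # merge it
        merges.append(pair)
    return merges, sequences


merges, sequences = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(sequences)
```

After two merges on this tiny corpus, the frequent word "low" has become a single token while the rarer suffixes remain as separate characters, which is the behavior that keeps common words cheap.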

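import tiktoken

# Load one of tiktoken's built-in encodings by name
enc = tiktoken.get_encoding("o200k_base")
text = "你好 Hello"
tokens = enc.encode(text)  # list of integer token IDs
# Decode each ID separately to see the individual pieces
# (a token that ends mid-character would decode to a replacement character)
print(f"Tokens ({len(tokens)}): {[enc.decode([t]) for t in tokens]}")
# Reported output: Tokens (2): ['你好', ' Hello'], so 12 bytes become 2 tokens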

Different Models, Different Costs

Text	GPT-4 tokens	GPT-5 tokens	Savings
Chinese sentence	15	9	40%
Japanese sentence	12	10	17%
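
The savings column is the relative drop in token count between the two columns. Here is a short check of the table's arithmetic, using an illustrative helper named `token_savings`:

```python
def token_savings(old_tokens: int, new_tokens: int) -> float:
    """Fraction of tokens saved when going from old_tokens to new_tokens."""
    return (old_tokens - new_tokens) / old_tokens


rows = [("Chinese sentence", 15, 9), ("Japanese sentence", 12, 10)]
for label, old, new in rows:
    print(f"{label}: {token_savings(old, new):.0%} fewer tokens")
```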

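As a rule of thumb, Chinese costs about 50% more tokens than English for the same content, though the exact ratio depends on the tokenizer and the text. Japanese is described as the most token-efficient CJK language, but that ranking also varies by tokenizer.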

Compare models easily

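from openai import OpenAI

# Point the OpenAI SDK at Crazyrouter's OpenAI-compatible endpoint
client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1"
)

# Send the same prompt to each model and compare the reported token usage
for model in ["gpt-5", "deepseek-v3.2", "claude-sonnet-4"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Explain tokenization"}],
        max_tokens=100  # cap the reply so output cost stays comparable
    )
    # prompt_tokens differs per model because each one uses its own tokenizer
    print(f"{model}: {resp.usage.prompt_tokens} in / {resp.usage.completion_tokens} out")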

With Crazyrouter, one API key → 627+ models.

Quick Reference

1 English token ≈ 4 characters ≈ 0.75 words
1,000 tokens ≈ 750 English words
Chinese costs ~50% more tokens than English for the same content
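
These rules of thumb translate into a rough estimator for budgeting before you call an API. The sketch below hard-codes the ratios from this list (4 characters per token, 0.75 words per token, a 1.5x multiplier for Chinese); the function names are illustrative, and a real tokenizer is the only way to get an exact count.

```python
def estimate_english_tokens(text: str) -> dict[str, float]:
    """Estimate the token count of English text from characters and from words."""
    return {
        "from_characters": len(text) / 4,        # 1 token ~ 4 characters
        "from_words": len(text.split()) / 0.75,  # 1 token ~ 0.75 words
    }


def estimate_chinese_tokens(english_tokens: float) -> float:
    """Scale an English estimate by the ~50% overhead quoted for Chinese."""
    return english_tokens * 1.5


sample = " ".join(["word"] * 750)  # 750 English words
estimate = estimate_english_tokens(sample)
print(estimate["from_words"])                           # 1000.0
print(estimate_chinese_tokens(estimate["from_words"]))  # 1500.0
```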

Full version: Crazyrouter Blog
