Tokens vs Bytes in AI: What LLMs Actually See When You Type

You type "你好 Hello" into GPT-5. That's 8 characters, counting the space. But the model processes it as 2 tokens, and your bill is based on those tokens, not the characters.
Meanwhile, your computer stores that same text as 12 bytes.
So what's the difference between bytes, characters, and tokens? Why does AI use tokens instead of raw bytes? And why does the same sentence cost more in Chinese than in English?
What Is a Byte?
A byte is the smallest addressable unit of data your computer stores. One byte = 8 bits = a number from 0 to 255.
| Character | UTF-8 Bytes | Byte Count |
| --- | --- | --- |
| H | 72 | 1 |
| 你 | 228, 189, 160 | 3 |
| 🚀 | 240, 159, 154, 128 | 4 |
- English letters (ASCII): 1 byte each
- Common CJK characters: 3 bytes each
- Most emojis: 4 bytes each
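The byte values in the table can be reproduced with Python's built-in string encoding. This is a small sketch; the helper name `utf8_bytes` is made up for illustration.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 encoding of text as a list of byte values (0-255)."""
    return list(text.encode("utf-8"))

# One character from each row of the table above
for ch in ["H", "你", "🚀"]:
    encoded = utf8_bytes(ch)
    print(ch, encoded, len(encoded))

# The opening example: characters and bytes are counted differently
sample = "你好 Hello"
print(len(sample), "characters,", len(utf8_bytes(sample)), "bytes")
```

Note that `len()` on a Python string counts Unicode code points, so it matches the "characters" figure, while the encoded length matches the "bytes" figure.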
Four Levels: Bytes → Characters → Words → Tokens
| Level | "Hello, World" | Count | Description |
| --- | --- | --- | --- |
| Bytes | 48 65 6c 6c 6f... | 12 | Raw storage |
| Characters | H e l l o , W o r l d | 12 | Human-readable |
| Words | Hello, World | 2 | Space-separated |
| Tokens | Hello , World | 3 | AI processes these |
Tokens are not bytes, not characters, and not words. They're sub-word units balancing vocabulary size with sequence length.
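The first three levels can be checked with nothing but the standard library. The sketch below leaves out the token row on purpose, because that count depends on a tokenizer vocabulary rather than on the text alone.

```python
text = "Hello, World"

raw = text.encode("utf-8")   # bytes: what actually gets stored
chars = list(text)           # characters: one entry per Unicode code point
words = text.split()         # words: split on whitespace

print(raw.hex(" "))          # hex dump of the stored bytes
print(len(raw), "bytes,", len(chars), "characters,", len(words), "words")
```

For pure ASCII text like this, the byte and character counts agree. They diverge as soon as the text contains CJK characters or emojis.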
Why Not Bytes or Words?
Bytes: sequences get too long. Transformer attention is O(n²) in sequence length, so a sequence 3-4x longer needs roughly 9-16x more attention compute.
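The 9-16x figure is just the length ratio squared. A minimal sketch of that arithmetic follows; the function name `relative_attention_cost` is hypothetical, and it models only the quadratic attention term, not the rest of the model.

```python
def relative_attention_cost(length_ratio: float) -> float:
    """Attention compute relative to the baseline when the sequence is
    `length_ratio` times longer, assuming cost grows as O(n^2)."""
    return length_ratio ** 2

for ratio in (3, 4):
    print(f"{ratio}x longer sequence -> {relative_attention_cost(ratio):.0f}x attention compute")
```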
Words: the vocabulary explodes to millions of entries, and any word outside it can't be processed at all (the out-of-vocabulary, or OOV, problem).
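Tokens: the sweet spot. Common words stay whole, and rarer words split into reusable pieces.
"unbelievable" → ["un", "bel", "ievable"] (3 tokens; the exact split depends on the tokenizer)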
"Hello" → ["Hello"] (1 token)
How BPE Tokenization Works
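- Split the text into characters (or raw bytes, in byte-level BPE)
- Count adjacent pairs and find the most frequent one
- Merge that pair into a new token
- Repeat thousands of times, until the vocabulary reaches its target size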
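The steps above can be sketched as a toy trainer. This is a minimal illustration, not how tiktoken is implemented. The function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) and the tiny corpus are made up for this example. It starts from characters rather than bytes, and it breaks ties between equally frequent pairs by first occurrence.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = {}
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] = pairs.get((a, b), 0) + freq
    if not pairs:
        return None
    # max() returns the first maximal item, so ties go to the earliest pair seen
    return max(pairs.items(), key=lambda kv: kv[1])[0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))  # step 1: characters
    merges = []
    for _ in range(num_merges):                              # step 4: repeat
        pair = most_frequent_pair(words)                     # step 2: best pair
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)                      # step 3: merge
    return merges, words

merges, words = train_bpe("low low low lower lowest", 2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # "low" is now a single symbol; "er" and "est" are still characters
```

After two merges the frequent word "low" has become one token, while the rarer suffixes remain split. That is the vocabulary-versus-length trade-off in miniature.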
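import tiktoken
# o200k_base is the encoding tiktoken provides for GPT-4o-family models
enc = tiktoken.get_encoding("o200k_base")
text = "你好 Hello"
tokens = enc.encode(text)
# Decode each token ID on its own to see the piece of text it covers
print(f"Tokens ({len(tokens)}): {[enc.decode([t]) for t in tokens]}")
# Output: Tokens (2): ['你好', ' Hello'] (12 bytes → 2 tokens!)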
Different Models, Different Costs
| Text | GPT-4 tokens | GPT-5 tokens | Savings |
| --- | --- | --- | --- |
| Chinese sentence | 15 | 9 | 40% |
| Japanese sentence | 12 | 10 | 17% |
As a rough rule, Chinese text costs about 50% more tokens than English text with the same meaning, though the exact ratio depends on the tokenizer. Japanese is the most token-efficient CJK language.
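The Savings column in the table is the relative drop in token count between the two tokenizers. A short sketch of that calculation, using a hypothetical helper named `token_savings_percent`:

```python
def token_savings_percent(old_tokens: int, new_tokens: int) -> float:
    """Percentage reduction in token count when moving from old to new."""
    return (old_tokens - new_tokens) / old_tokens * 100

# Token counts taken from the table above
print(f"Chinese:  {token_savings_percent(15, 9):.0f}%")
print(f"Japanese: {token_savings_percent(12, 10):.0f}%")
```

Since billing is per token, the same percentage applies to the cost of sending that text, assuming the per-token price is unchanged.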
Compare models easily
from openai import OpenAI
# Point the OpenAI SDK at Crazyrouter's OpenAI-compatible endpoint
client = OpenAI(
    api_key="your-crazyrouter-key",
    base_url="https://crazyrouter.com/v1",
)
# Send the same prompt to each model and compare the reported token usage
for model in ["gpt-5", "deepseek-v3.2", "claude-sonnet-4"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Explain tokenization"}],
        max_tokens=100,
    )
    print(f"{model}: {resp.usage.prompt_tokens} in / {resp.usage.completion_tokens} out")
With Crazyrouter, one API key → 627+ models.
Quick Reference
- 1 English token ≈ 4 characters ≈ 0.75 words
- 1,000 tokens ≈ 750 English words
- Chinese costs ~50% more tokens than English
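These rules of thumb are easy to turn into a quick estimator for English text. The sketch below is a rough heuristic only: the constants come from the list above, the function names are made up, and real counts should come from the model's tokenizer.

```python
import math

CHARS_PER_TOKEN = 4      # 1 English token ≈ 4 characters
WORDS_PER_TOKEN = 0.75   # 1 English token ≈ 0.75 words

def estimate_tokens_from_chars(char_count: int) -> int:
    """Rough English token estimate from a character count."""
    return math.ceil(char_count / CHARS_PER_TOKEN)

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough English token estimate from a word count."""
    return math.ceil(word_count / WORDS_PER_TOKEN)

print(estimate_tokens_from_words(750))                  # 1000
print(estimate_tokens_from_chars(len("Hello, World")))  # 3
```

For "Hello, World" the estimate happens to match the 3 tokens shown earlier. For short or unusual strings the heuristic can be off by a wide margin.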
Full version: Crazyrouter Blog