Glossary
Tokenization
Also known as: Tokens, BPE
Definition
In language AI, tokenization is the preprocessing step that breaks input text into tokens. Modern LLMs use subword schemes such as Byte-Pair Encoding (BPE) or SentencePiece, which handle out-of-vocabulary words gracefully by falling back to smaller subword units. In English, one token corresponds to roughly three to four characters; German and non-Latin languages often produce more tokens per word. Billing and context-window limits of LLM APIs are typically measured in tokens, and token count also drives latency, since models generate output one token at a time.
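The core idea behind BPE can be sketched in a few lines: start from individual characters and repeatedly merge the most frequent adjacent pair into a new token. A minimal, illustrative sketch (not a production tokenizer, and independent of any particular LLM's vocabulary):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(text, num_merges):
    """Start from single characters and apply num_merges BPE merge steps."""
    tokens = list(text)
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        tokens = merge_pair(tokens, most_frequent_pair(tokens))
    return tokens
```

For example, three merge steps on the classic string "aaabdaaabac" first merge "a"+"a", then "aa"+"a", then "aaa"+"b", yielding the tokens ["aaab", "d", "aaab", "a", "c"]. Real tokenizers learn these merges once from a large corpus and then reuse the fixed merge table at inference time.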
How Swiss Knowledge Hub uses this term
Swiss Knowledge Hub surfaces per-chat and per-workspace token usage in the admin UI, enabling cost transparency and quota management regardless of which LLM or BYOK configuration is in use.
Related terms
Sources
- Wikipedia: Byte pair encoding — https://en.wikipedia.org/wiki/Byte_pair_encoding
- OpenAI — What are tokens and how to count them — https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
Last updated: April 22, 2026