What would happen if we encountered an alien species that only understood numbers? How would we communicate with them? Each of our words would first need to be translated into mathematical code before being processed by our interlocutors. This metaphor perfectly illustrates the challenge facing artificial intelligence systems today. The solution to this complex situation lies in a fundamental process: tokenisation.
When we interact with ChatGPT, ask questions to a virtual assistant, or use a conversational AI tool, we witness the result of a complex translation process that converts human language into machine-readable numerical sequences and vice versa. At the heart of this transformation are tokens: the elementary building blocks that bridge human communication and artificial intelligence.
Lost in Translation
The first confusion that needs clearing up concerns the very nature of tokens. Contrary to intuition, tokens and words are not synonymous. Tokenisation constitutes an encoding process specific to AI, as neural networks only function with numbers, never with words. This numerical transformation is therefore indispensable for enabling machines to process human language.
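To make this concrete, here is a minimal sketch using the open-source tiktoken library (one of several publicly available tokenisers); the exact IDs are tokeniser-specific and purely illustrative.

```python
# Minimal illustration: text in, numbers out (requires `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one example of a public tokeniser

text = "The cat sat on the mat."
token_ids = enc.encode(text)   # the numerical sequence the model actually processes
print(token_ids)
print(enc.decode(token_ids))   # decoding turns the numbers back into text
print(len(text.split()), "words ->", len(token_ids), "tokens")
```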
This conversion begins during the training phase of AI models, when enormous volumes of text are processed by statistical algorithms. In practice, teams often underestimate this step, even though it determines the final quality of the system. The process involves cleaning the text (removing punctuation, normalising), then running algorithms that count character groups and retain the most frequent ones.
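For readers curious about the mechanics, the sketch below shows a single, deliberately simplified merge step in the spirit of byte-pair encoding (BPE), the family of algorithms most modern tokenisers build on; real systems repeat this over billions of words until the vocabulary reaches its target size.

```python
from collections import Counter

# Toy corpus: each "word" is a tuple of symbols with its frequency in the data.
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs, weighted by how often each word appears."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

best = most_frequent_pair(corpus)   # e.g. ('e', 's'): frequent pairs become new vocabulary entries
corpus = merge_pair(corpus, best)
print(best, corpus)
```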
Building the Machine's Lexicon
This statistical analysis generates what we call a "token dictionary": a repository that associates each linguistic element (a character, a group of characters, or an entire word) with a unique numerical identifier. Imagine a gigantic correspondence table where "the" might become "1", "cat" become "247", and "intelligence" become "15892". Such a dictionary typically contains between 50,000 and 100,000 entries, each representing a language fragment that the model can recognise and manipulate.
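As a purely hypothetical fragment of such a table (the identifiers below are invented for illustration, not taken from any real model):

```python
# Invented excerpt of a token dictionary; real ones hold tens of thousands of entries.
vocab = {"the": 1, "cat": 247, "intellig": 15890, "ence": 310, "intelligence": 15892}
reverse_vocab = {token_id: piece for piece, token_id in vocab.items()}

print(vocab["cat"])          # 247: the number the model actually manipulates
print(reverse_vocab[15892])  # "intelligence": decoding works in the other direction
```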
The effectiveness of this approach depends entirely on the representativeness of the training corpus. The more frequently a word or sequence appears in the training data, the more likely it is to obtain a specific token and be correctly understood by the system. Conversely, rare terms risk being fragmented into smaller units, thus losing part of their contextual meaning.
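The fragmentation effect is easy to observe with the same tokeniser used earlier; the second word is deliberately invented so that it cannot appear in any training corpus.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["information", "xylophonistically"]:   # a common word vs an invented rarity
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")
# The common word typically maps to a single token; the rare one is broken into several fragments.
```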
All You Need is… English!
This statistical approach reveals a major imbalance in the current AI ecosystem. The predominance of English in training corpora (approximately 90% of the data) produces a dictionary optimised for that language at the expense of others. This imbalance is far from harmless: it reflects the cultural and economic dominance of English-speaking countries in digital content production.
The practical consequences of this bias show up immediately in usage costs. French, for example, requires on average around 20% more tokens than English to express the same idea. The difference comes from less efficient segmentation: where English has a single token for a common word, French sees the same concept cut into several fragments, multiplying processing costs.
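The imbalance is straightforward to measure on any pair of translated sentences; the examples below are illustrative, and the exact ratio varies by tokeniser and by text, so this is not a verification of the 20% average.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

sentences = {
    "en": "Artificial intelligence is transforming how companies process information.",
    "fr": "L'intelligence artificielle transforme la façon dont les entreprises traitent l'information.",
}

counts = {lang: len(enc.encode(text)) for lang, text in sentences.items()}
print(counts)  # the French sentence generally costs more tokens than its English equivalent
```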
Beyond the financial aspect, this asymmetry introduces cultural and linguistic biases into generated responses. Models naturally favour English-language phrasing and references, flattening the richness of expression in French and other languages.
When AI Meets Corporate Gobbledygook
Adapting to enterprise terminology is a challenge in its own right. Token dictionaries, built from public corpora, ignore specialised vocabularies: internal acronyms, proprietary application names, sector-specific jargon. This gap leads to systematic errors.
Take the example of an enterprise application named "FINANCE-HR-2024". A standard AI system will fragment this designation into multiple tokens: "FIN", "ANCE", "-H", "R", "-", "202", "4", meaning at least seven distinct elements. This fragmentation dilutes meaning and multiplies processing costs, whilst increasing the risk of hallucinations or erroneous interpretations.
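The effect can be checked empirically; the exact split depends on the tokeniser, so the fragments printed below will not necessarily match the seven elements listed above.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

name = "FINANCE-HR-2024"
ids = enc.encode(name)
print(len(ids), "tokens:", [enc.decode([i]) for i in ids])  # one identifier, many billed tokens
```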
To address this limitation, we develop fine-tuning strategies that enrich the base vocabulary with client terminology. This approach significantly improves response accuracy whilst optimising operational costs.
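The full fine-tuning pipeline is beyond the scope of this article, but the vocabulary-extension step can be sketched with the Hugging Face transformers library; the model name and domain terms below are placeholders, and the newly added embeddings still have to be trained on client data before they become useful.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and domain terms; substitute the client's real stack and vocabulary.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

domain_terms = ["FINANCE-HR-2024", "SAP-PAYROLL", "GDPR-DSAR"]
added = tokenizer.add_tokens(domain_terms)       # register the new vocabulary entries
model.resize_token_embeddings(len(tokenizer))    # create embedding slots for them

print(f"Added {added} tokens:", tokenizer.tokenize("Open a ticket in FINANCE-HR-2024"))
# These embeddings start out randomly initialised: fine-tuning on domain text is still required.
```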
Memory Lane Has Speed Limits
Understanding token limits becomes crucial during large-scale AI implementations. Each model has a "context window"—a maximum limit on the number of tokens it can process simultaneously.
Exceeding this limit either causes the request to be rejected outright or leads to silent truncation of the input, with no explicit warning.
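A simple guard before each request avoids silent truncation; the limits below are placeholders to replace with the documented values of the model actually used.

```python
import tiktoken

CONTEXT_WINDOW = 8_192        # placeholder: the documented limit of your model
RESERVED_FOR_OUTPUT = 1_000   # leave headroom for the model's answer

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str) -> bool:
    """Return True if the prompt leaves enough room for the expected response."""
    return len(enc.encode(prompt)) <= CONTEXT_WINDOW - RESERVED_FOR_OUTPUT

prompt = "..."  # the full prompt assembled from documents, history and instructions
if not fits_in_context(prompt):
    print("Prompt too long: split, summarise or trim before sending.")
```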
More perniciously, research has revealed a phenomenon known as "lost in the middle": models give more weight to information located at the beginning and end of the context, and relatively neglect the content in the middle. Performance follows a U-shaped curve, which invalidates the naïve approach of "feeding everything to the model for summarisation".
Faced with this constraint, we have developed workaround methodologies: map-reduce techniques (summarising block by block, then synthesising the summaries), iterative approaches (progressive summarisation that retains the essential elements), and intelligent segmentation of content according to its contextual importance.
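A schematic version of the map-reduce variant is sketched below; `call_llm` is a hypothetical placeholder for whatever completion API is actually in use, and the fixed-size chunking is a simplification of the smarter segmentation described above.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around the completion API of your choice."""
    raise NotImplementedError  # e.g. a hosted or local model call

def chunk(text: str, max_chars: int = 8_000) -> list[str]:
    """Naive fixed-size segmentation; production code would cut on semantic boundaries."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def map_reduce_summary(document: str) -> str:
    # Map: summarise each block independently so no single call overflows the context window.
    partial = [call_llm(f"Summarise this passage:\n{block}") for block in chunk(document)]
    # Reduce: synthesise the partial summaries into a single final summary.
    return call_llm("Combine these summaries into one coherent summary:\n" + "\n".join(partial))
```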
The Executive's Survival Guide
For companies seeking to implement conversational AI, we can formulate several essential recommendations:
- Conceptual Mastery: The distinction between words and tokens underpins any sound AI strategy. This understanding must permeate technical teams and business decision-makers alike, as it directly impacts budgets and performance.
- Linguistic Awareness: The processing asymmetry between languages calls for adjusted budget planning. Multilingual organisations must factor these additional costs into their economic models and consider targeted fine-tuning strategies.
- Prompt Optimisation: The temptation to maximise context window usage proves counterproductive. A structured, segmented, and progressive approach generates better results than throwing everything at the wall to see what sticks.
- Operational Monitoring: Implementing token monitoring tools, on both input and output, enables fine-grained cost and performance management; a minimal logging sketch follows this list. This technical supervision must become second nature.
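As an illustration of that last point, the helper below appends one line of token accounting per request; the `usage` attribute and its field names follow the convention of the OpenAI Python SDK and may differ with other providers.

```python
import csv
import datetime

def log_token_usage(usage, model: str, path: str = "token_usage.csv") -> None:
    """Append one row of token accounting per request, ready for a cost dashboard."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.datetime.now(datetime.timezone.utc).isoformat(),
            model,
            usage.prompt_tokens,       # input tokens billed for this request
            usage.completion_tokens,   # output tokens billed for this request
            usage.total_tokens,
        ])

# Typical call site (responses from the OpenAI SDK expose a `usage` attribute):
# response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
# log_token_usage(response.usage, model="gpt-4o-mini")
```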
The Next Chapter: Beyond Tokens
The tokenisation ecosystem is evolving towards greater sophistication. Emerging approaches explore character-level tokenisation, multimodal representations (text, image, and audio in a unified space), and architectures that partially break free from traditional token constraints.
However, these innovations remain the preserve of major research laboratories for now. For organisations deploying AI, mastering the current mechanisms remains the priority: understanding tokenisation means cracking the code of contemporary artificial intelligence's economic and technical foundations.