Yet, language is highly context-based and nuanced. Think about how a word’s use changes over time, or how the use of irony or a play on words changes meaning. Think of the inside joke that only you and your best friend “get” because you had to be there. Our lives are filled with examples of how language relies on context or shared social understanding. Language is the opposite of data.
From Tolkien to Token:
“Not all those who wander are lost.” – JRR Tolkien
Natural language processing involves turning language into formats a machine can understand (numbers) before turning it back into our desired human output (text, code, etc.). One of the first steps in the process of “datafying” language is to break it down into tokens. A token is typically a single word, at least in English – more on that in a minute.
For example, our Tolkien sentence would tokenize as:
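As a rough sketch, here is a simple word-level tokenizer in Python. Note that this is an illustration only: production tokenizers (such as the byte-pair-encoding tokenizers used by large language models) split text into subword units, so their output for the same sentence may differ.

```python
import re

def simple_tokenize(text):
    # Grab runs of word characters, and keep punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Not all those who wander are lost."))
# → ['Not', 'all', 'those', 'who', 'wander', 'are', 'lost', '.']
```

Here each word becomes one token, and the final period is a token of its own.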
There are various tools for tokenization, and they may break longer words down slightly differently.
Tokens are important because they drive not only a model’s performance but also training costs. AI companies charge developers by the token. English tends to be the most token-efficient language, making it economically advantageous to train on English-language “data” versus, say, Burmese. This blog post by data scientist Yennie Jun explains how the process works in a very accessible way, and this tool she built lets you select different languages along with different tokenizers to see exactly how many tokens each selected language requires.