3476, 477, 12274, 112838, 248
Introduction

When working with Large Language Models, we often focus on their remarkable capabilities, from writing code to explaining complex concepts. However, there’s a crucial component that can significantly impact their behavior and performance: tokenization 🍣. As highlighted in recent work by Garreth Lee and the Hugging Face team 🤗 [1], even state-of-the-art models can stumble on seemingly simple tasks due to tokenization choices. For instance, many models struggle with the basic question “Which is bigger?...