Each word is represented not just as a string, but as a vector (like a point in a 10,000-dimensional space)
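To make this concrete, here is a minimal sketch of the idea: each word (token) maps to a row of numbers in an embedding table. The toy vocabulary, the tiny dimension, and the random table below are illustrative assumptions, not any particular model's actual embeddings.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary
embedding_dim = 8                            # real models use thousands of dimensions
rng = np.random.default_rng(0)

# One row per word: in a real model this table is learned during training.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))
word_to_id = {w: i for i, w in enumerate(vocab)}

def embed(word: str) -> np.ndarray:
    """Look up the vector (a point in embedding space) for a word."""
    return embedding_table[word_to_id[word]]

print(embed("cat"))   # an 8-dimensional vector of floats
```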
The “curve” the model is drawing now lives in this high-dimensional space. It’s no longer “x-axis versus y-axis”, but rather:
“10,000 dimensions of word meaning versus the probability of the next word”
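A rough sketch of what that mapping looks like: a function that takes a point in the high-dimensional space of word meaning and returns a probability for every possible next word. The projection matrix and softmax below are a common way to express this, used here purely as an illustration.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

vocab = ["the", "cat", "sat", "on", "mat"]
embedding_dim = 8
rng = np.random.default_rng(1)

# Learned weights that project a point in meaning-space onto the vocabulary.
output_weights = rng.normal(size=(embedding_dim, len(vocab)))

# A context vector: some point in the high-dimensional space of word meaning.
context = rng.normal(size=embedding_dim)

# Many dimensions of meaning in, one probability per candidate next word out.
next_word_probs = softmax(context @ output_weights)
for word, p in zip(vocab, next_word_probs):
    print(f"{word}: {p:.3f}")
```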
Training an LLM means showing the neural network billions of documents - text, video, images, and so on - to arrive at a set of billions of weights that assign a very high probability to the correct next token (or set of tokens). This works very well for language because the approach captures: