How does ChatGPT work?

"Any sufficiently advanced technology is indistinguishable from magic." – Arthur C. Clarke's third law

For all of us, ChatGPT seems a bit like magic: it's hard for us to really grasp how it can go from a prompt to a reply.

I think part of the reason ChatGPT is hard to understand is that we lack the foundational concepts to build upon when reasoning about it. Like a kid left behind in math class, we stumble by, building confusion on top of confusion.

So here's my explanation of how ChatGPT works by explaining its underlying concepts.

Please don't tell me it's predicting the next word

I hate the "it works by predicting the next word" explanation.

Saying ChatGPT predicts the next word is of no help in understanding how it works. The example of completing "to be or not to .." with "be" just confused me further. How would a computer predict that? And how does it go from that to writing poetry about my tax returns?

Like the famous chess quote, "I can only see one move ahead.. but it's always the best one!", saying ChatGPT works by predicting the next word is like explaining how to play chess by saying "Just predict the best next move!" – technically right, but useless as an explanation.

The other variation of explanation is then talking about large language models, tokens, weights, transformers, and context windows – as if any of that made any sense to us.

Talking about those LLM implementation details to explain ChatGPT confuses explaining how something works with explaining how something is implemented.

It's like saying "A computer's CPU has registers that store binary numbers. CPU operations change these registers again, and again, and again, writing and reading those binary numbers from memory – and so the computer works!" That gives you a glimpse of how a computer is implemented, but not of how you can watch an infinite stream of cat videos on it.

So, how does ChatGPT work? Let's start by understanding how to make very simple guesses.

Guessing a house price given its size

A common, intuitive example in introductory machine learning is guessing a house's price given its size.

Let's say you have two houses for sale in Austin:

Size          Price
1,500 sq ft   $350,000
2,500 sq ft   $600,000

Looking at these two examples, you can find a correlation between size and price: each square foot costs roughly $230-$240 ($233 per square foot for the smaller house, $240 for the bigger one – apparently a bit more the bigger the house).

A simple machine learning model can help you make these kinds of guesses. Give it enough examples of existing house prices – what we call "training the model" – and eventually you can ask it questions:

"How much for a 1,500 sq ft house?" – "About $350,000 I think."

"How about for a 2,500 sq ft house?" – "About $600,000 or so."

Of course, it's seen houses of those exact sizes before, so it's "more confident" in those guesses. But even for sizes it hasn't seen, given enough training, it can make pretty educated guesses:

"What about a 2,000 sq ft house?" – "About $470,000, give or take."

That's what machine learning does: it makes educated guesses about things it hasn't seen before, based on the data we've trained it on – such as guessing house prices based on size.
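If you're curious what that looks like concretely, here's a minimal sketch in Python – assuming numpy as the only dependency – that "trains" on the two example houses above by fitting a straight line, then guesses prices for new sizes. A real model would learn from many more examples:

```python
# Minimal sketch: fit price = slope * size + intercept through the two
# example houses, then "ask" the model about sizes it hasn't seen.
import numpy as np

sizes = np.array([1_500.0, 2_500.0])        # sq ft
prices = np.array([350_000.0, 600_000.0])   # USD

# Least-squares fit of a straight line (degree-1 polynomial).
slope, intercept = np.polyfit(sizes, prices, deg=1)

def guess_price(size_sqft: float) -> float:
    """An educated guess for a house of the given size."""
    return slope * size_sqft + intercept

print(guess_price(1_500))  # ~350,000 -- a size it's seen before
print(guess_price(2_000))  # ~475,000 -- "about $470,000, give or take"
```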

But what if there are more things influencing the price of a house than its size?

And what if there are more houses for sale than just 2 or a handful?

Using WAY more data to guess results

When people say deep learning, they're really talking about an implementation detail.

Deep learning is no different from machine learning – a way to make educated guesses. It's just a better way of making guesses when there's way more data involved.

Let's say that besides size, many other factors play into the price of a house: city, neighborhood, size of construction, number of bathrooms, year of construction, last renovation, sales of adjacent houses, time of sale, etc. Figuring out which of these influence the price of the house, and by how much, gets complicated for us very quickly.

Deep learning is a breakthrough technique that turned problems several orders of magnitude harder than that into tractable ones, with more accurate guesses. By organizing our machine learning models in a particular way, we could increase the sophistication of the model, the number of parameters, and the amount of data our models could ingest, creating better educated guesses despite the overwhelming amount of data.
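To make "organizing the model in a particular way" slightly more concrete, here's a toy sketch of a two-layer neural network taking in many house features at once. Everything in it – the feature list, the layer sizes, the random weights – is invented for illustration; in a real model, training on many past sales is what would set the weights:

```python
# Toy sketch of a deep(er) model: many features in, one price guess out.
# The weights are random stand-ins; training would tune them from sales data.
import numpy as np

rng = np.random.default_rng(seed=0)

# One house described by several of the factors mentioned above:
# [size in sq ft, bathrooms, year built, year renovated, nearby sale price]
features = np.array([2_000.0, 2.0, 1995.0, 2015.0, 450_000.0])

W1 = rng.normal(size=(5, 16))  # layer 1: 5 features -> 16 hidden units
W2 = rng.normal(size=(16, 1))  # layer 2: 16 hidden units -> 1 price

hidden = np.maximum(0.0, features @ W1)  # ReLU nonlinearity between layers
price_guess = hidden @ W2                # meaningless until trained

print(price_guess)
```

The point isn't the specific numbers – it's that stacking layers like this lets the model untangle how many interacting factors influence a guess, which a single straight line never could.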

The most common example used to show the superiority of deep learning compared to prior techniques is computer vision. In computer vision, the amount of data processed is (relatively) massive, and making sense of it to create accurate guesses can be much harder than house size and prices.

AlexNet was built by the legendary researcher Geoffrey Hinton and two of his students, Alex Krizhevsky and Ilya Sutskever, at the University of Toronto. They entered the ImageNet Large Scale Visual Recognition Challenge, an annual competition designed by the Stanford professor Fei-Fei Li to focus the field’s efforts around a simple goal: identifying the primary object in an image. Each year competing teams would test their best models against one another, often beating the previous year’s submissions by no more than a single percentage point in accuracy.

In 2012, AlexNet beat the previous winner by 10 percent. It may sound like a small improvement, but to AI researchers this kind of leap forward can make the difference between a toylike research demo and a breakthrough on the cusp of enormous real-world impact. [..]

Thanks to deep learning, computer vision is now everywhere, working so well it can classify dynamic real-world street scenes with visual input equivalent to twenty-one full-HD screens, or about 2.5 billion pixels per second, accurately enough to weave an SUV through busy city streets.

Suleyman, Mustafa. The Coming Wave (pp. 80-81). Crown. Kindle Edition.

With deep learning, we could now create more accurate guesses on problems with data that was intractable before, such as how likely is someone to click an ad based on trillions of previous searches and visits, or how likely is someone to watch a movie given billions of movie watching histories and ratings.

Just like you can "easily" guess a house price given its size once you know the rough cost per square foot and its variation, deep learning models can "easily" make accurate guesses about many future things – provided you feed them enough of the "things" that happened before, no matter how complicated or how big those "things" are.

So deep learning revolutionized our ability to deal with tons of data in order to make better guesses, which came in handy because the internet age allowed us to capture tons of data.

But how do we go from making better guesses about "things" to generating words?

Guessing how words relate to each other through analogies

One way machine learning models can work when processing words, such as "man" and "woman," instead of numbers like "1,500 sq ft" and "$350,000" is by drawing analogies.

That's right! Analogies!

These models use something like analogies to find the relationships between words when making guesses.

For example, suppose that instead of training a model on the sizes and prices of houses, you trained it on a fairy tale that went like this:

"The woman finally married the man, and they had a daughter. The man was unhappy with the king, who was married to the queen. The king and queen had a daughter, who was a princess, and the princess was looking for a prince to marry."

This text shows a number of relationships: that man and woman get married, that king and queen can get married, and that princess and prince can also get married.

Once your model starts picking up these relationships, you can ask it to make guesses about words, which it will do through analogies – not unlike the link between the size and price of houses.

"If the man is a king, then the woman is a what?" "I guess the woman is a queen."

"What if a woman is a princess, what is the man?" "Then the man is a prince"

So your model sees the connections between words when the verb "married" links them – in this case, a gender relationship.
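This analogy trick is easier to see with the classic word-vector arithmetic. Here's a toy sketch with made-up 2-D vectors – real models learn vectors with hundreds of dimensions from enormous amounts of text, but the spirit is the same:

```python
# Toy sketch of word analogies as vector arithmetic. The 2-D coordinates are
# invented so the math is visible; real embeddings are learned, not hand-set.
import numpy as np

vectors = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([5.0, 0.0]),
    "queen": np.array([5.0, 1.0]),
}

# "If the man is a king, then the woman is a ... ?"
target = vectors["king"] - vectors["man"] + vectors["woman"]

# Answer with the known word closest to the computed point.
answer = min(vectors, key=lambda word: np.linalg.norm(vectors[word] - target))
print(answer)  # queen
```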

Once your model is sophisticated enough, given enough words to find relationships around, it starts learning a lot of ways in which words interact: grammar, length of sentences, synonyms and antonyms, that different tenses exist, etc.

Like the breakthrough of going from having a house's size and guessing its price to having trillions of searches to guessing if you'll click an ad, deep learning helped us make big breakthroughs in the sophistication of guesses about how words relate to each other.

One common use case for these relationships between words was translation. When you used Google Translate to translate "The king and the queen" to Spanish, it would find the "analogy" to those English words in the Spanish language – the words whose relationships look similar – and come up with "El rey y la reina."

Google Translate works the same way as the fairy tale example, by drawing an analogy – between [man and woman] and [king and queen] in the fairy tale, and between [all of English] and [all of Spanish] in translation.
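One toy way to picture that: imagine English and Spanish words living in the same "relationship space," so translating a word is just finding its nearest neighbor on the other side. The coordinates below are invented purely for illustration; real systems learn these alignments from staggering amounts of text:

```python
# Toy sketch of translation as nearest-neighbor lookup in a shared space.
# All coordinates are hand-invented for illustration.
import numpy as np

english = {"king": np.array([5.0, 0.0]), "queen": np.array([5.0, 1.0])}
spanish = {"rey":  np.array([5.1, 0.1]), "reina": np.array([4.9, 1.1])}

def translate(word: str) -> str:
    """Return the Spanish word whose vector is closest to the English one."""
    vec = english[word]
    return min(spanish, key=lambda w: np.linalg.norm(spanish[w] - vec))

print(translate("king"), translate("queen"))  # rey reina
```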

We're getting close.

Now, how do we go from being able to translate texts from English to Spanish by using word analogies and their connections to magically responding coherently to questions?

God, my smartphone's autocomplete sucks

I have the ultimate first-world problem: I don't use my phone except on Saturday mornings, I prevent it from capturing any personal data, and I have it configured for three languages.

So my autocomplete sucks. It seems to never know what I want to say.

Autocomplete knows language: it knows grammar, knows how words relate to each other, knows what "makes sense" and what doesn't. It does that through machine learning, just like our examples of the king and queen and of Google Translate.

What my phone's autocomplete doesn't know is me: Dui. It has no data about what I've said, like to say, don't like to say, or who I say it to – nothing. So it makes really bad guesses about my next word when I'm typing – like a model that's never learned about king and queen and prince and princess being asked those questions.

Now, let's suppose I did the opposite: I trained my own personal word model on all of the words I've ever produced – this blog, all my phone texts, all my emails, everything. With that personal model, my phone's autocomplete would become much better at guessing the word I plan to type next.

With no training, if I say "yo" it would guess "tengo," but with this ultimate training it may guess "sup?", for example.

Maybe now you see the connection: once your autocomplete knows grammar, and is trained on relevant information, it starts becoming better at guessing the next word you would say by finding these word connections and comparing them to the texts it's learned from; it's a more effective autocomplete.

Now, what happens if I get this model that's trained on all of my data, and knows all of my words and relationships between them, and instead of asking it to autocomplete the next word I'm typing, I ask it to autocomplete all my words?

Think of replying to a text by just tapping the suggested word in the middle position at the top of my phone's keyboard, over and over: it'll produce gibberish if the model has no data on me, and something much more accurate if it's seen every text I've ever written in my life.

So eventually, if I had such a personal model available to me, instead of directly replying to my texts (except for my wife's, of course), I could just tell this model: "Actually, go ahead and guess how I'd reply to this" and it would, making a more educated guess the more it knew about how I write based on what I've written before.

Word by word.
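Here's a toy sketch of that "keep tapping the suggestion" loop: a tiny bigram model that always picks the word it has most often seen follow the current one. The training "corpus" is invented – imagine it's everything I've ever written. A real LLM is incomparably more sophisticated, but the word-by-word loop is the same idea:

```python
# Toy sketch: "autocomplete every word" with a bigram model. For each word,
# count which words followed it; to reply, repeatedly take the top guess.
from collections import Counter, defaultdict

# Stand-in for "everything I've ever written" -- invented for this example.
corpus = ("thanks for the note I will reply soon "
          "thanks for the invite I will be there").split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def autocomplete(word, length=5):
    """Tap the top suggestion over and over, starting from `word`."""
    reply = [word]
    for _ in range(length):
        if word not in follows:
            break
        word = follows[word].most_common(1)[0][0]  # the single best guess
        reply.append(word)
    return " ".join(reply)

print(autocomplete("thanks"))  # -> "thanks for the note I will"
```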

What large language models like the one behind ChatGPT gained from the technological breakthrough of the transformer architecture was the ability to create this kind of personal model – except that instead of holding only all of my personal information, equivalent to a handful of books, it holds a mind-boggling amount of information, equivalent to tens or hundreds of millions of books.

And so it can make really, really educated guesses about what one would say in each situation.

In a presentation, Blaise Agüera y Arcas, a Google Research VP and Fellow, discussed the interesting interplay between next-word prediction and AI, musing about how we thought we'd only make autocomplete work really well by solving AI – and then we seemed to solve the opposite problem: we solved AI by building a really great autocomplete.

Next word prediction is “AI complete” → we will need to solve AI to do next word prediction ... and we don’t know how to “program” AI

Use a giant model to solve next word prediction by brute force → it works ... and appears to solve AI?

Blaise Agüera y Arcas, Reassessing Intelligence – Insights from Large Language Models and the Quest for General AI

What's hard for us to visualize is what kinds of analogies a model like the [man and woman] and [king and queen] one could make if it had hundreds of millions of books inside of it instead of just a few words, and it knew the connections between all of those words.

Instead of guessing "man and woman, king and .. queen," with that amount of information about how words connect and relate to each other, it can guess pretty much everything – like a grandmaster chess player that can guess the next best move given all the games they've studied and played before.

In summary: It's predicting the next word

So that's how next word prediction works to make ChatGPT write poetry about my tax returns:

  1. Machine learning models can make simple guesses, like guessing the price of a house given its size
  2. With deep learning, models can make guesses orders of magnitude harder, based on much more data – like whether someone will click an ad given trillions of search histories
  3. With words instead of numbers, models can draw analogies from the connections between words they've seen before, like guessing that queen is to king as woman is to man – the same way they guessed house prices from the connection between price and size
  4. Once word models are sophisticated enough and connect many words, they can make much more sophisticated guesses, like guessing the Spanish translation to an English sentence based on the connections of English words and their Spanish counterparts
  5. This same word-model sophistication is what autocompletes our words when we type on our phones, and the more a model knows about us and what we've said before, the better it can guess the next word we want to type
  6. If we asked this autocomplete model to autocomplete every word for us, one after the other, it'd generate a reply based on what it thinks we'd say given what it knows about us
  7. Create a model orders of magnitude bigger than that personal model, with data equivalent to hundreds of millions of books instead of just our personal data, and it can make some really educated guesses about how to reply to virtually anything

I know that I said I hate the "it works by predicting the next word" explanation, but in the end, that is how it works.

And with this guide, next time someone tells you that ChatGPT works by predicting the next word, you'll know what that means.