405.「If we could use sufficiently long n-grams, we might “get a ChatGPT” with “correct overall probabilities.” However, there isn’t enough English text to deduce these probabilities accurately. The solution is to build a model to estimate sequence probabilities, even if they haven’t appeared in our text corpus.」
Book title: What Is ChatGPT Doing … and Why Does It Work?
Details: Stephen Wolfram / 2023 / Wolfram Research, Inc.
Douban rating: 8.2/10
Douban link: https://book.douban.com/subject/36325029/
Finished reading: 2024-10-28 13:07:51
My rating: 4.0/5.0
My tags: WeChat Reading, #2024
Disclaimer: The notes on this page are shared only as excerpts, summaries, and reflections from my reading. Most of the content is quoted directly from the book or is a brief distillation of its arguments, and it does not represent my own position, opinions, or values. The views in the book are offered for reference only; for a fuller picture, please consult the original text.
Reading notes:
《What Is ChatGPT Doing … And Why Does It Work?》
Stephen Wolfram
ChatGPT is based on the concept of neural nets—originally invented in the 1940s as an idealization of how brains operate. I first programmed a neural net in 1983, and at the time, it didn’t do anything interesting. But 40 years later, with computers that are effectively a million times faster, billions of pages of text available on the web, and a series of engineering innovations, the situation is quite different. To everyone’s surprise, a neural net that is now a billion times larger than the one I used in 1983 is capable of doing what was once thought to be uniquely human: generating meaningful human language.
===============
The remarkable thing is that when ChatGPT writes something like an essay, it’s essentially just asking over and over again, “Given the text so far, what should the next word be?”—and each time it adds a word. (More precisely, as I’ll explain, it’s adding a “token,” which could be just part of a word. This is why it can sometimes “make up new words.”)
At each step, ChatGPT generates a list of words with associated probabilities. But which word should it actually pick to add to the essay (or whatever it’s writing)? One might think it should always pick the “highest-ranked” word (i.e., the one with the highest probability). However, this is where some unexpected behavior—or a bit of “voodoo”—comes into play. For reasons we don’t yet fully understand (and may one day explain scientifically), if ChatGPT always selects the highest-ranked word, it often produces a “flat” essay that lacks “creativity” and sometimes even repeats words verbatim. By occasionally (and randomly) choosing lower-ranked words, it tends to create a “more interesting” essay.
Because of this randomness, if we use the same prompt multiple times, we’re likely to get different essays each time. In keeping with the “voodoo” concept, there’s a specific parameter called “temperature” that determines how often lower-ranked words will be chosen. For essay generation, a “temperature” of 0.8 seems to work best. (It’s worth noting that there’s no “theory” guiding this choice; it’s simply what has been found to work in practice. For instance, the term “temperature” is used because of its association with exponential distributions in statistical physics, but there’s no actual “physical” connection—at least, as far as we know.)
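A minimal sketch of what temperature-controlled sampling like this can look like, written in Python with NumPy. The candidate words, their probabilities, and the function name are made up for illustration; this is not ChatGPT's actual sampling code, only the general idea of re-weighting ranked candidates by a temperature:

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_word(words, probs, temperature=0.8):
    """Pick the next word from a probability-ranked candidate list.

    temperature = 0 always takes the top-ranked word; higher values make
    lower-ranked words progressively more likely to be chosen.
    """
    probs = np.asarray(probs, dtype=float)
    if temperature == 0:
        return words[int(np.argmax(probs))]
    weights = probs ** (1.0 / temperature)   # equivalent to dividing logits by T
    weights /= weights.sum()
    return rng.choice(words, p=weights)

# Hypothetical next-word candidates and probabilities (illustrative, not real model output).
words = ["learn", "predict", "make", "understand", "do"]
probs = np.array([0.048, 0.043, 0.038, 0.035, 0.030])
probs = probs / probs.sum()

print(sample_next_word(words, probs, temperature=0.8))   # varies from run to run
print(sample_next_word(words, probs, temperature=0.0))   # always the top-ranked word
```

Run repeatedly at temperature 0.8, the picks differ between runs, which is exactly why the same prompt can yield different essays.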
===============
It’s becoming slightly more “sensible looking.” We might imagine that if we could use sufficiently long n-grams, we’d essentially “get a ChatGPT”—in the sense that we’d have something capable of generating essay-length sequences of words with the “correct overall essay probabilities.” However, there’s a problem: there simply isn’t nearly enough English text that’s ever been written to deduce these probabilities accurately.
A crawl of the web might yield a few hundred billion words, and digitized books might add another hundred billion words. But with 40,000 common words, even the number of possible 2-grams already reaches 1.6 billion—and the number of possible 3-grams balloons to 60 trillion. So, estimating probabilities even for these is impractical using existing text alone. By the time we reach “essay fragments” of 20 words, the number of possible sequences is larger than the number of particles in the universe, meaning they could never all be documented.
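To make the sparsity problem concrete, here is a small Python sketch (the toy corpus and helper names are invented for illustration): a raw n-gram model can only assign nonzero probability to sequences it has literally seen, while the number of possible sequences explodes as vocab_size ** n:

```python
from collections import Counter

# Toy corpus standing in for "all the English text ever written".
corpus = "the cat sat on the mat and the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, nxt):
    """Estimate P(next word | previous word) from raw counts, as a pure n-gram model would."""
    total = sum(c for (a, _), c in bigrams.items() if a == prev)
    return bigrams[(prev, nxt)] / total if total else 0.0

print(bigram_prob("the", "cat"))   # seen in the corpus, so it gets a probability
print(bigram_prob("cat", "mat"))   # plausible but never seen, so it gets exactly 0

# The explosion: possible n-grams grow like vocab_size ** n.
vocab_size = 40_000
for n in (2, 3, 20):
    print(f"{float(vocab_size) ** n:.1e} possible {n}-grams")
```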
So, what can we do? The big idea is to create a model that allows us to estimate the probabilities of sequences occurring—even if we’ve never explicitly encountered those sequences in the text corpus we’re using. At the core of ChatGPT is precisely this type of model: a “large language model” (LLM) designed to excel at estimating these probabilities.
===============
It’s important to understand that there’s no such thing as a “model-less model.” Any model you use has a specific underlying structure, along with a certain set of “knobs you can turn” (i.e., parameters you can adjust) to fit the data. In the case of ChatGPT, there are a vast number of these “knobs”—actually, 175 billion of them.
The remarkable thing, however, is that the underlying structure of ChatGPT—with “just” that many parameters—is enough to create a model that computes next-word probabilities “well enough” to produce reasonable, essay-length pieces of text.
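As a minimal analogy for "knobs you can turn to fit the data" (my own illustrative example, not anything from the book's internals): a straight-line model has just two knobs, while ChatGPT's transformer structure has 175 billion of them.

```python
import numpy as np

# A model is a structure plus "knobs" (parameters) you adjust to fit data.
# Here the structure is y = a*x + b and the knobs are just a and b.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])   # made-up observations

a, b = np.polyfit(x, y, 1)                # turn the two knobs to best fit the data
print(f"fitted model: y = {a:.2f}*x + {b:.2f}")
```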
===============
In the earlier days of neural nets, there was a prevailing idea that one should “make the neural net do as little as possible.” For example, in converting speech to text, it was thought that the process should begin with analyzing the audio, breaking it down into phonemes, and so on. However, it was discovered that—at least for “human-like tasks”—it’s often more effective to train the neural net on the “end-to-end problem,” allowing it to “discover” the necessary intermediate features, encodings, and so forth on its own.
There was also a notion that one should introduce complex individual components into the neural net to, in effect, “explicitly implement specific algorithmic ideas.” Yet, this approach has largely proven ineffective. Instead, it’s generally better to work with very simple components and allow them to “organize themselves” (often in ways we don’t fully understand) to achieve what appears to be the equivalent of those algorithmic ideas.
===============
In other words, by this stage, the neural net is “incredibly certain” that the image represents a 4. To actually get the output “4,” we simply need to identify the position of the neuron with the largest value.
But what if we look one step earlier? The final operation in the network is a function called softmax, which works to “force certainty” in the output.
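What softmax "forcing certainty" means, and how the answer is then read off as the position of the largest value, can be sketched in a few lines of Python. The final-layer values below are made up for illustration:

```python
import numpy as np

def softmax(z):
    """Turn arbitrary real-valued neuron outputs into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical final-layer values for digit classes 0..9 (made up for illustration).
logits = np.array([-1.2, 0.3, -0.5, 1.0, 6.5, 0.1, -0.8, 0.4, 1.1, -0.3])
probs = softmax(logits)

print(probs.round(3))          # softmax "forces certainty": class 4 dominates
print(int(np.argmax(probs)))   # the position of the largest value gives the output: 4
```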
===============
We just discussed creating a characterization (and thus an embedding) for images based on identifying their similarity by determining whether they correspond to the same handwritten digit, according to our training set. We can apply this approach more broadly to images if we have a training set that identifies, for example, which of 5,000 common object types (cat, dog, chair, etc.) each image represents. In this way, we can create an image embedding that’s “anchored” by our identification of common objects, while also “generalizing around that” based on the neural net’s behavior.
The key point is that, to the extent this behavior aligns with how humans perceive and interpret images, this embedding will end up “feeling right” to us. It also becomes practically useful for performing tasks that require “human-like judgment.”
===============
How do we set up this problem for a neural net? Ultimately, we need to formulate everything in terms of numbers. One approach is to assign a unique number to each of the approximately 50,000 common words in English. For example, “the” might be assigned 914, and “cat” (with a space before it) might be 3542. (These are actual numbers used by GPT-2.) So, for a “the ___ cat” problem, our input could be {914, 3542}.
What should the output look like? Ideally, it should be a list of around 50,000 numbers representing the probabilities for each possible “fill-in” word. To create an embedding, we want to “intercept” the “inner workings” of the neural net just before it “reaches its conclusion.” By capturing the list of numbers that appear at this stage, we can think of them as a way to “characterize each word.”
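A drastically simplified sketch of "intercepting the inner workings just before the conclusion", assuming made-up weight matrices, toy dimensions, and my own names (W_embed, W_out, forward); the real network is vastly more elaborate, but the shapes tell the story:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 50_000, 64        # toy sizes (GPT-2's real embedding is 768-wide)
W_embed = rng.normal(size=(vocab_size, embed_dim)) * 0.02   # made-up "inner" weights
W_out   = rng.normal(size=(embed_dim, vocab_size)) * 0.02   # made-up output weights

def forward(token_ids):
    # Simplified stand-in for the real network: average the vectors of the
    # input tokens to get a single internal state for the sequence.
    hidden = W_embed[token_ids].mean(axis=0)
    logits = hidden @ W_out               # a score for every possible fill-in word
    return hidden, logits

# "the ___ cat" as the token numbers mentioned above.
hidden, logits = forward([914, 3542])
print(hidden.shape)   # (64,)    -> the "intercepted" numbers: a characterization/embedding
print(logits.shape)   # (50000,) -> would become ~50,000 probabilities after a softmax
```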
===============
If we measure distances between these vectors, we can identify “nearnesses” of words. Later, we’ll discuss in more detail the potential “cognitive” significance of such embeddings. For now, the main point is that we have a way to transform words into “neural-net-friendly” collections of numbers.
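One common way to measure such "nearness" is cosine similarity between embedding vectors. The vectors below are made up and only 4-dimensional (real embeddings have hundreds of dimensions); the point is just the mechanics:

```python
import numpy as np

def cosine_similarity(u, v):
    """Higher values mean the word vectors point in more similar directions."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up embeddings for illustration.
embeddings = {
    "cat":       np.array([0.9, 0.1, 0.3, 0.0]),
    "dog":       np.array([0.8, 0.2, 0.35, 0.05]),
    "alligator": np.array([0.1, 0.9, 0.2, 0.4]),
}

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))        # relatively near
print(cosine_similarity(embeddings["cat"], embeddings["alligator"]))  # farther apart
```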
But we can actually go further than characterizing individual words with these numerical collections; we can also apply this to sequences of words or even entire blocks of text. Inside ChatGPT, this is exactly how it operates: it takes the text it has so far and generates an embedding vector to represent it. Its goal is then to calculate the probabilities for different words that might come next, represented as a list of numbers indicating the probabilities for each of the roughly 50,000 possible words.
(Technically, ChatGPT doesn’t work with whole words, but rather with “tokens”—linguistic units that might be entire words or just parts, like “pre,” “ing,” or “ized.” Working with tokens allows ChatGPT to handle rare, compound, and non-English words more effectively and, occasionally, to invent new words.)
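A quick way to see tokens in practice, assuming the open-source tiktoken package (pip install tiktoken), which ships the GPT-2 token encoding. This is not literally ChatGPT's tokenizer, but it illustrates how common words stay whole while rarer words split into pieces:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["the cat", "prehistoric", "antidisestablishmentarianism"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(text, "->", ids, pieces)   # common words: one token; rare words: several pieces
```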
===============
Within each attention block, there is a collection of “attention heads” (12 for GPT-2 and 96 for ChatGPT’s GPT-3), each operating independently on different portions of the embedding vector. (And yes, we don’t have a specific reason why splitting up the embedding vector in this way is beneficial, nor do we fully understand what the different parts “mean”; it’s simply one of those things that has been “found to work.”)
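A bare-bones sketch of what "splitting the embedding vector across attention heads" means, using random data, GPT-2-like sizes, and no learned query/key/value projections (the function names and simplifications are mine, not the actual architecture code):

```python
import numpy as np

def split_into_heads(x, n_heads):
    """Split each position's embedding vector into n_heads equal slices."""
    seq_len, embed_dim = x.shape
    return x.reshape(seq_len, n_heads, embed_dim // n_heads).transpose(1, 0, 2)

def attention(q, k, v):
    """Scaled dot-product attention over one head's slice of the embedding."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, embed_dim, n_heads = 5, 768, 12       # GPT-2-like sizes; the data is random
x = rng.normal(size=(seq_len, embed_dim))

heads = split_into_heads(x, n_heads)           # shape (12, 5, 64)
outputs = [attention(h, h, h) for h in heads]  # each head works only on its own 64 numbers
combined = np.concatenate(outputs, axis=-1)    # reassemble into shape (5, 768)
print(heads.shape, combined.shape)
```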
===============
What determines this structure? Ultimately, it’s likely some kind of “neural net encoding” of features inherent to human language. However, what those features might actually be remains largely unknown. In effect, when we “open up the brain of ChatGPT” (or at least GPT-2), we find that, yes, it’s complicated inside, and we don’t fully understand it—even though, in the end, it produces recognizable human language.
===============
Even in the seemingly simple cases of learning numerical functions that we discussed earlier, we often found it necessary to use millions of examples to successfully train a network, at least from scratch. So, how many examples might be needed to train a “human-like language” model? There doesn’t appear to be any fundamental “theoretical” way to determine this. In practice, however, ChatGPT was successfully trained on a few hundred billion words of text.
Some of this text was fed into the model multiple times, while some was only seen once. Yet, somehow, it “absorbed” the necessary information from the text it encountered. Given this volume of text, how large a network should be required to “learn it well”? Again, we don’t yet have a fundamental theoretical way to answer this. Ultimately—as we’ll discuss further below—there’s likely a certain “total algorithmic content” to human language and what humans typically express with it. The next question, then, is how efficiently a neural net can implement a model based on this algorithmic content. And, once again, we don’t know—though the success of ChatGPT suggests that it operates with reasonable efficiency.
===============
When we run ChatGPT to generate text, we essentially need to use each weight once. So if there are n weights, we require approximately n computational steps—although, in practice, many of these steps can be done in parallel on GPUs. If we need around n words of training data to configure these weights, then, based on what we’ve discussed, we can conclude that approximately n² computational steps are required to train the network. This is one reason why, with current methods, training large language models often involves billion-dollar efforts.
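A back-of-the-envelope version of that "n weights times roughly n training words gives about n² steps" estimate; all numbers below are rough, illustrative assumptions rather than an actual cost accounting:

```python
# Rough, illustrative estimate of training compute from the n^2 argument above.
n_weights = 175e9           # ChatGPT-scale parameter count
n_training_words = 300e9    # "a few hundred billion words" of text

steps_per_word = n_weights  # roughly one use of each weight per word processed
total_training_steps = n_training_words * steps_per_word

print(f"{total_training_steps:.1e} computational steps")   # on the order of 5e22
```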
===============
And, yes, up to a certain length, the network performs just fine. But then it starts to struggle. This is a typical issue in a “precise” scenario with neural nets (or with machine learning in general). Cases that a human “can solve at a glance” are often solvable by the neural net as well. However, when a task requires something “more algorithmic” (such as explicitly counting parentheses to ensure they’re closed), the neural net tends to be “too computationally shallow” to reliably handle it. (Incidentally, even the current full version of ChatGPT has difficulty correctly matching parentheses in long sequences.)
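The "explicitly counting parentheses" procedure is trivial to write down as a loop, which is exactly what makes the contrast interesting: a short, step-by-step algorithm like the Python sketch below handles arbitrarily long sequences, whereas a computationally shallow net cannot reliably do the unbounded counting.

```python
def parens_balanced(s: str) -> bool:
    """Explicitly count open parentheses: the kind of step-by-step, unbounded
    procedure that a fixed-depth neural net struggles to emulate reliably."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a closer with no matching opener
                return False
    return depth == 0

print(parens_balanced("(()(()))"))    # True
print(parens_balanced("(()(())))"))   # False: easy for this loop, hard "at a glance"
```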
So what does this mean for systems like ChatGPT and the syntax of a language like English? The structure of parentheses is “austere” and more of an “algorithmic challenge.” In English, however, it’s more practical to “guess” what fits grammatically based on local word choices and contextual hints. The neural net is much better at this, even though it might occasionally miss a “formally correct” structure that humans might overlook as well.
The key point is that the existence of an overall syntactic structure in language—with all the regularity this implies—effectively limits “how much” the neural net needs to learn. A crucial “natural-science-like” insight is that the transformer architecture in neural nets like the one in ChatGPT appears capable of learning the kind of nested, tree-like syntactic structure that seems to exist (at least approximately) in all human languages.
===============
There’s certainly no “geometrically obvious” law of motion here. This isn’t surprising; we fully expect this to be a considerably more complex situation. For example, even if a “semantic law of motion” does exist, it’s far from obvious what kind of embedding (or, effectively, what “variables”) it would be most naturally expressed in.
===============
And, yes, this seems like a mess—and it doesn’t particularly encourage the idea that we can expect to identify “mathematical-physics-like” “semantic laws of motion” by empirically studying “what ChatGPT is doing inside.” However, it’s possible we’re simply looking at the “wrong variables” (or the wrong coordinate system), and that if we could find the right one, we might immediately see ChatGPT doing something “mathematical-physics-simple,” like following geodesics. But as of now, we’re not yet able to “empirically decode” from its “internal behavior” what ChatGPT has “discovered” about the structure of human language.
===============
Semantic Grammar and the Power of Computational Language
What does it take to produce “meaningful human language”? In the past, we might have assumed that this task could only be achieved by a human brain. But now we know that it can be done quite effectively by the neural network of ChatGPT. Still, perhaps that’s the limit, and there might be nothing simpler—or more comprehensible to humans—that could accomplish this.
However, I strongly suspect that the success of ChatGPT implicitly reveals an important scientific insight: that there’s actually far more structure and simplicity to meaningful human language than we previously understood. In the end, there may even be relatively simple rules that describe how such language can be constructed.
===============
Semantic Grammar and the Power of Computational Language
◆ But my strong suspicion is that the success of ChatGPT implicitly reveals an important “scientific” fact: that there’s actually a lot more structure and simplicity to meaningful human language than we ever knew—and that in the end there may be even fairly simple rules that describe how such language can be put together.
◆ It’s worth mentioning that even if a sentence is perfectly OK according to the semantic grammar, that doesn’t mean it’s been realized (or even could be realized) in practice. “The elephant traveled to the Moon” would doubtless “pass” our semantic grammar, but it certainly hasn’t been realized (at least yet) in our actual world—though it’s absolutely fair game for a fictional world.
◆ How should one figure out the fundamental “ontology” suitable for a general symbolic discourse language? Well, it’s not easy. Which is perhaps why little has been done on this since the primitive beginnings Aristotle made more than two millennia ago. But it really helps that today we know so much about how to think about the world computationally (and it doesn’t hurt to have a “fundamental metaphysics” from our Physics Project and the idea of the ruliad).
So … What Is ChatGPT Doing, and Why Does It Work?
◆ But for now it’s exciting to see what ChatGPT has already been able to do. At some level it’s a great example of the fundamental scientific fact that large numbers of simple computational elements can do remarkable and unexpected things. But it also provides perhaps the best impetus we’ve had in two thousand years to understand better just what the fundamental character and principles might be of that central feature of the human condition that is human language and the processes of thinking behind it.
A Few More Examples
◆ And, yes, one can imagine finding a way to “fix this particular bug”. But the point is that the fundamental idea of a generative-language-based AI system like ChatGPT just isn’t a good fit in situations where there are structured computational things to do. Put another way, it’d take “fixing” an almost infinite number of “bugs” to patch up what even an almost-infinitesimal corner of Wolfram|Alpha can achieve in its structured way.
The Path Forward
◆ What about ChatGPT directly learning Wolfram Language? Well, yes, it could do that, and in fact it’s already started. And in the end I fully expect that something like ChatGPT will be able to operate directly in Wolfram Language, and be very powerful in doing so. It’s an interesting and unique situation, made possible by the character of the Wolfram Language as a full-scale computational language that can talk broadly about things in the world and elsewhere in computational terms.
Additional Resources
◆ “What Is ChatGPT Doing … and Why Does It Work?” Online version with runnable code: wolfr.am/SW-ChatGPT
◆ “Machine Learning for Middle Schoolers” (by Stephen Wolfram) A short introduction to the basic concepts of machine learning: wolfr.am/ML-for-middle-schoolers
— From WeChat Reading