This week, Meta announced a language model called Galactica that, given a query, could produce plausible-looking texts in scientific domains. A specific claim in the paper is:

… It could synthesize knowledge by generating secondary content automatically: such as literature reviews, encyclopedia articles, lecture notes and more. And lastly, it could organize different modalities: linking papers with code, protein sequences with compounds, theories with LaTeX, and more. …

The team were kind enough to provide a live demo, beautifully presented, which demonstrated some of these capabilities. Like everyone else, I tried it, and can report that it was reliably pretty good at capturing the style. Certainly, the vision of being able to produce customized secondary content on demand is attractive. On the other hand, what if the secondary content is unreliable or misleading? It is well-known that texts generated by previous language models were impressively fluent, especially within a narrow window, but prone to inconsistencies at higher levels of structure. For example, in a story, the topic may drift in unexpected ways, or character names may change for no apparent reason, or things may change color more or less at random. It appears that these models are not maintaining a consistent enough representation of “what is going on in the story”. An advocate will argue that they must be doing at least some of this, because aspects of the output are consistent, while a skeptic will draw attention to the deficiencies. If the sole purpose of the model is to generate engaging output, this might be OK. If, for some reason, you need the story to be consistent, it is not.

Where does Galactica, which targets scientific knowledge, stand relative to this kind of requirement? Unfortunately, despite all its strengths, not outstandingly well. In fact, when people began to play with the demo, it turned out not to be at all hard to obtain output that was variously offensive, disturbing, generally misguided or just plain wrong. This produced a storm of bad publicity, after which the demo was taken down. Part of the issue is the simple fact that if a high-profile company puts a demo up on the Internet, there will always be people motivated to break it. But there is clearly more to it than that.

One issue is the enabling of scientific malpractice. Koestler’s “The Case of the Midwife Toad” and other examples, such as Cyril Burt’s twin studies, show that researchers can gain short-term benefit and reputation by publishing bogus studies. The base rate of serious scientific malpractice in the peer-reviewed and published research literature is unknown, probably quite low, and the review process is designed to keep it that way. It is convenient that substantively bad scientific work is often bad along other easily detectable dimensions, such as terminology and style. This facilitates the peer review process, because rejecting a paper that is obviously very badly written takes less reviewer effort. Since Galactica produces text that looks good along exactly these dimensions, and some people will inevitably submit machine-generated papers, it runs the risk of adding to the burden on the review process.

A second issue, which turned up even more strongly in a system called ChatAI, is that models like this are very good at pastiche: the emulation of a style. To determine whether something ‘looks right’ we rely on shallow patterns, and these models are very good at reproducing them. This matters because, in our experience with people, a command of style usually goes together with a command of the content. A person who can produce math that is correct in content and in good style could also produce material that is in good style but incorrect in content. But they usually don’t. So when we encounter model output, we tend to evaluate it on the assumption that style and content are pointing in the same direction.

Abandoning that assumption is hard for us to do. We tend to fall back into the mental shortcut of relying on the correlation. Consequently, we tend to cut the models more slack than they deserve. The model’s failings don’t look like the failings of an ignorant or incompetent human being: they look like the failings of a very smart person. If we even notice that they are failings at all, they still look pretty good to us. That’s because, in the world of humans, the only likely way to acquire the ability to pastiche good math writing is to learn how to do the math for real, so the best explanation for something that can do this is that it is somehow extremely smart.

In reality, we need to recognize that their failings are not like the failings of any human; they lie along different dimensions, and our intuitions fail us. The remedy is not to rely on “looks good to me”, but instead to carefully probe the abilities of whatever you are studying. Incidentally, a course in experimental psychology will almost inevitably make you question “looks good to me” as a way of finding out anything substantial about humans.