Tasks and Benchmarks for an AI Scientist
A lot of stuff has been flying around recently on the question of what impact machine learning and artificial intelligence will have on the way we do science. The main question seems to be: if we had an AI scientist, call it a Large Science Model (LSM), how would we evaluate it? How do we know if ChatGPT is ready to start doing science on its own? (It is not.) How will we know, when GPT-4 comes out, whether it will be ready? Twitter will no doubt be full of cherrypicked examples of GPT-4 solving math and science problems, but such examples are a far cry from the ability to do science that involves interfacing with facts, modeling the unknown world, and distinguishing truth from falsehood from maybe from “it’s complicated”. How will we tell whether it is ready, and if it is not, why not?
The business of science involves formulating a model for the world; attempting to reduce the uncertainty of the model using known facts; and then attempting to further reduce the uncertainty of the world using experiments. The task of formulating a predictive model of the world is distinct from other tasks in machine learning because scientific data is sparse and uncertain compared to, say, language or image data. Thus, an LSM will likely need to perform tasks of inference and reasoning, not just statistical interpolation, in the presence of semantic ambiguity and significant uncertainty. I think that the tasks we should expect to evaluate LSMs on break down into essentially three categories: retrieval, experimentation, and reasoning. In what follows, I describe these tasks, give some examples, and explore how ChatGPT currently performs on each.
The examples below are fun, but the next step in actually building an AI scientist is going to be to assemble some real benchmarks and corpuses for these tasks. I have some concrete ideas about how to do this and am starting to work on it. If you are interested, get in touch.
Task Group 1: Retrieval
The first major task area, retrieval of knowledge from the literature, is the most obvious. Some applications, like Elicit, are already working on this task area, but I think the potential goes far beyond what is currently possible. I see this task as having three tiers:
Task 1.1: Retrieval of literature that provides or contradicts a fact
This is the “Google search” of literature retrieval, and is relatively easy. Examples for this task are also straightforward to generate, but are in general much more complicated than what is represented in the USMLE:
“What protein is mScarlet derived from?”
ChatGPT 🤖 answers that mScarlet is derived from mCherry, which is correct ✅.
“Has anyone ever tried to use gap filling on padlock probes to sequence barcodes in an in-situ-sequencing experiment?”
The answer is yes, commonly. ChatGPT 🤖 was “unable to find any specific examples” and suggested I contact a knowledgeable researcher 😆. ❌
“Bacteria are growing in the waste stream I use for Qiagen buffers. Can I bleach it?”
The answer is no, absolutely not: adding bleach to Qiagen buffers will produce cyanide and/or chlorine gas. ChatGPT 🤖 says yes, you can 💀💀💀. Interestingly, ChatGPT knows that adding bleach to guanidinium chloride or guanidinium isothiocyanate can produce chlorine or cyanide gas, respectively, but ChatGPT appears not to know that Qiagen uses guanidinium in its buffers, even though that information is readily available online. ❌
“Why don’t most culture microscopes have a far red filter set?”
The answer is because culture microscopes are usually used for direct ocular observation, and human eyes are not sensitive to far red channels. ChatGPT 🤖 gives a variety of other plausible but incorrect answers, for example that far red light is not strongly absorbed by most dyes. ❌
"Why do people use MMLV rather than lentivirus for lineage tracing experiments in the brain?"
The answer is that MMLV will only integrate into actively dividing cells, whereas lentivirus can also integrate into non-dividing cells, and for lineage tracing experiments one usually wants to infect only actively dividing progenitors. ChatGPT 🤖 says you can use either, which is maybe technically true but basically misses the point. ❌
Task 1.2: Retrieval of literature that implies or suggests a fact, or implies or suggests the negation of the fact
This is the (significantly) harder version: given a statement that is not directly stated in the literature, can the model identify other statements in the literature (across several papers) that together imply or negate the statement? This requires a combination of logical reasoning with the previous factual retrieval task. It is hard to come up with examples for this task that are distinct from the examples in Tier 1, because it is hard to prove that a given statement is not already presented somewhere in the literature. My guess is that the easiest way to create examples for this task will be to combine statements from recently published papers, or to mask statements in the discussion sections of papers or in review papers.
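To make the masking idea concrete, here is a minimal sketch that turns each substantive sentence of a discussion section into a cloze-style benchmark item, with the rest of the section as supporting context. The eight-word cutoff for "substantive" is an arbitrary assumption, and a real benchmark would mask claims across several papers, not one section.

```python
import re

def make_masked_examples(discussion_text, min_words=8):
    """Turn each sufficiently long sentence of a discussion section into a
    cloze-style benchmark item: the masked sentence is the target, and the
    rest of the section is the supporting context."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", discussion_text) if s.strip()]
    examples = []
    for i, sent in enumerate(sentences):
        if len(sent.split()) < min_words:
            continue  # very short sentences rarely state a substantive claim
        context = " ".join(sentences[:i] + ["[MASKED STATEMENT]"] + sentences[i + 1:])
        examples.append({"context": context, "target": sent})
    return examples
```

A model would then be scored on whether it can recover (or at least entail) the masked claim from the surrounding context plus retrieval over the literature.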
All statements in science that are not direct experimental observations have some degree of certainty associated with them. Even experimental outcomes have a degree of certainty associated with them, because the full parameters underlying an experiment are never known for sure. Thus, there is potentially a sub-task here, which is assignment of confidence to statements. Given a statement which is suggested (or whose negation is suggested) by the literature, can the model assign a confidence to the statement or its negation, and can it provide reasoning and citations? E.g., is a model capable of describing that the preponderance of evidence suggests that a particular fact may be true? At a higher level, could a model infer that papers by a given author tend to have sub-par controls in the supplementary data, and thus one should not in general trust the conclusions?
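The confidence sub-task reduces to a calibration measurement: collect statements with known ground truth, ask the model for a probability that each is true, and score the result. A minimal sketch using the Brier score (the confidences and labels below are invented for illustration):

```python
def brier_score(predictions):
    """Mean squared error between the model's predicted confidence that a
    statement is true and the ground-truth label (1 = true, 0 = false).
    Lower is better; a model that always answers 0.5 scores 0.25."""
    return sum((p - y) ** 2 for p, y in predictions) / len(predictions)

# (confidence the model assigned, whether the statement was actually true)
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.4, 0), (0.2, 0)]
```

This only evaluates the numbers; grading the accompanying reasoning and citations is a separate, harder problem.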
Task 1.3: Retrieval of latent knowledge
The hardest tier involves retrieving facts that are true, but that are neither written down nor directly implied or suggested by anything in the literature. The answers to these questions are usually only available to experienced practitioners:
Sometimes they are so obvious to practitioners that no one bothers to write them down explicitly.
Other times, people writing review articles on a topic may have distorted views of the utility of their own technologies, have limited experience with the alternatives, or may be incentivized to sell their technology as more useful or more general than it actually is.
Other times, people are simply bad at determining what information is important to convey to other practitioners, and so may omit critical facts. (See my note about sensitivity/specificity tradeoffs at the end.) Dan Goodwin recommended the story about the measurement of the quality factor of sapphire as an example, which involved a technician greasing a thread with oil from the bridge of his nose.
Finally, sometimes, the facts themselves are difficult to validate, so they do not have unique answers or are rarely written down in a peer-reviewed format.
As a result, when starting a new project, we usually have to spend years developing expertise in an existing method and figuring out how well it works before we can think about applying it to a new problem. This is what I call the “if only we had known” phenomenon: there is a critical fact buried in the literature that would demonstrate that a specific experiment is wrong or futile, but experimenters only learn that fact after years of work. If machine learning could make it easier to adopt new methods and to extract truth from the literature, it would be transformative for all of biology. (And some preliminary work suggests machine learning algorithms may actually be very good at these tasks.)
Most examples for this task fall broadly into two categories: “why do people” questions and “in practice” questions. Take these with a grain of salt: I cannot prove that these facts are not written anywhere in the literature. Indeed, it is an interesting possibility that most latent knowledge IS written somewhere in the literature, if only we had full knowledge of it. But my guess is that, in some very niche areas, a lot of latent knowledge simply isn’t written down.
“Why do people” questions:
“Why do people use iodixanol rather than sucrose to create density gradients for purifying AAVs?”
There are many possible answers, but I think the key one is that iodixanol forms its own density gradients, whereas sucrose needs to be layered, so it is less work to use iodixanol. ChatGPT 🤖 mentioned several other possibilities, such as the difference in osmolality and the fact that sucrose is less expensive and may thus be preferred in some cases. Partial credit. 🔶
“Why don’t neuroscientists ever use Cre recombinase to control gene expression off of a Rabies virus for circuit tracing?”
The answer is that Cre is a DNA recombinase and rabies is an RNA virus. ChatGPT 🤖 fails to understand the question and just babbles. ❌
“When doing a pooled CRISPR screen, why do people use lentiviral libraries rather than AAV libraries?”
The answer is that you need to be able to expand the cell population after infection, and lentivirus is integrating whereas AAV is not. ChatGPT 🤖 provides several advantages of lentivirus over AAV, including the fact that lentivirus is integrating, but does not connect them to the requirements of a pooled CRISPR screen. ❌
“Why don’t people use antibody-oligo conjugates more for multiplexed antibody staining?”
Anyone who has ever touched an antibody-oligo conjugate will know that they have terrible off-target effects, and that that is the primary limitation. ChatGPT 🤖 mentions “complexity,” “limited availability,” “signal intensity,” “cost,” and “non-specific binding.” Again, it provides the correct answer, but it provides 4 answers that are basically incorrect. It’s guessing. ❌
“In practice” questions:
“How well does expansion microscopy work in practice, and what are the biggest challenges?”
A good answer here should mention that the original expansion microscopy protocols are actually pretty straightforward, but they require some skilled manual handling of samples. In addition, more recent protocols are very long, and it is difficult to obtain high-quality sodium acrylate. Instead, ChatGPT 🤖 mentions the fact that ExM can achieve high resolution, and says the biggest challenge is ensuring the expansion process is homogeneous, which is actually not a challenge at all for most protocols. ❌
“How hard is it in practice to make minibinders?”
I would expect an answer here to mention the fact that you have to do yeast display, and usually get between 10 and 100 good binders out of a library of maybe 10,000-100,000. ChatGPT 🤖 mentions that you need to first predict some binders and then do subsequent mutagenesis, but doesn’t actually provide any details about how hard it is, or why it is hard. ❌
“How hard is it in practice to create an AAV that is specific for a particular cell type? What is the hardest part in practice?”
I would expect an answer here to mention that the hardest part is either producing the viral library or actually conducting the rounds of selection. Instead, ChatGPT 🤖 says that the hardest part is evolving the capsid, which is, in fact, the entirety of the task, and doesn’t make sense as an answer. ❌
Task Group 2: Experimentation
When the uncertainty of a statement cannot be sufficiently reduced using the literature, experiments are necessary. Thus, the second major task area involves planning experiments to validate hypotheses. Given an uncertain statement that we cannot find the answer to in the literature, how do we test it in the natural world?
Task 2.1: De novo protocol generation
Given an objective, a model should be able to generate a protocol to achieve that objective (e.g., to disprove a hypothesis). This task is straightforward. ChatGPT performs well on two of the most common protocols in biology, and (predictably) poorly on two rarer protocols.
“I have HEK cells growing in a 10 centimeter dish and I need to passage them. What is the first step in the protocol?”
The answer is to remove the old media, which ChatGPT 🤖 gets correct. ✅
“I am doing a western blot and have just incubated the membrane in my primary antibody. What are the next three steps of the protocol?”
You need to wash, add the secondary antibody, and wash again. ChatGPT 🤖 says you need to wash, add the secondary antibody, and then add the detection reagent. Conceptually it’s basically correct, but as a protocol (e.g. if ChatGPT were controlling a lab automation setup) this would fail. Partial credit. 🔶
“I am doing hybridization chain reaction and have just finished washing off the probes. I am ready to start the amplification step. What is the first step in the protocol?”
I would accept either washing the tissue in amplification buffer or snapcooling the hairpins. Instead, ChatGPT 🤖 says “the first step in the amplification protocol is typically to add a mixture of amplification enzymes and cofactors to the reaction mixture.” This is completely wrong, since HCR is non-enzymatic. It appears that ChatGPT has simply substituted in “HCR” to a generic response about amplification techniques. ❌
"I am doing Slide-seq and have just melted the tissue slice onto my array. What is the next step in the protocol?"
Note that Slide-seq was published in 2019 and is thus technically within the training domain of ChatGPT. The next step is to do a permeabilization wash and then reverse transcription. ChatGPT 🤖 says “After melting the tissue slice onto the array in a Slide-seq experiment, the next step in the protocol is typically to perform in situ hybridization (ISH). This involves incubating the array with a mixture of RNA probes that are complementary to the transcripts of interest.” This is completely wrong, but it clearly has made an association between Slide-seq and spatial transcriptomics, and is providing next steps on a FISH protocol, which it probably knows about. ❌
Once algorithms actually get better at this task, it seems like a primary challenge will be in semantics. To determine whether a predicted protocol step is correct, we must be able to determine whether two different protocol steps are nonetheless semantically equivalent or would result in an identical outcome. Some small changes in protocol wording would lead to discontinuous (or disastrous) outcomes, while other major rephrasings would lead to the same result. In this way, this task seems conceptually similar to the AlphaCode task of creating new programs, but without access to an easy-to-run oracle to validate programs, which will make the challenge much, much harder. (In addition, even if one could run an oracle, e.g. by performing many experiments in the lab, the oracle would likely have a latency of several days.)
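As a toy illustration of why this is hard, here is a bag-of-words cosine similarity between protocol steps -- exactly the kind of surface measure a benchmark cannot rely on, since it misses both semantic equivalence ("snap-cool the hairpins" vs. "heat and rapidly cool the hairpins") and critical small differences ("wash" vs. "do not wash"):

```python
from collections import Counter
from math import sqrt

def step_similarity(step_a, step_b):
    """Cosine similarity between bag-of-words vectors of two protocol steps.
    A real benchmark would need learned embeddings, an outcome simulator, or
    a human judge: surface similarity cannot tell whether two steps would
    produce the same result at the bench."""
    a, b = Counter(step_a.lower().split()), Counter(step_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Such a measure ranks paraphrases sensibly in easy cases but has no notion of experimental outcome, which is the quantity that actually matters.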
Task 2.2: Generation of automation code
It seems obvious that a key task for large science models eventually will be generation of code for robotic automation. The combination of LSMs with robotic automation will ultimately pave the way for “self-driving labs.” I have no examples for this, but it is straightforward.
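To give a sense of the target output, here is a hypothetical sketch of what generated automation code might look like. The Pipette class below is invented as a stand-in; real liquid-handling frameworks (e.g. the Opentrons Python API) expose a similar step-based interface, and an LSM would need to emit calls against such an API:

```python
class Pipette:
    """Minimal stand-in for a liquid-handling robot API (hypothetical; real
    frameworks expose similar step-based methods). Records steps in a log
    instead of driving hardware."""
    def __init__(self):
        self.log = []

    def transfer(self, volume_ul, source, dest):
        self.log.append(f"transfer {volume_ul} uL from {source} to {dest}")

    def mix(self, reps, volume_ul, well):
        self.log.append(f"mix {reps}x {volume_ul} uL in {well}")

def passage_first_step(p):
    # "Remove the old media" from the HEK passaging example above,
    # expressed as a machine-executable step (volume is made up).
    p.transfer(10_000, "dish_1", "waste")
```

The validation problem from Task 2.1 carries over directly: a generated script can be syntactically valid and still encode a protocol that fails at the bench.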
Task 2.3: Protocol annotation
“I am doing a single cell RNA sequencing experiment. What should my RT primer look like?”
Your RT primer generally needs to contain a PCR handle, a UMI, a cell barcode, and a poly(T) sequence, although there are variations in which the UMI or cell barcode may be supplied at a different step. ChatGPT 🤖 fails in an extremely strange way. It mentions that we need a UMI, which is optional, a barcode sequence, also optional, a sequence complementary to the 3’ end of the RNA and, drumroll please… a T7 promoter sequence…??? Hmm… 🤔 ❌
“I need to purify APEX-SpyCatcher. Anything I should be aware of?”
You should make sure not to use StrepTag for affinity purification. APEX is usually used to biotinylate proteins, and if you use StrepTag for purification you will be unable to separate the strep-tagged APEX from biotinylated proteins. You should also be aware that APEX is known to multimerize at high concentrations, which leads to a loss of activity, so when purifying, you need to selectively elute the low molecular weight fraction. ChatGPT 🤖 does not know what APEX is, and gets the question wrong in a very compelling way, see below. ❌
“I am preparing to do a miniprep for the first time. What are the most common ways people mess up minipreps, and what do I need to watch out for?”
Everyone, at some point in their lives, forgets to add ethanol to the miniprep column wash buffers. This is a mistake everyone makes exactly once. ChatGPT 🤖 misses that example, and provides a number of other examples which are mostly extraneous. The only one it provides which seems really relevant is “overloading the column,” which can indeed be a problem. Possibly partial credit. 🔶
“I have a new AAV variant and I want to compare its infectivity to the infectivity of AAV2 in a specific cell type. I am preparing the experiment now. What do I need to control for?”
Anyone who does these experiments realizes that it is very difficult to ensure that two distinct viruses have comparable titers (functional and physical), because different batches of the same AAV serotype can have different ratios of functional to physical titer. The best answer would probably be to measure their physical titers and normalize them. ChatGPT 🤖 mentions AAV dose, and then provides several extraneous and irrelevant factors, like cell confluency, cell density, etc., which are unimportant because presumably you would test the two viruses in the same cells simultaneously. Partial credit. 🔶
"I am trying to test a new transfection method. My experimental condition is to transfect cells with a plasmid that contains luciferase. I will then measure the amount of luciferase produced. What controls do I need?"
You need a positive control for the luciferase assay, which involves transfecting cells with a luciferase plasmid using an established protocol; and a negative control, which involves not transfecting the cells at all. It would also make sense to transfect the cells using your new method with a dummy plasmid, like pUC19, in case your new transfection method somehow generates background in the luciferase assay. You may also want to do a live/dead assay and normalize the luciferase by the number of viable cells. ChatGPT 🤖 recommends a negative control and a positive control, although for the positive control it says the important thing is to use a well-known active form of luciferase, which is incorrect. The experiment would not work if you use a different luciferase for the positive control. It seems like ChatGPT understands the form of controls but does not understand that the goal is to test the transfection method, or cannot use that information to inform its choice of controls. ❌
"I am exploring differential gene expression between two conditions. I have six replicates for the experimental condition and six replicates for a control. I have a number of statistically significant hits, but I want to be sure that my statistical test is working. What should I do?"
The answer is to just compare some of the controls to each other using the statistical test, and check that the false positive rate is as you expect. ChatGPT 🤖 suggests a lot of random stuff like “Visualize the data,” “check for confounds,” and “consult with a statistics expert.” No value-add here. ❌
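The correct answer can be sketched directly: simulate (or subsample) replicates with no real effect, split them three versus three, apply your test, and confirm it rejects at roughly its nominal rate. The normal data and pooled t-test below are illustrative assumptions, not a prescription for any particular pipeline:

```python
import random
from math import sqrt
from statistics import mean, stdev

T_CRIT_DF4 = 2.776  # two-sided 5% critical value for Student's t, df = 4

def significant(x, y, t_crit=T_CRIT_DF4):
    """Pooled-variance two-sample t-test at alpha = 0.05 for 3 vs 3."""
    n = len(x)
    sp2 = (stdev(x) ** 2 + stdev(y) ** 2) / 2  # pooled variance (equal n)
    t = (mean(x) - mean(y)) / sqrt(sp2 * 2 / n)
    return abs(t) > t_crit

def control_vs_control_fpr(n_sims=2000):
    """Generate null 'control' replicates, split 3 vs 3, and measure how
    often the test (wrongly) rejects. A calibrated test should land near 5%."""
    rng = random.Random(0)
    hits = 0
    for _ in range(n_sims):
        c = [rng.gauss(0, 1) for _ in range(6)]
        if significant(c[:3], c[3:]):
            hits += 1
    return hits / n_sims
```

If the empirical false positive rate is far from the nominal alpha, the hits in the real comparison cannot be trusted.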
This task will also encounter one of the core difficulties that I think LSMs will encounter, which is a sensitivity/specificity tradeoff in experimental design and communication. I discuss this in the closing notes below. It will also suffer from the same challenge mentioned above about validation.
Task Group 3: Reasoning
The final set of tasks are “reasoning” tasks, i.e., those that require higher-order manipulation of concepts. These are the tasks that occupy most of a scientist’s time, and that typically distinguish very good scientists from exceptional scientists.
Task 3.1: Prediction
Given an experiment, can the model predict the outcome? This task is straightforward.
"I am going to fuse GFP to the surface of my AAV at residue 530. Do you think this will affect the titer of the virus during production?"
The answer is definitely yes. ChatGPT 🤖 says it might affect the titer, and suggests I think about fusion protein size, fusion protein localization, and fusion protein stability. Its answer is useless. ❌
"I am going to try to infect cells with AAV2 in the presence of 1mM heparin sulfate. What will happen?"
HSPG is a primary receptor for AAV2. Adding a lot of HSPG in the media might be expected to reduce infectivity of AAV2 for the target cells; but some literature suggests it can also increase infectivity in some cases. To its credit, ChatGPT 🤖 mentions that heparin sulfate may compete with HSPG. But mostly its response is generic and not useful. ❌
Task 3.2: Interpretation
Given data that is known in the literature and a description of a protocol, can the algorithm provide a mechanistic interpretation? ChatGPT’s performance here is limited by the fact that it doesn’t provide citations: it makes suggestions, but those suggestions can’t be evaluated without evidence. The ability to provide specific references will be a prerequisite even to evaluating the model’s performance on this kind of task.
"Cells are dying when we electroporate with 600ng/uL of RNA outside the cell, but not when we use 450ng/uL RNA. Why might that be?"
I don’t know the answer to this question. ChatGPT 🤖 provides a bunch of interesting suggestions, such as the idea that the RNA may be toxic or that it might interfere with the “formation of a stable electric field” (???). However, without any citations, it’s hard to evaluate those answers. ❓
"I am working with a new highly stable fluorescent protein, and I noticed that it doesn’t get denatured in high concentrations of guanidinium chloride, unlike GFP. However, it does get denatured in high concentrations of guanidinium isothiocyanate. Any idea why?"
The answer to this question is actually known, and Elicit was able to find it: the isothiocyanate salt cooperates with the guanidinium in denaturing the backbone. ChatGPT 🤖 provides some suggestions that seem improbable, such as that GITC binds to more hydrophilic regions and GdmCl binds to more hydrophobic regions. And, as before, these suggestions are not very valuable without evidence. ❓
Obviously, at a higher level, one would hope that a model could actually look at the data itself.
Task 3.3: Ideation
Given an objective, can the model suggest ideas for how to overcome the challenge? I actually think this task will be relatively easy for LSMs that are good at protocol generation. My intuition is that “ideas” are actually clusters of protocols in a suitably defined latent space, and that the process of ideation is essentially the process of identifying clusters in that space. My guess is that algorithms that are good at protocol generation will also therefore be good at ideation. (However, if given a very high level objective, it may be difficult to determine whether a given protocol is actually getting closer to satisfying the objective. And, the hardest part about evaluating a model on an ideation task will be validating that its ideas are actually good.)
The impact of generative idea models for science could be huge. There is another common phenomenon in bioengineering, which I call the “we should have thought of this years ago” phenomenon: you can be a distance epsilon away from a superior idea for a long time -- years, sometimes -- before finally stumbling upon it. Sometimes, you didn’t have the information you needed to realize that it was a good idea, and sometimes you just didn’t think about it, and sometimes it wasn’t possible then but it is now. Machine learning models may be able to supply large numbers of ideas systematically, thus preventing you from “not thinking about” a particular idea. In an ideal world, you could even imagine that if you had access to a catalog of all possible ways to solve a problem, you could design optimal experiments to discriminate between those ideas. Unfortunately, for the moment at least, creativity in science currently seems to be beyond ChatGPT.
"I need to come up with a way to detect cancer cell clones when there are only 10,000 of them in the body. Can you suggest any ideas?"
As expected, ChatGPT 🤖 suggests some generic things like looking for circulating tumor cells or cell surface markers or cfDNA. But nothing creative or interesting. ❌
"I need to find a way to get AAVs to cross the blood-brain barrier. Can you suggest any ideas?"
Same as above. ChatGPT 🤖 mentions some normal stuff like convection-enhanced delivery and focused ultrasound. Nothing interesting. ❌
Caveat 1: We may need more data-efficient algorithms to achieve the tasks above.
The scientific literature is small: it consists of roughly 100M to 200M papers total, each constituting probably around 2000 tokens, for a grand total of likely only a few hundred billion tokens. (Most of these papers are not currently available, although that problem may be solvable.) ArXiv is only 20B tokens, and has already been exhausted. Similarly, Galactica has been trained on 100B tokens derived mostly from 48 million papers, and yet demonstrates relatively underwhelming performance on a range of simple metrics, such as MedQA-USMLE. These models are far from being able to perform the tasks listed above. Fine-tuning models that are trained on even larger corpuses of general text may lead to better results. However, fundamentally new, more data-efficient approaches may also be necessary. Dan Goodwin has a good write-up on this point, here.
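Spelled out, the back-of-the-envelope estimate above is (all numbers are the rough assumptions stated in the text):

```python
papers = 150e6            # ~100M-200M papers total; take the midpoint
tokens_per_paper = 2000   # rough per-paper estimate from the text
total_tokens = papers * tokens_per_paper
# ~3e11: a few hundred billion tokens, i.e. small relative to the
# web-scale corpuses used to train general-purpose language models
```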
Caveat 2: Validation of information in science is often as hard as gathering the information to begin with.
One of the most challenging aspects of science is the fact that the uncertainty associated with statements is often so large that validating the statement is almost as hard as gathering the information in the first place. (This is essentially the reproducibility crisis.) The same phenomenon applies to LSMs. In all of the cases above, it will be essential for the algorithm to provide highly detailed answers with citations, and with descriptions of the citations. Especially as the complexity of the task increases, for example when it comes to interpreting results, making inferences, or retrieving latent knowledge, it will become extremely hard to validate answers provided by algorithms. For example, ChatGPT will often provide convincing answers to latent knowledge questions, but actually validating those answers is roughly as difficult as coming up with the answers in the first place. In the business of science, where truth is usually almost indistinguishable from fiction, and where the difference between truth and fiction is measured in wasted years, machine learning models will be useless unless they can provide enough detail and enough traceability to enable validation.
As a cautionary example, I asked ChatGPT my question about purifying APEX-SpyCatcher. It provided a very generic but useful response, including a note that APEX is sensitive to oxidation and that I therefore need to purify it under reducing conditions, which I thought was surprisingly useful. I could then have gone off happily and done my experiment. But I decided instead to ask further why APEX-SpyCatcher is sensitive to oxidation, which I was not aware of. ChatGPT responded:
APEX-spycatcher is a protein that binds to the APE1 enzyme, which is involved in repairing oxidative damage to DNA. The APEX-spycatcher protein is sensitive to oxidation because it contains a cysteine residue that can be oxidized, which in turn affects the protein's ability to bind to the APE1 enzyme and perform its DNA repair function. This sensitivity to oxidation makes APEX-spycatcher a useful tool for measuring oxidative stress in cells and tissues.
At this point, it is clear that ChatGPT has not even understood the question I was asking, doesn’t know what spycatcher is, and is even confused about the protein it thinks I am talking about. (APEX is a protein used in proximity labeling, which is what I was asking about. There is also an enzyme called APE-1 which is sometimes referred to as APEX. The statement that “APEX-spycatcher is a protein that binds to APE-1” is not correct under any circumstances, as far as I know.)
The question was in fact ambiguous, but clearly we have some way to go.
Caveat 3: There is a sensitivity/specificity tradeoff in information-dense answers.
With regard to retrieving latent knowledge: one potential failure mode (which ChatGPT already exhibits) is that an LSM may simply provide a deluge of information, all of it correct and only some of it relevant. For example, when asked to provide a list of controls, an LSM could conceivably recommend controlling for things that technically need to be controlled but that are highly unlikely to affect the experimental conditions. Already, when I ask ChatGPT what factors I should control for in an expansion microscopy experiment, it gives me a list of six factors, all of which are correct, but most of which are banal and irrelevant.
I think this gets to a more fundamental issue in the scientific literature: the reason there is so much latent knowledge is that humans prioritize (attend to) specific pieces of information at specific times, and the information that was important to the author at the time a specific article was written may be different from the information that is important to you now. There is something like a classic sensitivity/specificity trade-off: the more incisive and to-the-point we are, the more we risk leaving out critical latent information. LSMs with good attention mechanisms will encounter the same problem: if you ask an LSM for a list of the controls you need in order to prove a given statement, it may indeed provide you with a list of 500 controls that would decisively prove the statement, and yet that would be far beyond what you are capable of performing. And yet, if you do not perform them all, the LSM could reasonably insist that you had not yet proven the statement. It will, in fact, be unclear whether to penalize LSMs for providing controls that are technically correct but that we think are extraneous. Do we want LSMs to think about science like we do? Or do we want them to think in a different way? This is, in fact, the core challenge of science: building models with imperfect information. It is possible that LSMs, with greater information processing power, might ultimately be able to prioritize even better than humans; but if so, they may have a hard time convincing humans that their conclusions are correct.
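One way to make the tradeoff measurable is to score a model's suggested controls with precision and recall against the (necessarily subjective) set a human expert considers relevant; the lists below are invented for illustration:

```python
def precision_recall(suggested, relevant):
    """Precision: fraction of suggested controls that matter.
    Recall: fraction of the controls that matter that were suggested."""
    suggested, relevant = set(suggested), set(relevant)
    tp = len(suggested & relevant)
    precision = tp / len(suggested) if suggested else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# A model that lists every technically valid control scores high recall but
# low precision; a terse model risks the opposite. The hard, unresolved part
# is that "relevant" is itself a judgment call by the human grader.
```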
A final optimistic point:
I will end by saying that I do think it is highly likely that machines can achieve superhuman performance in science. Even in the pessimistic case, a machine that achieves merely human-level performance on reasoning and planning, but superhuman retrieval simply by virtue of having read all the literature, would likely achieve significantly superhuman performance on many day-to-day scientific tasks. However, significant advances in machine learning may still be necessary, especially on the topic of data-efficient learning.
Some DALL-E Images
These are some images of "a machine learning algorithm doing science." I have no idea what they mean, but I am extremely excited to find out...