Tasks and Benchmarks for an AI Scientist - GPT-4 Update
About a month ago, I posted some “real-world” science questions on my blog, which I used to evaluate the performance of ChatGPT. (See here.) Since GPT-4 was released yesterday, here is the update. Every question was asked in a new chat, exactly once (except for the minibinders question, which it misunderstood). I made sure I was using the GPT-4 model each time.
Surprisingly, the overall verdict is that on this evaluation, ChatGPT-4 performed at about parity with the original ChatGPT (scores below). It is probably slightly better, although the evaluation set is not large enough to measure small differences reliably. Nonetheless, both failed on the majority of the questions. I think it is possible that RLHF is suppressing the potential of the underlying model, and that prompt engineering or more work with the underlying model may yield big improvements. These results should be considered very preliminary. Still, I am providing them as-is, for direct comparison to the original ChatGPT.
In addition, it’s worth noting that these questions are hard to grade, and I’m aware that ChatGPT-4 doesn’t know how much detail I’m looking for. I’m working on a new benchmark with categorical answers that can be graded unambiguously. That said, many of the answers it provides are straight-up incorrect or incomplete.
SCORES (Incorrect / Partial / Correct / Too hard to grade):
ChatGPT (original): 18 / 4 / 2 / 4
ChatGPT-4: 14 / 3 / 7 / 4
Easy Retrieval questions:
“What protein is mScarlet derived from?”
It is based on mCherry.
🤖 ChatGPT-4 answers that mScarlet is derived from mRFP1, which is not correct. ❌
“Has anyone ever tried to use gap filling on padlock probes to sequence barcodes in an in-situ-sequencing experiment?”
The answer is yes, in BARISTA-seq.
🤖 ChatGPT-4 was unaware of any examples. ❌
“Bacteria are growing in the waste stream I use for Qiagen buffers. Can I bleach it?”
The answer is no, absolutely not: adding bleach to Qiagen buffers (which contain guanidine salts) will produce cyanide and/or chlorine gas 💀💀💀. 🤖 ChatGPT-4 says yes, you can. ❌
“Why don’t most culture microscopes have a far red filter set?”
The answer is because culture microscopes are usually used for direct ocular observation, and human eyes are not sensitive to far red channels.
🤖 As before, ChatGPT-4 gives a variety of answers, most of which are explicitly wrong, such as some nonsense about far red fluorophores emitting lower energy photons (true) which leads to weaker signal (wtf?). ❌
“Why do people use MMLV rather than lentivirus for lineage tracing experiments in the brain?”
The answer is that MMLV will only integrate into actively dividing cells, whereas lentivirus can also integrate into non-dividing cells, and for lineage tracing experiments one usually wants to infect only actively dividing progenitors.
🤖 ChatGPT-4 provides a variety of answers which are mostly incorrect, for example saying that MMLV-based vectors are easier to generate (I don’t think that’s true); that they are less immunogenic (basically irrelevant); and that MMLVs can stably integrate into the genome (true, but also true of lentiviruses). It does not provide the correct answer. Note that, in other evaluation attempts, ChatGPT-4 did indeed provide the correct answer (among other incorrect answers), but it failed on the grading run. ❌
“Why do people” questions:
“Why do people use iodixanol rather than sucrose to create density gradients for purifying AAVs?”
There are many possible answers, but I think the key one is that iodixanol forms its own density gradients, whereas sucrose needs to be layered, so it is less work to use iodixanol.
🤖 ChatGPT-4 provides five answers, including the correct answer but also including several other answers. It’s hard to evaluate the other answers without citations, but I’ll give it credit for this one. ✅
“Why don’t neuroscientists ever use Cre recombinase to control gene expression off of a Rabies virus for circuit tracing?”
The answer is that Cre is a DNA recombinase and rabies is an RNA virus.
🤖 ChatGPT-4 prefaces its answer by saying that it is not a neuroscientist, and then provides four incorrect answers, such as claiming that it would require integrating LoxP sites into the genome of the neurons, which could be difficult. (This is incorrect: if the gene is being expressed off of the Rabies, it would require integrating LoxP sites into the Rabies genome, not the neuron genome.) In testing, I found that there are certainly alternative ways to phrase this question that will elicit the correct response, but it’s incorrect as phrased. ❌
“When doing a pooled CRISPR screen, why do people use lentiviral libraries rather than AAV libraries?”
The answer is that you need to be able to expand the cell population after infection, and lentivirus is integrating whereas AAV is not.
🤖 As usual, ChatGPT-4 provides a mix of correct and incorrect answers. It says that it’s easier to fit the CRISPR machinery into a lentivirus, which I’ll give it credit for. It also mentions that the AAV will be lost over time if it does not integrate, which I’ll also give it credit for. But it provides a variety of wrong answers too: it says that AAVs primarily transduce non-dividing cells (wrong), and that it is easier to produce lenti at higher titer than AAV (wrong). Partial credit. 🔶
“Why don’t people use antibody-oligo conjugates more for multiplexed antibody staining?”
Anyone who has ever touched an antibody-oligo conjugate knows that they have terrible off-target effects, and that this is the primary limitation.
🤖 ChatGPT-4 mentions “cost,” “complexity,” “availability,” “sensitivity,” “validation,” and “limited awareness.” This answer is actually worse than the answer I got from original ChatGPT, because at least original ChatGPT mentioned non-specific binding. Unimpressed. ❌
“In practice” questions:
“How well does expansion microscopy work in practice, and what are the biggest challenges?”
A good answer here should mention that the original expansion microscopy protocols are actually pretty straightforward, but they require some skilled manual handling of samples. In addition, more recent protocols are very long, and it is difficult to obtain high-quality sodium acrylate.
🤖 ChatGPT-4’s answer is remarkably bad. It does say that ExM requires proper manual handling, which is hard, but it also says, among other things, that the sample becomes more transparent, which can lead to increased light scattering (this is the opposite of what happens). ❌
“How hard is it in practice to make minibinders?”
I had to ask this question twice. The first time, it thought I was asking about arts and crafts. I rephrased the question to ask: “How hard is it in practice to make minibinders using computational de novo protein design?”
I would expect an answer here to mention the fact that you have to do yeast display, and usually get between 10 and 100 good binders out of a library of maybe 10,000-100,000.
🤖 ChatGPT-4’s answer is vacuous. It does not include any details about the technical challenges in the process -- it just says you have to understand the protein protein interactions and ensure the minibinder is soluble and such. ❌
“How hard is it in practice to create an AAV that is specific for a particular cell type? What is the hardest part in practice?”
I would expect an answer here to mention that the hardest part is either producing the viral library or actually conducting the rounds of selection.
🤖 The answer here is actually good. It says you need to identify cell surface markers, you need to engineer the capsid, and you need to trade off specificity versus transduction efficiency. I’ll actually give it credit for this one. ✅
De novo protocol generation:
“I have HEK cells growing in a 10 centimeter dish and I need to passage them. What is the first step in the protocol?”
The answer is to remove the old media.
🤖 ChatGPT-4 gets it correct. ✅
“I am doing a western blot and have just incubated the membrane in my primary antibody. What are the next three steps of the protocol?”
You need to wash, add the secondary antibody, and wash again.
🤖 ChatGPT-4 gets it correct! ✅
“I am doing hybridization chain reaction and have just finished washing off the probes. I am ready to start the amplification step. What is the first step in the protocol?”
I would accept either washing the tissue in amplification buffer or snapcooling the hairpins.
🤖 Instead, ChatGPT-4 says that the next step is to prepare the initiator strands, which is not correct. It is kind of close though -- the second step it suggests is snapcooling. Clearly, ChatGPT-4 at least appears to know what HCR is, which the original ChatGPT did not. But no partial credit for protocols that won’t work. ❌
“I am doing Slide-seq and have just melted the tissue slice onto my array. What is the next step in the protocol?”
Note that Slide-seq was published in 2019 and is thus technically within the training domain of ChatGPT-4. The next step is to do a permeabilization wash and then reverse transcription.
🤖 ChatGPT-4 says that the next step is to fix the tissue, which is incorrect. It then says you need to permeabilize and then do reverse transcription with barcoded RT primers that bind to the poly(T) sequences on the beads. It’s very confused, just mashing up related sequencing protocols. ❌
“I am doing a single cell RNA sequencing experiment. What should my RT primer look like?”
Your RT primer generally needs to contain a PCR handle, a UMI, a cell barcode, and a poly(T) sequence, although there are variations in which the UMI or cell barcode may be supplied at a different step.
🤖 ChatGPT-4 says you need a poly(T), a UMI, a cell barcode, and a primer. Correct! ✅
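For concreteness, the primer structure described above can be sketched in a few lines of code. The segment lengths and the handle sequence here are hypothetical, chosen for illustration; real designs (Drop-seq, 10x, etc.) differ in exact lengths and sequences, and as noted, the UMI or barcode may be supplied at a different step in some protocols.

```python
import random

random.seed(0)

# Hypothetical values for illustration only; real kits and protocols vary.
PCR_HANDLE = "AAGCAGTGGTATCAACGCAGAGT"  # example SMART-style handle, 23 nt
CELL_BARCODE_LEN = 12
UMI_LEN = 8
POLY_T_LEN = 30

def random_seq(n):
    return "".join(random.choice("ACGT") for _ in range(n))

def make_rt_primer():
    """Assemble one RT primer: PCR handle + cell barcode + UMI + poly(T)."""
    return PCR_HANDLE + random_seq(CELL_BARCODE_LEN) + random_seq(UMI_LEN) + "T" * POLY_T_LEN

primer = make_rt_primer()
print(len(primer))  # 23 + 12 + 8 + 30 = 73
```

Note that in a bead-based protocol, all primers on one bead share the same cell barcode while the UMI varies from molecule to molecule; the sketch above just shows the segment order.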
“I need to purify APEX-SpyCatcher. Anything I should be aware of?”
You should make sure not to use StrepTag for affinity purification. APEX is usually used to biotinylate proteins, and if you use StrepTag for purification you will be unable to separate the strep-tagged APEX from biotinylated proteins. You should also be aware that APEX is known to multimerize at high concentrations, which leads to a loss of activity, so when purifying, you need to selectively elute the low molecular weight fraction.
🤖 ChatGPT-4 provides general advice about protein purification but nothing specific to APEX-Spycatcher. I think this question is very hard and possibly not very fair, because it probably doesn’t know how much detail I’m looking for. But nonetheless, no credit. ❌
"I am preparing to do a miniprep for the first time. What are the most common ways people mess up minipreps, and what do I need to watch out for?”
Everyone, at some point in their lives, forgets to add ethanol to the miniprep column wash buffers. This is a mistake everyone makes exactly once.
🤖 ChatGPT-4 misses that example, and provides a number of other examples which are mostly extraneous: incomplete cell lysis, overloading the column, etc. It also mentions that elution efficiency can be significantly increased by pre-warming the elution buffer, which is interesting if true. I will give it partial credit, but it did not get the primary answer to the question of how people most commonly mess up minipreps. 🔶
“I have a new AAV variant and I want to compare its infectivity to the infectivity of AAV2 in a specific cell type. I am preparing the experiment now. What do I need to control for?”
Anyone who does these experiments realizes that it is very difficult to ensure that two distinct viruses have comparable titers (functional and physical), because different batches of the same AAV serotype can have different ratios of functional to physical titer. The best answer would probably be to measure their physical titers and normalize them.
🤖 ChatGPT-4’s answer is very similar to the answer provided by ChatGPT. It mentions AAV dose, and then provides several extraneous and irrelevant factors, like cell type, infection conditions, etc., which are unimportant because presumably you would test the two viruses in the same cells simultaneously. Partial credit. 🔶
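The normalization I describe above amounts to simple dose arithmetic: measure each virus's physical titer, then dilute so both deliver the same number of genome copies. The titers and dose below are made-up numbers purely for illustration.

```python
# Hypothetical physical titers (genome copies per mL), e.g. measured by qPCR.
titers_gc_per_ml = {"AAV2": 5e12, "new_variant": 1.25e12}

target_dose_gc = 1e10  # equal genome-copy dose per well for both viruses

# Volume (uL) of each prep needed to deliver the target dose.
volumes_ul = {name: target_dose_gc / titer * 1e3
              for name, titer in titers_gc_per_ml.items()}

for name, vol in volumes_ul.items():
    print(f"{name}: add {vol:.2f} uL per well")
# AAV2: add 2.00 uL per well
# new_variant: add 8.00 uL per well
```

This equalizes only the physical dose; as noted, the functional-to-physical ratio can still differ between preps, which is exactly why the comparison is hard.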
“I am trying to test a new transfection method. My experimental condition is to transfect cells with a plasmid that contains luciferase. I will then measure the amount of luciferase produced. What controls do I need?”
You need a positive control for the luciferase assay, which involves transfecting cells with a luciferase plasmid using an established protocol; and a negative control, which involves not transfecting the cells at all. It would also make sense to transfect the cells using your new method with a dummy plasmid, like pUC19, in case your new transfection method somehow generates background in the luciferase. You may also want to do a live/dead assay and normalize the luciferase by the number of viable cells.
🤖 ChatGPT-4 gets the correct positive and negative controls, and also includes some controls that are unnecessary like a transfection efficiency control, but it basically gets it right. This was actually impressive. ✅
“I am exploring differential gene expression between two conditions. I have six replicates for the experimental condition and six replicates for a control. I have a number of statistically significant hits, but I want to be sure that my statistical test is working. What should I do?”
The answer is to just compare some of the controls to each other using the statistical test, and check that the false positive rate is as you expect.
🤖 Like the original ChatGPT, ChatGPT-4 suggests a lot of random stuff: normalize the data, check the data quality, etc. It does include a few useful things, like applying a Benjamini-Hochberg correction. But the question was about how to validate the statistical test, and it does not answer that question. ❌
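The control-vs-control check I had in mind can be sketched as follows. This is a minimal illustration on simulated data, using a per-gene t-test as a stand-in for whatever DE test the real pipeline uses: split the six control replicates 3-vs-3 in every possible way, run the test, and check that roughly 5% of genes come up at p < 0.05 when there is no true signal.

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated control data: 1000 genes x 6 replicates, no true signal.
n_genes, n_reps = 1000, 6
controls = rng.normal(loc=10.0, scale=1.0, size=(n_genes, n_reps))

# Null comparisons: every 3-vs-3 split of the six control replicates.
# If the test is calibrated, ~5% of genes should hit p < 0.05 per split.
fprs = []
for group_a in itertools.combinations(range(n_reps), 3):
    group_b = [i for i in range(n_reps) if i not in group_a]
    _, pvals = stats.ttest_ind(controls[:, list(group_a)],
                               controls[:, group_b], axis=1)
    fprs.append(np.mean(pvals < 0.05))

print(f"mean false positive rate across splits: {np.mean(fprs):.3f}")
# should print a value near 0.05
```

If the observed rate is far from the nominal alpha, something in the pipeline (normalization, variance estimation, the test itself) is miscalibrated.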
“I am going to fuse GFP to the surface of my AAV at residue 530. Do you think this will affect the titer of the virus during production?”
The answer is definitely yes.
🤖 Same as the original ChatGPT, ChatGPT-4’s answer is generic and useless. ❌
“I am going to try to infect cells with AAV2 in the presence of 1mM heparin sulfate. What will happen?”
HSPG (heparan sulfate proteoglycan) is a primary receptor for AAV2. Adding a lot of soluble heparin to the media might be expected to compete for capsid binding and reduce infectivity of AAV2 for the target cells; but some literature suggests it can also increase infectivity in some cases.
🤖 ChatGPT-4 provides a really good answer, actually. It says that HSPG is the primary receptor, and the heparin in solution may compete with the AAV2 binding to the cell. It expects reduced infectivity. Way to go! ✅
“Cells are dying when we electroporate with 600ng/uL of RNA outside the cell, but not when we use 450ng/uL RNA. Why might that be?”
I don’t know the answer to this question.
🤖 ChatGPT-4 provides a bunch of interesting suggestions that are more useful than the responses provided by the original ChatGPT. For example, it mentions a potential immune response and potential osmotic effects. But it is not really adding any value: we thought of those already, and it’s hard to evaluate without citations. Question mark here. ❓
“I am working with a new highly stable fluorescent protein, and I noticed that it doesn’t get denatured in high concentrations of guanidinium chloride, unlike GFP. However, it does get denatured in high concentrations of guanidinium isothiocyanate. Any idea why?”
The answer to this question is actually known, and Elicit was able to find it: the isothiocyanate salt cooperates with the guanidinium in denaturing the backbone.
🤖 ChatGPT-4 provides a generic response, which is about as good as you could expect it to do without literature access. It’s very hard to evaluate this. ❓
“I need to come up with a way to detect cancer cell clones when there are only 10,000 of them in the body. Can you suggest any ideas?”
🤖 As with the original ChatGPT, ChatGPT-4 suggests some generic things. They are slightly more interesting than what ChatGPT originally suggested, such as looking at immune response (number of T cells, etc.) or using SHERLOCK. It is too generic to grade, but I have the impression the answer is slightly better. ❓
“I need to find a way to get AAVs to cross the blood-brain barrier. Can you suggest any ideas?”
🤖 Same as above: a bunch of generic approaches, nothing interesting. I bet I could get something more interesting out of it with some prompt engineering, but for the sake of comparison I am leaving it the same here. ❓