samrodriques

Questions to ask about building foundation models

I often hear proposals to build new foundation models for biology. Here is the list of questions I ask. I rarely get past question 1.

  1. What is the core task that the model performs/the core thing the model predicts? I.e., on a single iteration of the model, what is the input and what is the output?

    1. For LLMs, the answer is "you give it N tokens and it returns a probability distribution over token N+1."

    2. My experience is that 90% of proposals fail at this stage.
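To make the LLM answer concrete, here is a toy sketch of that input/output contract: tokens in, a probability distribution over the next token out. The bigram "model" is a hypothetical stand-in for a real network; only the shape of the interface matters.

```python
# Toy illustration of the core LLM task: a list of tokens in, a
# probability distribution over the next token out. The bigram
# "model" below is a hypothetical stand-in for a trained network.
from collections import Counter

CORPUS = "the cat sat on the mat the cat ate".split()

def next_token_distribution(context):
    """Given a list of tokens, return P(next token) as a dict."""
    last = context[-1]
    # Count which tokens follow `last` anywhere in the corpus.
    followers = Counter(
        CORPUS[i + 1]
        for i in range(len(CORPUS) - 1)
        if CORPUS[i] == last
    )
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}

dist = next_token_distribution(["the", "cat"])
# dist is a probability distribution over the next token:
# here {"sat": 0.5, "ate": 0.5}
```

A real model replaces the lookup with a neural network, but the question stands: if you cannot state the input and output of a single iteration this plainly, the proposal is not ready.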

  2. What is the loss metric on the core task?

    1. For LLMs, the answer is cross-entropy or perplexity
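The two are directly related: perplexity is the exponential of the cross-entropy. A minimal sketch:

```python
import math

def cross_entropy(probs_of_true_tokens):
    """Average negative log-probability the model assigned to the
    tokens that actually occurred."""
    return -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)

def perplexity(probs_of_true_tokens):
    """Perplexity is exp(cross-entropy)."""
    return math.exp(cross_entropy(probs_of_true_tokens))

# A model that assigns probability 0.25 to every true token has
# cross-entropy ln(4) nats and perplexity exactly 4.
ce = cross_entropy([0.25, 0.25, 0.25])
ppl = perplexity([0.25, 0.25, 0.25])
```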

  3. What emergent behavior do you think the model will have?

    1. For LLMs, the answer is "natural language processing." For a biology model, it might be "predicting enzyme functionality" or something similar.

  4. What are the evals you will use to evaluate whether this emergent behavior is emerging?

    1. For LLMs, e.g. Winogrande

  5. Is there evidence that performance on the evals increases with scale? How expensive is it to get that evidence?
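One relatively cheap way to get that evidence: train small models at several compute budgets and fit a power law, loss ≈ a·C^(−b), by linear regression in log-log space. The data points below are hypothetical; the point is the shape of the check, not the numbers.

```python
# Fit loss ≈ a * C**(-b) to small-scale runs by regressing
# log(loss) on log(compute). Data points are hypothetical.
import math

compute = [1e15, 1e16, 1e17, 1e18]   # training FLOPs per run
loss    = [3.2, 2.6, 2.1, 1.7]       # measured eval loss per run

xs = [math.log(c) for c in compute]
ys = [math.log(l) for l in loss]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
# slope is the fitted -b. A clearly negative slope means performance
# improves predictably with scale, which is the evidence you want
# before committing to a large training run.
```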

  6. Questions about datasets.

    1. Do they exist?

    2. How many tokens?

    3. Are they high quality? How does the person know if they are high quality?
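Even a back-of-envelope answer to the dataset questions is worth having. A sketch of token counting plus a crude quality filter follows; the whitespace tokenization and the repetition heuristic are placeholder assumptions, and real pipelines use a trained tokenizer plus much richer quality signals (deduplication, classifiers, domain heuristics).

```python
# Back-of-envelope dataset sizing: count tokens and apply a crude
# quality filter. Whitespace "tokenization" and the repetition
# threshold below are placeholder assumptions for illustration.
docs = [
    "The enzyme catalyzes the hydrolysis of the substrate.",
    "click here click here click here",
    "Buffer was exchanged by dialysis overnight at 4 C.",
]

def n_tokens(doc):
    return len(doc.split())

def looks_high_quality(doc):
    words = doc.lower().split()
    # Crude heuristic: flag documents dominated by repeated words.
    return len(set(words)) / len(words) > 0.5

total_tokens = sum(n_tokens(d) for d in docs)
kept = [d for d in docs if looks_high_quality(d)]
kept_tokens = sum(n_tokens(d) for d in kept)
# Here the spammy middle document is filtered out, and the token
# count after filtering is what actually matters for training.
```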

  7. More technical questions

    1. Tokenization

    2. What is the scaling method?

    3. Architecture questions. Dense? Mixture of experts? Multiple read heads? etc.
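The dense-versus-mixture-of-experts question comes down to simple parameter arithmetic: an MoE layer multiplies stored FFN parameters by the expert count but only activates the top-k experts per token. The dimensions below are illustrative, not a recommendation.

```python
# Illustrative dense-vs-MoE arithmetic. All numbers are made up.
d_model, d_ff = 4096, 16384
ffn_params = 2 * d_model * d_ff        # up- and down-projection weights

n_experts, top_k = 8, 2
moe_total  = n_experts * ffn_params    # parameters stored
moe_active = top_k * ffn_params        # parameters used per token
# This MoE stores 8x the FFN weights of a dense layer but spends only
# 2x the per-token compute, which changes the scaling trade-offs and
# the memory/throughput profile of the whole run.
```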

