Acknowledgments: Cameron Roots, Adam Marblestone, Tony Kulesa, and Logan Graham.
Background:
Machine learning has extraordinary potential to accelerate the pace of research in the biological sciences. Already, companies such as Recursion Pharmaceuticals and Insitro have adopted machine learning for later-stage research, such as drug development. However, the impact of machine learning on early-stage, 0-to-1 invention and discovery research has been small by comparison. Examples of invention and discovery research include the original discovery of CRISPR and the efforts to characterize it for genome editing; the basic work on lipid nanoparticles and RNA transfer into cells that enabled the COVID vaccines [1]; and recent efforts to build the first brain-computer interface to enable walking in people with spinal injuries [2]. Early-stage research is usually carried out by universities or by newer models like Focused Research Organizations [3]; companies are less likely to invest in it because the returns on investment are uncertain. However, late-stage research is built upon early-stage research: if we want to accelerate the pace of biology research as a whole, it is essential that we figure out how to accelerate early-stage research.
It is challenging to apply machine learning to early-stage research, in large part due to the diversity of experimental workflows that early-stage research requires. Concretely, it is relatively simple to apply machine learning to any individual data-analysis or experimental workflow, which is what many large drug development companies do. However, machine learning is very difficult or impossible to apply productively when researchers are running new assays and developing new workflows every few weeks or months, as happens at the earliest stages of research. Other challenges include early-stage researchers' lack of access to high-powered compute; the lack of standardization in the datasets that early-stage researchers collect; and the fact that, outside of educational institutions, there are few organizations today where AI researchers and biology researchers can work together on early-stage research. Going forward, however, we expect that new developments such as large language models will dramatically change the prospects for accelerating early-stage biology research using AI and machine learning. In particular, we expect:
New semi-autonomous AI research assistants (“agents”) will become capable of executing some of the tasks that human researchers do in the course of their work, increasing the productivity of the entire early-stage biological research sector. In the past decade, reinforcement-learning techniques have enabled AI agents to perform remarkable feats, such as beating computer games or the board game Go, but these feats have been restricted to domains with relatively well-defined objective functions. More recently, agents that use language models as their core controller have been shown to successfully execute more nebulous tasks (described in natural language), including in science [4]. For example, recent experiments have shown that when AIs are provided with access to various “tools” in the form of APIs, they can execute complex behaviors, such as designing a molecule for synthesis based on a high-level prompt, ordering the reagents needed for synthesis, and then synthesizing the molecule [5]. As language models become more sophisticated, we anticipate that these language-based agents will become capable of ever more sophisticated tasks, eventually taking some of the time-consuming work of research off the backs of scientists. The proliferation of these AI research assistants stands to dramatically increase the productivity of the early-stage biology research sector within 5-10 years. A minimal sketch of the underlying tool-use loop follows.
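The sketch below illustrates the tool-use pattern described above: a language model acts as the controller, deciding at each step whether to call a tool or finish. Everything here is an assumption for illustration; `call_llm` is a hypothetical stand-in for any chat-completion API, and the two tools are placeholders rather than real vendor integrations.

```python
import json

def order_reagent(name: str) -> str:
    """Illustrative tool: a real version would call a vendor's ordering API."""
    return f"Order placed for {name}."

def lookup_protocol(assay: str) -> str:
    """Illustrative tool: a real version would search a protocol database."""
    return f"Standard protocol retrieved for {assay}."

TOOLS = {"order_reagent": order_reagent, "lookup_protocol": lookup_protocol}

def call_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in: replace with any chat-completion client."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content":
            'You are a lab assistant. To use a tool, reply with JSON '
            '{"tool": <name>, "arg": <string>}; available tools: '
            + ", ".join(TOOLS) + '. Reply {"done": <answer>} when finished.'},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        action = json.loads(reply)
        if "done" in action:
            return action["done"]
        result = TOOLS[action["tool"]](action["arg"])  # dispatch the tool call
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Stopped: step limit reached."
```

The key design point is that the language model never executes anything directly; it only emits structured requests, which the surrounding harness validates and dispatches, keeping a human-auditable record of every action.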
Robotic laboratory automation will enable closed-loop experimentation, greatly increasing the throughput of hypothesis testing. Today, researchers are limited by the number of hours they can spend in the lab and the number of experiments they can hold in their heads at once. Robotic laboratory automation, especially when connected to AI research assistants, will enable massively parallel experimentation and rapid iteration, and will likely be more traceable and reproducible than experiments run manually [6]. We have already begun to see the development of specialized closed-loop robotic systems for protein engineering and chemistry. The development of more general-purpose laboratory robotics will accelerate this trend, and the integration of language models with computer vision and robotic embodiment will likely also enhance the reasoning capabilities of AI research assistants and allow them to execute some experiments semi-autonomously.
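As a toy illustration of the closed-loop pattern, the sketch below runs a simple design-test-learn cycle over candidate reaction temperatures. The `run_on_robot` function is a hypothetical placeholder for a real instrument interface, and the selection policy is deliberately naive; a production system would use Bayesian optimization or active learning to choose the next experiment.

```python
import random

def run_on_robot(temperature_c: float) -> float:
    """Hypothetical robot interface: run one experiment, return measured yield."""
    return -((temperature_c - 37.0) ** 2) + random.gauss(0.0, 1.0)

def mean_yield(results: list[float]) -> float:
    return sum(results) / len(results)

def closed_loop(candidates: list[float], budget: int = 20) -> float:
    """Explore each condition once, then repeatedly re-test the current best."""
    observations = {c: [] for c in candidates}
    for step in range(budget):
        if step < len(candidates):
            choice = candidates[step]  # exploration phase: try every condition
        else:
            choice = max(observations, key=lambda c: mean_yield(observations[c]))
        observations[choice].append(run_on_robot(choice))
    return max(observations, key=lambda c: mean_yield(observations[c]))

print("Best temperature:", closed_loop([25.0, 30.0, 37.0, 42.0]), "°C")
```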
Biological foundation models will increase the value of large, high-quality datasets. Biology is full of complex abstractions (cells, organisms) that consist of many independently observable components (RNA, DNA, proteins, metabolites, etc.). Presently, these observations are largely independent: it is not possible, for example, to predict the protein content of a cell from its RNA content, or to predict how adding a given small molecule to a cell will change its gene expression. Moreover, the relevant datasets are usually collected independently, and datasets gathered by early-stage researchers often lack interoperability due to differences in access protocols, formats, or preprocessing. Going forward, we expect that language models and language-based tools will greatly facilitate combining data across multiple datasets, for example by automatically identifying datasets and processing the documentation describing how they were gathered and stored [7]. We also expect that multimodal models trained on these joint datasets will provide valuable information to inform many different experiments and hypotheses, and will make it possible to interpolate between these abstraction layers, greatly increasing our ability to understand and cure disease states [8].
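To make the data-joining problem concrete, the sketch below pairs an RNA-seq table with a proteomics table on a shared sample identifier; this is the kind of harmonization step a language-based tool could automate from the datasets' documentation. All file names, column names, and the identifier-cleaning rules are assumptions for illustration.

```python
import pandas as pd

# Hypothetical inputs from two independently collected datasets.
rna = pd.read_csv("rna_counts.csv")          # columns: sample_id, gene, count
protein = pd.read_csv("protein_abund.csv")   # columns: sample_id, protein, abundance

# Harmonize identifiers (a real pipeline would derive these rules from the
# datasets' documented metadata, possibly extracted by a language model).
rna["sample_id"] = rna["sample_id"].str.upper().str.strip()
protein["sample_id"] = protein["sample_id"].str.upper().str.strip()

# Pivot to one row per sample, then join the two modalities so that each row
# holds paired RNA and protein observations for the same sample.
rna_wide = rna.pivot(index="sample_id", columns="gene", values="count")
protein_wide = protein.pivot(index="sample_id", columns="protein", values="abundance")
paired = rna_wide.join(protein_wide, how="inner", lsuffix="_rna", rsuffix="_prot")

print(f"{len(paired)} samples with both RNA and protein measurements")
```

Paired tables like this are exactly what a multimodal foundation model needs in order to learn mappings between abstraction layers, such as predicting protein content from RNA content.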
Recommendations:
Private investment will not be sufficient to drive the development of machine learning-powered technologies for early-stage research. In order to support the development of AI in early-stage biology research, the US must invest in the following core areas:
Integrated research environments: Today, there are few if any environments that enable biology researchers and AI researchers to work together on early-stage research. Most invention and discovery research in biology takes place in universities or university-adjacent research institutes, which cannot pay the salaries necessary to attract top AI talent and which lack sufficient computational capacity for most cutting-edge AI projects. Although companies can and often do employ both biology researchers and AI researchers, they typically focus on later-stage research. The US should sponsor dedicated non-academic research organizations and institutes that are capable of hiring top talent in both AI and biology research, and that enable those researchers to work together on early-stage research concepts.
Publicly available scientific literature: The development of AI technologies for science depends critically on access to the full-text scientific literature, which is currently controlled by private companies. The publicly funded scientific literature must be made available to researchers under reasonable licensing conditions, to enable the creation of AI research assistants that are knowledgeable about science and that benefit the public good. Moreover, efforts must be made to standardize the format of public datasets and of the scientific literature, and to scale up quality control and fraud detection efforts.
Public-sector computational resources and foundation models for biology: Many of the most promising techniques for adapting foundation models today, such as RLHF, cannot be applied without access to the underlying weights of the model, and private-sector entities are usually unwilling to provide access to the weights of their trained models. Thus, if we want to accelerate the adoption of cutting-edge machine learning in early-stage biology research, the public sector and charitable partners must invest in dedicated compute clusters consisting of thousands to tens of thousands of GPUs, and in foundation models that can be made available to the scientific research community. The distribution of these models should still be controlled for safety and security reasons, but they should be made available to researchers and trusted research entities, perhaps on a project basis as with existing national-lab computing resources. Cutting-edge models must be trained on language as well as on specific biological data types, such as protein sequences, genomics, and microscopy. The sketch below illustrates why weight access matters in practice.
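The following sketch attaches LoRA adapters (via the peft library) to an openly released checkpoint; this kind of parameter-efficient fine-tuning is possible only with direct access to the weights, not through an inference-only API. The model name here is a placeholder, not a real checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder identifier: substitute any openly released checkpoint.
model_name = "an-open-biology-llm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach low-rank adapters to the attention projections; only these small
# matrices are trained, so domain adaptation is feasible on modest hardware.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

None of this is possible against a weights-withheld API, which is precisely why public-sector models with controlled but genuine weight access would unlock techniques that closed commercial offerings cannot.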
APIs to public-sector tools: In addition to foundation models, the US must invest in building technologies (e.g. APIs) that enable AI research assistants to access and interact with public resources, such as GenBank and BLAST. These tools will be foundational for AI-assisted biology research, and will require government investment to deploy.
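As a sketch of what such tooling might look like from an agent's perspective, the functions below wrap Biopython's Entrez and BLAST clients as callable tools of the kind an AI research assistant could dispatch. The accession numbers and email are placeholders, and a production deployment would add rate limiting, caching, and access controls.

```python
from Bio import Entrez
from Bio.Blast import NCBIWWW, NCBIXML

Entrez.email = "researcher@example.org"  # NCBI requires a contact email

def fetch_genbank_record(accession: str) -> str:
    """Tool: retrieve a GenBank record as plain text via NCBI E-utilities."""
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="gb", retmode="text")
    record = handle.read()
    handle.close()
    return record

def blast_sequence(sequence: str, top_n: int = 3) -> list[str]:
    """Tool: BLAST a nucleotide sequence against the nt database."""
    handle = NCBIWWW.qblast("blastn", "nt", sequence)
    hits = NCBIXML.read(handle)
    return [alignment.title for alignment in hits.alignments[:top_n]]
```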
Robotics: Current robotic technologies are woefully inadequate for performing most tasks in the wet lab [9]. Moreover, laboratory automation is a small market for robotics companies compared to mass-market applications such as semiconductor fabrication and automotive manufacturing, so there is relatively little incentive to apply cutting-edge robotics technologies to laboratory automation. The US should invest in partnerships with private-sector entities to develop self-driving lab technologies that can execute entire biology and chemistry workflows end to end.
Biosafety and biosecurity: Finally, the US must develop a comprehensive framework for managing biosafety and biosecurity risks while allowing research into the applications of AI in biology to proceed. There are very concrete risks associated with emerging AI technologies in biology that must be managed, and those risks are particularly salient in the wake of the COVID pandemic. If concerns about biosafety and biosecurity are not actively addressed, they may shut down efforts to accelerate biology research in the US and lead to a slowdown similar to the one prompted by concerns over recombinant DNA. It is thus essential to develop a clear framework now for evaluating and managing risks, and that process must be led by subject-matter experts in the relevant biology and AI communities. The government should invest in technologies that enable screening and control of potential biorisks at the point of physical synthesis; should develop a set of guidelines for the AI-for-biology research community; and should invest in studying mitigations and defensive technologies.
With sufficient investment in these areas, the US can establish itself as the clear leader in the application of AI to early-stage research, which will further cement the dominance of the US biomedical research community and industry.
[4] See, for example, https://lilianweng.github.io/posts/2023-06-23-agent/
[7] https://www.biorxiv.org/content/10.1101/2023.06.14.544984v1.abstract, https://arxiv.org/abs/2304.01420
[8] Thanks to Adam Marblestone and Tony Kulesa for their contributions to this bullet point.