July 15, 2024

Evaluating Generative AI tools in the life sciences

Reaching the AI tool evaluation stage

Once a life sciences organization has figured out how they want to leverage Generative AI, their next challenge is determining how they want to implement it. For most life sciences organizations, this ends up being a question of choosing the right vendor.

With the myriad AI vendors that have sprung up amid the hype surrounding Generative AI, evaluating options can be time-consuming, and it is often unclear which vendor is best suited to support an organization’s needs.

A key difference between evaluating Generative AI and traditional software

Inevitably, many organizations ask for a demo of some sort in the course of evaluating AI tools. And while demos can be helpful, they typically give potential buyers only a general sense of what a tool does; they are far less useful for conveying precisely how that tool will perform in practice. This difference is especially pronounced with Generative AI because the technology it is built on is fundamentally different from that underlying traditional software tools.

Traditional software tools are generally deterministic programs: they are designed to follow instructions the exact same way every single time. Generative AI tools, on the other hand, are often characterized as probabilistic programs: they are designed to produce different outputs even when given the same input. As weird as this sounds, for various technical reasons, that is a big part of the magic that makes Generative AI systems like ChatGPT work. But it is also why Generative AI tools like ChatGPT might gloss over key information, make up information, or return slightly different results when you ask the same question multiple times. Try it out for yourself on ChatGPT.
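
To make the distinction concrete, here is a minimal, purely illustrative Python sketch - not how any real Generative AI product is built. The deterministic function returns the same output for the same input every time, while the toy "probabilistic" generator samples from a list of candidate outputs, so repeated calls with the same prompt can differ:

```python
import random

def deterministic_lookup(drug_name: str) -> str:
    """Traditional software: the same input always yields the same output."""
    approved_indications = {"adalimumab": "rheumatoid arthritis"}
    return approved_indications.get(drug_name, "not found")

def probabilistic_summary(prompt: str) -> str:
    """Toy stand-in for a Generative AI model: the output is sampled,
    so repeated calls with the same prompt can return different text."""
    candidate_phrasings = [
        "The study met its primary endpoint.",
        "The trial achieved its primary efficacy objective.",
        "Primary endpoint results were positive.",
    ]
    return random.choice(candidate_phrasings)

# Same input, same output every time:
print(deterministic_lookup("adalimumab"))
print(deterministic_lookup("adalimumab"))

# Same prompt, potentially different outputs each time:
print(probabilistic_summary("Summarize the trial outcome."))
print(probabilistic_summary("Summarize the trial outcome."))
```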

This difference in technology makes the task of evaluating Generative AI demos a little trickier. Questions that naturally emerge when we consider how Generative AI differs from traditional software are often along the lines of:

  1. How well will this product work with my data?

  2. How consistently will the product work?

  3. Where does a product that does not do the exact same thing every time fit into our workflow?

For life sciences organizations considering an investment of time and resources into AI, these are critical questions: they significantly influence the choice of Generative AI products or partners, and even whether to adopt Generative AI in the first place.

The problem is that these questions are difficult to answer without exposing an organization’s proprietary data to the product and committing to it for at least some amount of time to understand how it works and where it fits in the organization. And many organizations don’t feel comfortable making that commitment until they have answers to those questions. The result is a chicken-and-egg situation that slows down AI adoption overall, and even the organizations that make it through are often disappointed that AI tools don’t meet expectations.

How to evaluate AI tools

The most useful thing to bring into an evaluation of a potential AI tool is a decent intuition for the limitations of Generative AI (the limitations briefly mentioned above). You can check out an introductory article on the topic written by some of our colleagues over at IBM, and for those who want to dive into the topic more rigorously, a well-done Nature publication can serve as a helpful starting point.

Beyond this, going into an evaluation with an eye towards answering questions in two main categories can be helpful:

  1. How exactly does this product change the workflow? This matters because Generative AI tools have the potential to augment a workflow as much as they have the potential to speed it up - often both are true. Because AI tools have a probabilistic element, they won’t always neatly replace a discrete step in an organization’s workflow; instead, they often reshape workflows in ways that end users accustomed to traditional software tools might not expect. Even a concept like “document drafting” takes some effort to explain - what does it mean to have a system that drafts documents? When you engage with a first draft, are you editing it manually, regenerating sections with AI, or doing something else entirely?

  2. What must the product do well to accelerate that changed workflow? This is the flip side of the same coin. Is the primary job of the tool to move information from one file to another? Is it to summarize something? Is it to complete some analytical work or draw conclusions? Where efficiencies are achieved in the workflow depends heavily on the answers to these questions. It might seem like every AI vendor’s product does all of these things, but that is often not the case. And, often, certain parts of the process are more important to get right than others, or are at least prerequisites to others. Understanding the steps the AI follows, and which of those steps different AI systems do well, is key to choosing the right AI vendor: instead of broadly watching how AI workflows appear in demos and hoping the real product behaves the same way, teams can look critically at the demo and ask how the AI accomplishes very specific activities. While not foolproof, this helps teams get a better sense of what the product is technically adept at doing, and it gives them something concrete to focus on in a demo - which matters, given that many Generative AI demos are very text-heavy. A simple way to structure this kind of step-by-step scoring is sketched just after this list.
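
One lightweight way to put the second question into practice is to break the workflow into discrete steps before the demo and score each step separately, rather than judging the demo as a whole. The sketch below is a hypothetical Python scoring template; the step names and weights are placeholders to be replaced with the steps that matter for your own workflow.

```python
# Hypothetical demo-evaluation rubric: break the workflow into steps,
# weight the steps that matter most, and score each one during the demo.
workflow_steps = [
    # (step, weight, score from 0-5 filled in during the demo)
    ("Extracts the right fields from source documents", 0.4, None),
    ("Summarizes extracted content accurately",          0.3, None),
    ("Drafts sections in the required template/format",  0.2, None),
    ("Supports review and regeneration of a draft",      0.1, None),
]

def weighted_score(steps):
    """Combine per-step scores into a single weighted number (0-5 scale)."""
    scored = [(weight, score) for _, weight, score in steps if score is not None]
    if not scored:
        return None
    return sum(w * s for w, s in scored) / sum(w for w, _ in scored)

# Example: fill in scores after the demo and compare vendors on the same rubric.
example = [(name, w, s) for (name, w, _), s in zip(workflow_steps, [4, 3, 5, 2])]
print(round(weighted_score(example), 2))
```

The value of a rubric like this is less the final number than the discipline it imposes: every vendor gets asked about the same steps, and gaps in a demo become visible instead of being glossed over.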

Asking these questions isn’t foolproof, but it’s difficult for many organizations to get a more robust sense of how a Generative AI tool works without providing data that is proprietary and/or specific to the company. And that process can be both risky and resource-intensive, since teams beyond the SMEs may have to get involved earlier in the process.

For organizations that are still looking for a more rigorous evaluation, a relatively simple approach is to use redacted and/or public data from sources like clinicaltrials.gov or Drugs@FDA as a stand-in for company-specific data. This makes it possible to see how an AI tool performs on the relevant kinds of documents and workflows without exposing proprietary information.
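
As a sketch of what this could look like in practice, the snippet below pulls a handful of public study records into a small, shareable evaluation set. It assumes the publicly documented ClinicalTrials.gov v2 API (https://clinicaltrials.gov/api/v2/studies) and the Python requests library; the exact parameters and response fields should be verified against the current API documentation before relying on them.

```python
import json
import requests

# Illustrative sketch: pull a few public study records from the
# ClinicalTrials.gov v2 API to use as a non-proprietary evaluation set.
# Endpoint, parameters, and field names reflect the publicly documented
# API and should be double-checked before use.
API_URL = "https://clinicaltrials.gov/api/v2/studies"

params = {
    "query.cond": "rheumatoid arthritis",  # example condition; swap in one relevant to your portfolio
    "pageSize": 5,
}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()
studies = response.json().get("studies", [])

# Keep a small, shareable test set you can run through any vendor's tool
# without exposing proprietary documents.
test_set = []
for study in studies:
    ident = study.get("protocolSection", {}).get("identificationModule", {})
    test_set.append({
        "nct_id": ident.get("nctId"),
        "title": ident.get("briefTitle"),
    })

with open("public_eval_set.json", "w") as f:
    json.dump(test_set, f, indent=2)

print(f"Saved {len(test_set)} public study records for evaluation.")
```

The same idea works with redacted internal documents: the point is to have a consistent, non-proprietary test set that every vendor under consideration is evaluated against.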

Conclusion

Evaluating Generative AI tools in the life sciences involves a nuanced approach that acknowledges the unique characteristics of these technologies. Understanding the probabilistic nature of Generative AI, thoroughly testing tools with relevant data, and focusing on specific workflow changes are crucial steps that organizations can take to de-risk AI adoption.
