Can we count on LLMs for medical advice?
Illustration by DALL-E of a medical scientist doing research
It’s pretty clear that LLMs and associated generative AI tools (see the illustration above, a lovely image generated by DALL-E) can come in handy when given decent prompts.
Our SearchResearch Challenge for this week was an open-ended one, intended to probe the limits and capabilities of LLMs for high-stakes research.
Here was the Challenge for the week:
1. What good advice can we give a person who wants to use an LLM for high-stakes research into a medical question? We know people are doing this, and all of our ranting about LLMs won’t stop them–but can we find ways of using LLMs and specific LLM/AI tools to improve the search process? What would you say to people who are doing this? (Aside from “be very, very careful.”)
I found some pretty serious problems when asking LLMs for medical advice. My overall impression is that LLMs are just NOT ready for use in medical searching. There are other possible uses, but asking an LLM for medical information is basically rolling the dice–you might get good info, and you might get terrible info. My recommendation:
DON’T ASK ANY LLM FOR MEDICAL INFORMATION,
especially diagnostic and treatment information
I spent several hours trying out different kinds of prompts with several popular LLMs. This isn’t all of the testing I did, but I’m summarizing so this post doesn’t go on forever and ever.
I’m grouping my comments by three main types of failure:
Citations are wrong (and if they’re made up, what else is made up?):
Bard:
Prompt to Bard: “What is the latest literature on Glucagon-Like Peptide-1 Receptor Agonists (GLP-1 RAs)? Give citations for articles you mention.”
Bard totally made up citations. Complete BS. For example, here is its purported summary of some of the latest literature on GLP-1 RAs, with citations for the articles mentioned:
Bard: GLP-1 RAs may reduce the risk of all-cause death in people with type 2 diabetes. A recent study published in the journal Diabetes Care found that GLP-1 RA users had a 14% lower risk of death from all causes than people who did not use GLP-1 RAs. It suggested a citation: Zhang H, et al. Association of Glucagon-Like Peptide-1 Receptor Agonist Use With All-Cause Mortality in Patients With Type 2 Diabetes. Diabetes Care. 2023;46(3):503-511.
But this is utterly wrong: No such article exists. Given that, do you still believe the assertion that GLP-1 RA users have a 14% lower risk of death? Where did it pull that number from? I have no idea, and neither does Bard.
GPT-4: Interestingly, when I gave the same prompt to GPT-4, it gave real citations (after noting that its knowledge was only up-to-date as of Jan 2022). It did not try to summarize the literature (as Bard did), but just gave a few citations. It also gave good advice about how to further the research task: “I recommend using databases like PubMed or Google Scholar and searching for terms related to GLP-1 RAs.” That’s pretty reasonable.
Claude: Same prompt, but very different citations. That’s fine (they’re real citations), but the summaries of the articles were fairly off–they seemed to describe different papers than the ones cited! What’s up with that?
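One practical defense against this failure mode: before you believe a claim, check whether the cited article actually exists in a real bibliographic database. Here’s a minimal sketch of that check, assuming Python 3 with the `requests` package and NCBI’s public E-utilities endpoint (the helper function name is mine, not part of any tool mentioned above):

```python
# Minimal sketch: check whether a cited article title actually appears in PubMed.
# Assumes Python 3 with the `requests` package; uses NCBI's public E-utilities API.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_hit_count(title: str) -> int:
    """Return the number of PubMed records whose title matches the quoted phrase."""
    params = {
        "db": "pubmed",
        "term": f'"{title}"[Title]',  # phrase search restricted to the Title field
        "retmode": "json",
    }
    resp = requests.get(EUTILS, params=params, timeout=30)
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

# The citation Bard offered: if the count comes back 0, treat the claim it
# supposedly supports with deep suspicion.
cited_title = ("Association of Glucagon-Like Peptide-1 Receptor Agonist Use "
               "With All-Cause Mortality in Patients With Type 2 Diabetes")
print(pubmed_hit_count(cited_title))
```

A zero hit count isn’t absolute proof the citation is fake (titles get mangled), but it’s a fast first check before you spend any trust on the summary attached to it.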
Output is inconsistent with itself:
Bard: Here I’m intentionally asking a question that a patient might ask, “why is skin cancer more deadly than liver cancer?” knowing full well that this is the opposite of the facts.
Prompt to Bard: “Why is skin cancer more deadly than liver cancer?”
Bard: Skin cancer is not more deadly than liver cancer overall. In fact, the 5-year survival rate for all skin cancer is 99%, while the 5-year survival rate for liver cancer is 19.6%. However, there is one type of skin cancer, melanoma, which is more deadly than liver cancer. Melanoma has a 5-year survival rate of 92.3%, which is lower than the 5-year survival rate for liver cancer.
Bard got the first part right, but the assertion that “melanoma has a 5-year survival rate of 92.3%, which is lower than the 5-year survival rate for liver cancer” doesn’t make sense (how is 92.3 lower than 19.6?). What’s up with that?
GPT-4: Once again, GPT-4 gives a pretty reasonable answer. First, it questions my assumption (that skin cancer is more deadly), then lists reasons why people might consider some types of cancer more deadly than others. It’s quite a reasonable argument, and it ends with good advice about distinguishing between different kinds of cancer when discussing prognosis and mortality.
Claude: Gives an answer much like Bard’s, but doesn’t mix up the stats. It’s actually mostly correct.
Don’t trust the reasoning of LLMs:
I asked each of the 3 LLMs this question: “If it takes 3 days to clear up an infection with a 10 unit dose of an antibiotic, how long will it take if I use 30 units per day?”
This, of course, is a terrible idea. Don’t fool around with treatment plans!
But when I asked Bard, it gave a quite reasonable framing of the question…
Bard:
It is important to follow the prescribed dosage of antibiotics to ensure their effectiveness and avoid potential adverse effects. Taking a higher dose than recommended can lead to various complications and may not necessarily clear up the infection faster.
In this case, if a 10-unit dose of an antibiotic is prescribed to clear up an infection in 3 days, it is advisable to stick to that dosage and duration. Increasing the dose to 30 units per day could potentially lead to adverse effects without necessarily shortening the treatment duration.
GPT-4: Gives a rather similar reply:
I must preface my response by emphasizing that you should never adjust medication doses without consulting a healthcare professional. Self-adjusting doses can lead to unwanted side effects, antibiotic resistance, or ineffective treatment.
That said, based solely on the information you’ve provided and treating it as a theoretical exercise, the relationship between dosage and time isn’t necessarily linear. The effectiveness of antibiotics depends on factors like…
But Claude totally bobbled this one with an answer that’s long on math and utterly lacking in common-sense reasoning about drug therapy.
Claude:
Okay, let’s think through this step-by-step:
It takes 3 days to clear up the infection with a 10 unit daily dose of the antibiotic
So over the 3 days, the total dose is 3 * 10 = 30 units
Now we are using 30 units per day instead of 10 units per day
Since the total dose needed to clear the infection is 30 units:
If we give 30 units per day, it should clear the infection in 1 day
So in summary:
Total dose needed is 30 units
At 10 units per day, it takes 3 days
At 30 units per day, the total dose of 30 units is given in 1 day
Therefore, if the daily dose is increased to 30 units per day, the infection should clear up in 1 day.
Which is a terrible idea.
You see where this is headed: Each of the LLMs has a very different training set, set of guidelines, and text-production model.
In general, GPT-4 gives much more extensive answers to questions, and avoids summarizing technical medical texts. I wasn’t able to trip it up (but I didn’t really push on the boundaries).
By contrast, Bard always gives much shorter answers, frequently hallucinating reports and studies.
And Claude is somewhere in the middle, with medium-length texts, but it will sometimes give answers that defy common sense. (Don’t ever modify your drug plan without talking with your physician!)
What can we do that is useful?
My direct advice would be to NOT ask for a diagnosis or suggestions for treatment. That’s probably still dangerous.
On the other hand… I did find some value in asking LLMs for descriptions of medical conditions, therapies, and medicines. This makes sense because there is a lot of training data out there on such topics. So questions like “what is the IT band?” or “what are some side effects of taking cough medicine?” tend to work reasonably well.
SearchResearch Lessons
I could go on and on, showing all kinds of subtle errors and mistakes that the models make.
1. Bottom line: They’re not ready for medical advice. They might be in the future, but at the moment, they’re all too able to make errors that could be serious.
2. They all give very different kinds of answers. GPT-4 tends to go into enormous detail about a condition. If that’s what you want, go there. For shorter answers at a slightly easier-to-read level, visit Bard.
3. Compare and contrast. All of the LLMs (including ones I don’t mention here) have rather different outputs. It’s worth looking at them side-by-side, as in the sketch after this list.
4. Don’t forget your ordinary search skills. Do I really need to say this? Fact-check everything, and look for highly reliable sources in your quest.
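If you have API access, lesson 3 is easy to put into practice. Below is a minimal sketch, assuming Python with the `openai` (v1) and `anthropic` SDKs installed and API keys set in the environment (OPENAI_API_KEY, ANTHROPIC_API_KEY); the model names are placeholders I chose, not recommendations. It sends one prompt to two models and prints the answers for side-by-side reading. (At the time of writing, Bard is easiest to compare by pasting the same prompt into its web interface.)

```python
# Minimal side-by-side sketch: send the same prompt to two LLM APIs and print
# both answers for comparison. Assumes the `openai` (v1) and `anthropic` Python
# SDKs are installed and that API keys are set in the environment.
from openai import OpenAI
import anthropic

PROMPT = ("What is the latest literature on Glucagon-Like Peptide-1 "
          "Receptor Agonists (GLP-1 RAs)? Give citations for articles you mention.")

def ask_openai(prompt: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_anthropic(prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-3-haiku-20240307",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

for name, answer in [("GPT-4", ask_openai(PROMPT)), ("Claude", ask_anthropic(PROMPT))]:
    print(f"===== {name} =====\n{answer}\n")
```

Reading the two answers next to each other makes the differences in citations, level of detail, and hedging obvious in a way that reading either one alone does not.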
Keep searching!