Why AI Health Chatbots Won't Make You Better at Diagnosing Yourself

A new study reveals that while AI chatbots possess extensive medical knowledge, they perform poorly in helping users make real-world health decisions. Users who consulted chatbots were less likely to identify the correct condition and no better at determining appropriate care. The core issue lies in human-machine communication barriers, not a lack of AI knowledge.

Millions of people are turning to artificial intelligence (AI) chatbots for advice on everything from cooking to tax returns. Increasingly, they are also asking chatbots about their health. However, as the UK's chief medical officer recently warned, this may not be wise when it comes to medical decisions. In a recent study, my colleagues and I tested how well large language model (LLM) chatbots help the public deal with common health problems. The results were striking.

Study Findings: Chatbots Are Not Ready to Be Doctors

The chatbots we tested were not ready to act as doctors. We gave participants brief descriptions of common medical situations. They were randomly assigned either to use one of three widely available chatbots or to rely on whatever sources they would normally use at home. Afterwards, we asked two questions: what condition might explain the symptoms, and where should they seek help?

The study found that:

  • People who used chatbots were less likely to identify the correct condition than those who didn't.
  • They were also no better at determining the right place to seek care than the control group.

In other words, interacting with a chatbot did not help people make better health decisions.

Strong Knowledge, Weak Communication

This does not mean the models lack medical knowledge, as LLMs can pass medical licensing exams with ease. When we removed the human element and gave the same scenarios directly to the chatbots, their performance improved dramatically. Without human involvement, the models identified relevant conditions in the vast majority of cases and often suggested appropriate levels of care.

So why did the results deteriorate when people actually used the systems? When we examined the conversation transcripts, the problems became clear:

  • Chatbots frequently mentioned the relevant diagnosis somewhere in the conversation, yet participants did not always notice or remember it when summarizing their final answer.
  • In other cases, users provided incomplete information or the chatbot misinterpreted key details.

The issue was not simply a failure of medical knowledge – it was a failure of communication between human and machine.

Implications for Medical AI Development

The study shows that policymakers need information about the real-world performance of technology before introducing it into high-stakes settings such as frontline healthcare. Our findings highlight an important limitation of many current evaluations of AI in medicine. Language models often perform extremely well on structured exam questions or simulated "model-to-model" interactions.

But real-world use is much messier. Patients describe symptoms in vague or incomplete ways and can misunderstand explanations. They ask questions in unpredictable sequences. A system that performs impressively on benchmarks may behave very differently once real people begin interacting with it.

It also underscores a broader point about clinical care. As a GP, my job involves far more than recalling facts. Medicine is often described as an art rather than a science. A consultation isn't simply about identifying the correct diagnosis. It involves interpreting a patient's story, exploring uncertainty, and negotiating decisions.