Chatbots provided incorrect, conflicting medical advice, researchers found: “Despite all the hype, AI just isn’t ready to take on the role of the physician.”

“In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice,” the study’s authors wrote. “One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care.”

  • BeigeAgenda@lemmy.ca · 47 points · 9 days ago

    Anyone who has actual knowledge of a specific subject says the same thing: LLMs are constantly incorrect and hallucinate.

    Everyone else thinks it looks right.

    • IratePirate@feddit.org · 32 points · 9 days ago

      A talk on LLMs I was listening to recently put it this way:

      If we hear the words of a five-year-old, we assume the knowledge of a five-year-old behind those words, and treat the content with due caution.

      We’re not adapted to something with the “mind” of a five-year-old speaking to us in the words of a fifty-year-old, and thus are more likely to assume competence just based on language.

      • leftzero@lemmy.dbzer0.com · 15 points · 9 days ago

        LLMs don’t have the mind of a five year old, though.

        They don’t have a mind at all.

        They simply string words together according to statistical likelihood, without having any notion of what the words mean, or what words or meaning are; they don’t have any mechanism with which to have a notion.

        They aren’t any more intelligent than old Markov chains (or than your average rock), they’re simply better at producing random text that looks like it could have been written by a human.
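
        It’s literally the same trick as the toy Markov chain below, just scaled up enormously. (A rough sketch in Python; the “corpus” is made up and tiny, but the mechanism is the point.)

        ```python
        import random
        from collections import defaultdict

        # Tiny made-up corpus; a real model is trained on vastly more text.
        corpus = ("the patient has a headache the patient has a fever "
                  "the doctor sees the patient").split()

        # First-order Markov chain: record which words follow which.
        transitions = defaultdict(list)
        for current, nxt in zip(corpus, corpus[1:]):
            transitions[current].append(nxt)

        def generate(start="the", length=8):
            """Emit words by repeatedly picking a statistically likely successor."""
            word, out = start, [start]
            for _ in range(length - 1):
                followers = transitions.get(word)
                if not followers:
                    break
                word = random.choice(followers)
                out.append(word)
            return " ".join(out)

        print(generate())  # e.g. "the patient has a fever the doctor sees"
        ```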

        • IratePirate@feddit.org · 6 points · 9 days ago

          I am aware of that, hence the quotes around “mind”. But you’re correct, that’s where the analogy breaks down. Personally, I prefer to liken them to parrots, mindlessly reciting patterns they’ve found in somebody else’s speech.

        • plyth@feddit.org · 2 points · 9 days ago

          They simply string words together according to statistical likelihood, without having any notion of what the words mean

          What gives you the confidence that you don’t do the same?

    • tyler@programming.dev · 7 points · 9 days ago

      That’s not what the study showed, though. The LLMs were right roughly 95% of the time…when given the full situation by a “doctor”. The problem was ordinary people who didn’t know what was important trying to self-diagnose.

      This is why studies are incredibly important. Even with the text of the study right in front of you, you assumed something that the study never concluded.

    • zewm@lemmy.world · 7 points · 9 days ago

      It is insane to me how anyone can trust LLMs when their information is incorrect 90% of the time.

      • SuspciousCarrot78@lemmy.world · 1 point · 9 days ago

        I don’t think it’s their information per se, so much as how the LLMs tend to use said information.

        LLMs are generally tuned to be expressive and lively. Part of that involves “random” (i.e. roll-the-dice) output based on the input plus the training data. (I’m skipping over technical details here for the sake of simplicity.)
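
        Very roughly, the dice roll looks like this (toy next-word scores, not a real model; a real one has tens of thousands of candidates):

        ```python
        import math
        import random

        # Hypothetical scores for the next chunk of advice; purely illustrative.
        logits = {"rest at home": 2.1, "drink fluids": 1.7, "see a doctor": 1.5, "go to the ER": 0.4}
        temperature = 0.8  # higher = livelier/more random, lower = more deterministic

        # Softmax over temperature-scaled scores, then sample from the distribution.
        weights = {tok: math.exp(score / temperature) for tok, score in logits.items()}
        total = sum(weights.values())
        probs = {tok: w / total for tok, w in weights.items()}

        print(probs)
        print(random.choices(list(probs), weights=list(probs.values()))[0])
        # The same symptoms can produce different advice on different runs.
        ```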

        That’s what the masses have shown they want: friendly, confident-sounding chatbots that can give plausible answers that are mostly right, sometimes.

        But for certain domains (like med) that shit gets people killed.

        TL;DR: they’re made for chitchat engagement, not high-fidelity expert systems. You have to pay $$$$ to access those.

    • agentTeiko@piefed.social · 5 points · 9 days ago

      Yep, it’s why C-levels think it’s the Holy Grail: everything that comes out of their own mouths is bullshit too, so they don’t see the difference.

  • Buddahriffic@lemmy.world · 17 points · 8 days ago

    Funny, because medical diagnosis is actually one of the areas where AI can be great, just not fucking LLMs. It’s not even really AI, but a decision tree that asks which symptoms are present and which are missing, eventually reaching the point where a doctor or nurse has to do evaluations or tests to keep moving through the flowchart, until you get to a leaf where you either have a diagnosis (plus ways to confirm or rule it out) or something new (at least to the system).
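
    The skeleton is dead simple; something like this toy sketch (the questions, thresholds and advice are all invented for illustration, nothing clinical):

    ```python
    # Toy triage tree; every question, branch and recommendation here is made up.
    TREE = {
        "question": "Sudden, worst-ever headache?",
        "yes": {"leaf": "Get to the ER ASAP"},
        "no": {
            "question": "Fever above 39 C for more than three days?",
            "yes": {"leaf": "See a doctor within a week"},
            "no": {"leaf": "Monitor at home; seek care if symptoms worsen"},
        },
    }

    def triage(node, answers):
        """Walk the tree with yes/no answers; doctors extend the tree over time."""
        while "leaf" not in node:
            node = node["yes"] if answers[node["question"]] else node["no"]
        return node["leaf"]

    print(triage(TREE, {"Sudden, worst-ever headache?": True}))
    print(triage(TREE, {"Sudden, worst-ever headache?": False,
                        "Fever above 39 C for more than three days?": False}))
    ```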

    Problem is that this kind of system would need to be built up by doctors, though they could probably get a lot of the way there using journaling and some algorithm to convert the journals into the decision tree.

    The end result would be a system that can start triage at the user’s home to help determine the urgency of a medical visit (get to the ER ASAP; see a walk-in or family doctor in the next week; it’s OK if you can’t get an appointment for a month; or stay home, monitor it, and seek medical help if x, y, z happens). It could then hand that info to the next HCW you see, so they can recheck the things non-doctors often get wrong and pick up from there. Plus it would help doctors be more consistent, flag when symptoms match conditions they aren’t familiar with, and make it harder to excuse the incompetence or apathy that leads to a “just get rid of them” response.

    Instead, people are trying to make AI doctors out of word-correlation engines, like the Hardy Boys following a trail of random word associations (except reality isn’t written to make them right in the end because that’s funny, like in South Park).

    • selokichtli@lemmy.ml · 6 points · 8 days ago

      Have you seen LLMs trying to play chess? They can move some pieces alright, but at some point it’s like they just decide to put their cat in the middle of the board. True chess engines, meanwhile, play at a level not even grandmasters can follow.

    • sheogorath@lemmy.world · 6 points · 8 days ago

      Yep, I’ve worked on systems like these, and we actually had doctors on our development team to make sure the diagnoses were accurate.

      • ranzispa@mander.xyz · 1 point · 6 days ago

        Same, and my conclusion is that we have too much faith in doctors. Not that Llama is good at being a doctor, but apparently in many cases it will outperform one, especially if the doctor isn’t specialized in treating that type of patient. And it does often happen around here that doctors treat patients with conditions outside their area of expertise.

    • XLE@piefed.social (OP) · 3 points · 8 days ago

      I think you just described a conventional computer program. It would be easy to build. It would be easy to debug if something went wrong. And it would be easy to read both the source code and the data that went into it. I’ve seen rudimentary symptom checkers online since forever, and compared to forms in doctors’ offices, a digital one could actually expand into the relevant follow-up sections.

      Edit: you caught my typo

      • Buddahriffic@lemmy.world · 3 points · 8 days ago

        (Assuming you meant “you” instead of “I” for the 3rd word)

        Yeah, it fits more with the older definition of AI from before NNs took the spotlight, when it meant more of a normal program that acted intelligently.

        The learning part is being able to add new branches or leaf nodes to the tree; the program isn’t learning on its own, but it improves based on the experiences of its users.

        It could also be encoded as a series of probability multiplications instead of a tree, where it checks whichever condition currently has the highest probability, using the checks/questions that are cheapest to ask but affect the probabilities the most.
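
        Something like this toy sketch, say (the priors and likelihoods are completely made up; it just shows the multiply-and-renormalize mechanic):

        ```python
        # Made-up priors and symptom likelihoods; purely to illustrate the mechanics.
        PRIORS = {"migraine": 0.10, "tension headache": 0.30, "flu": 0.05}
        LIKELIHOOD = {  # P(symptom present | condition), all invented numbers
            "migraine":         {"fever": 0.05, "light sensitivity": 0.80},
            "tension headache": {"fever": 0.02, "light sensitivity": 0.20},
            "flu":              {"fever": 0.90, "light sensitivity": 0.10},
        }

        def update(scores, symptom, present):
            """Multiply each condition's score by how well it explains the answer, then renormalize."""
            new = {}
            for cond, score in scores.items():
                p = LIKELIHOOD[cond][symptom]
                new[cond] = score * (p if present else 1.0 - p)
            total = sum(new.values())
            return {cond: s / total for cond, s in new.items()}

        scores = dict(PRIORS)
        for symptom, present in [("fever", False), ("light sensitivity", True)]:
            scores = update(scores, symptom, present)
        print(scores)  # migraine is now the front-runner; the next question would be whichever shifts these most
        ```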

        That could then be encoded as a NN, because both are just a series of matrix multiplications, which a NN can approximate to arbitrary precision given enough parameters. NNs are also proven to be able to approximate any continuous function over any number of real-valued inputs, given enough neurons and connections, which means they can match any discrete function (which is what a decision tree is) as closely as you like.
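
        For a single split you can see the equivalence directly; a toy example (made-up threshold and weights, nothing medical about the numbers):

        ```python
        import math

        # One decision-tree split: "temperature above 38.0? -> fever, else no fever"
        def stump(temp_c):
            return 1.0 if temp_c > 38.0 else 0.0

        # One sigmoid neuron approximating the same split; sharper weights -> closer to a step function.
        def neuron(temp_c, w=50.0, b=-50.0 * 38.0):
            return 1.0 / (1.0 + math.exp(-(w * temp_c + b)))

        for t in (36.5, 37.9, 38.1, 39.5):
            print(t, stump(t), round(neuron(t), 3))  # neuron output hugs 0 or 1 away from the threshold
        ```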

        It’s an open question still, but it’s possible that the equivalence goes both ways, as in a NN can represent a decision tree and a decision tree can approximate any NN. So the actual divide between the two is blurrier than you might expect.

        Which is also why I’ll always be skeptical that NNs on their own can give rise to true artificial intelligence (though there’s also a part of me that wonders if we can be represented by a complex enough decision tree or series of matrix multiplications).

        • _g_be@lemmy.world · 2 points · 8 days ago

          Could be a great idea, if people could be trusted to correctly interpret things outside their scope of expertise. The parallel I’m thinking of is IT, where people will happily and repeatedly call a monitor “the computer”. Imagine telling the AI your heart hurts when it’s actually muscle spasms or indigestion.

          The value of medical professionals is not just the raw knowledge but the practice of objective assessment and deduction from symptoms, in a way I don’t foresee a public-facing system being able to replicate.

          • Buddahriffic@lemmy.world · 1 point · 8 days ago

            Over time, the more common mistakes would be integrated into the tree. If some people feel indigestion as a headache, then there will be a probability that “headache” is caused by “indigestion” and questions to try to get the user to differentiate between the two.

            And it would be a supplement to doctors rather than a replacement. Early questions could be handled by the users themselves, but at some point a nurse or doctor will take over and just use it as a diagnosis helper.

            • _g_be@lemmy.world · 2 points · 8 days ago

              As a supplement to doctors, that sounds like a fantastic use of AI. Then it’s an encyclopedia you engage in conversation with.

      • nelly_man@lemmy.world · 2 points · 8 days ago

        They’re talking more about Expert Systems or Inference Engines, which were some of the earlier forms of applications used in AI research. In terms of software development, they are closer to databases than traditional software. That is, the system is built up by defining a repository of base facts and logical relationships, and the engine can use that to return answers to questions based on formal logic.
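
        A minimal sketch of the idea, if it helps (the facts and rules here are invented, not medical guidance):

        ```python
        # Tiny forward-chaining inference engine over an invented fact/rule base.
        facts = {"fever", "stiff neck"}
        rules = [
            ({"fever", "stiff neck"}, "suspect meningitis"),
            ({"suspect meningitis"}, "recommend emergency care"),
            ({"fever", "cough"}, "suspect flu"),
        ]

        # Keep firing rules whose conditions are all satisfied until nothing new is derived.
        changed = True
        while changed:
            changed = False
            for conditions, conclusion in rules:
                if conditions <= facts and conclusion not in facts:
                    facts.add(conclusion)
                    changed = True

        print(facts)  # includes "suspect meningitis" and "recommend emergency care"
        ```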

        So they are bringing this up as a good use-case for AI because it has been quite successful. The thing is that it is generally best implemented for specific domains to make it easier for experts to access information that they can properly assess. The “one tool for everything in the hands of everybody” is naturally going to be a poor path forward, but that’s what modern LLMs are trying to be (at least, as far as investors are concerned).

  • alzjim@lemmy.world · 17 points · 9 days ago

    Calling chatbots “terrible doctors” misses what actually makes a good GP — accessibility, consistency, pattern recognition, and prevention — not just physical exams. AI shines here — it’s available 24/7 🕒, never rushed or dismissive, asks structured follow-up questions, and reliably applies up-to-date guidelines without fatigue. It’s excellent at triage — spotting red flags early 🚩, monitoring symptoms over time, and knowing when to escalate to a human clinician — which is exactly where many real-world failures happen. AI shouldn’t replace hands-on care — and no serious advocate claims it should — but as a first-line GP focused on education, reassurance, and early detection, it can already reduce errors, widen access, and ease overloaded systems — which is a win for patients 💙 and doctors alike.

    /s

  • Sterile_Technique@lemmy.world · 16 points · 9 days ago

    Chipmunks, 5-year-olds, salt/pepper shakers, and paint thinner also all make terrible doctors.

    Follow me for more studies on ‘shit you already know because it’s self-evident immediately upon observation’.

    • scarabic@lemmy.world · 5 points · 9 days ago

      It’s actually interesting. They found the LLMs gave the correct diagnosis around 95 percent of the time if they had access to the notes doctors wrote about the symptoms. But when thrust into the room, cold, with patients, the LLMs couldn’t gather that symptom info themselves.

      • Hacksaw@lemmy.ca · 7 points · 9 days ago

        LLM gives correct answer when doctor writes it down first… Wowoweewow very nice!

          • Hacksaw@lemmy.ca · 4 points · 9 days ago

            If you seriously think the doctor’s notes about the patient’s symptoms don’t include the doctor’s diagnostic instincts then I can’t help you.

            The symptom questions ARE the diagnostic work. Your doctor doesn’t ask you every possible question. You show up and say “my stomach hurts”. The doctor asks questions to rule things out until there is only one likely diagnosis, then they stop and prescribe you a solution if one is available. They don’t just ask a random set of questions. If you give the AI the notes from JUST BEFORE the diagnosis and treatment, it’s completely trivial to diagnose, because the diagnostic work is already complete.

            God you AI people literally don’t even understand what skill, craft, trade, and art are and you think you can emulate them with a text predictor.

            • SuspciousCarrot78@lemmy.world · 1 point · 9 days ago

              You’re over-egging it a bit. A well-written SOAP note, HPI, etc. should distill things down to a handful of possibilities, that’s true. That’s the whole point of them.

              The fact that the LLM can interpret those notes about 95% as well as a medically trained individual (per the article) and come up with the correct diagnosis is being a little undersold.

              That’s not nothing. Actually, that’s a big fucking deal™ if you think through the edge-case applications. And remember, these are just general-purpose LLMs, and pretty old ones at that (ChatGPT-4 era). We’re not even talking about a medical domain-specific LLM.

              Yeah; I think there’s more here to think on.

            • tyler@programming.dev · 1 point · 9 days ago

              Dude, I hate AI. I’m not an AI person. Don’t fucking classify me as that. You’re the one not reading the article and subsequently the study. It didn’t say it included the doctor’s diagnostic work. The study wasn’t about whether LLMs are accurate for doctors, that’s already been studied. The study this article talks about literally says that. Apparently LLMs are passing medical licensing exams almost 100% of the time, so it definitely has nothing to do with diagnostic notes. This study was about using LLMs to diagnose yourself. That’s it. That’s the study. Don’t spread bullshit. It’s tiring debunking stuff that is literally two sentences in.

              https://www.nature.com/articles/s41591-025-04074-y

        • scarabic@lemmy.world · 1 point · 9 days ago

          If you think there’s no work between symptoms and diagnosis, you’re dumber than you think LLMs are.

      • SuspciousCarrot78@lemmy.world · 1 point · 8 days ago

        Funny how people overlook that bit en route to dunking on LLMs.

        If anything, that mid-90s result supports the idea that Garbage In = Garbage Out. I imagine a properly used, domain-tuned medical model with structured inputs could exceed those results in some diagnostic settings (task-dependent).

        IIRC, the 2024 Nobel Prize in Chemistry was won on the basis of using an ML expert system to investigate protein folding. ML != LLM, but at the same time, let’s not throw the baby out with the bathwater.

        EDIT: for the lulz, I fed my above comment to my locally hosted bespoke LLM. It politely called out my bullshit (AlphaFold is technically not an expert system, and I didn’t cite my source for the Med-PaLM 2 claims). As shown, not all LLMs are tuned to be sycophantic yes-men; there might be a sliver of hope yet lol.


        The statement contains a mix of plausible claims and minor logical inconsistencies. The core idea—that expert systems using ML can outperform simple LLMs in specific tasks—is reasonable.

        However, the claim that “a properly used expert system LLM (Med-PALM-2) is even better than 90% accurate in differentials” is unsupported by the provided context and overreaches from the general “Garbage In = Garbage Out” principle.

        Additionally, the assertion that the 2024 Nobel Prize in Chemistry was won “on the basis of using ML expert system to investigate protein folding” is factually incorrect; the prize was awarded for AI-assisted protein folding prediction, not an ML expert system per se.

        Confidence: medium | Source: Mixed

  • GnuLinuxDude@lemmy.ml · 11 points · 10 days ago

    If you want to read an article that’s optimistic about AI and healthcare, but that falls apart if you start asking too many questions, try this one:

    https://text.npr.org/2026/01/30/nx-s1-5693219/

    Because it’s clear that people are starting to use it and many times the successful outcome is it just tells you to see a doctor. And doctors are beginning to use it, but they should have the professional expertise to understand and evaluate the output. And we already know that LLMs can spout bullshit.

    For the purposes of using and relying on it, I don’t see how it is very different from gambling. You keep pulling the lever, oh excuse me I mean prompting, until you get the outcome you want.

    • MinnesotaGoddam@lemmy.world · 3 points · 9 days ago

      The one time my doctor used it and I didn’t get mad at them (they did the Google and said “the AI says…”, and I started making angry Nottingham noises even though all the AI did was tell us that exactly what we had just been discussing was correct)… uh, well, that’s pretty much it. I’m not sure where my parens are supposed to open and close on that story.

      • GnuLinuxDude@lemmy.ml · 8 points · 9 days ago

        Be glad it was merely that and not something like this: https://www.reuters.com/investigations/ai-enters-operating-room-reports-arise-botched-surgeries-misidentified-body-2026-02-09/

        In 2021, a unit of healthcare giant Johnson & Johnson announced “a leap forward”: It had added artificial intelligence to a medical device used to treat chronic sinusitis, an inflammation of the sinuses…

        At least 10 people were injured between late 2021 and November 2025, according to the reports. Most allegedly involved errors in which the TruDi Navigation System misinformed surgeons about the location of their instruments while they were using them inside patients’ heads during operations.

        Cerebrospinal fluid reportedly leaked from one patient’s nose. In another reported case, a surgeon mistakenly punctured the base of a patient’s skull. In two other cases, patients each allegedly suffered strokes after a major artery was accidentally injured.

        FDA device reports may be incomplete and aren’t intended to determine causes of medical mishaps, so it’s not clear what role AI may have played in these events. The two stroke victims each filed a lawsuit in Texas alleging that the TruDi system’s AI contributed to their injuries. “The product was arguably safer before integrating changes in the software to incorporate artificial intelligence than after the software modifications were implemented,” one of the suits alleges.

  • dandelion (she/her)@lemmy.blahaj.zone · 11 points · 9 days ago

    link to the actual study: https://www.nature.com/articles/s41591-025-04074-y

    Tested alone, LLMs complete the scenarios accurately, correctly identifying conditions in 94.9% of cases and disposition in 56.3% on average. However, participants using the same LLMs identified relevant conditions in fewer than 34.5% of cases and disposition in fewer than 44.2%, both no better than the control group. We identify user interactions as a challenge to the deployment of LLMs for medical advice.

    The findings were more that users were unable to use the LLMs effectively (even though the LLMs were competent when provided with the full information):

    despite selecting three LLMs that were successful at identifying dispositions and conditions alone, we found that participants struggled to use them effectively.

    Participants using LLMs consistently performed worse than when the LLMs were directly provided with the scenario and task

    Overall, users often failed to provide the models with sufficient information to reach a correct recommendation. In 16 of 30 sampled interactions, initial messages contained only partial information (see Extended Data Table 1 for a transcript example). In 7 of these 16 interactions, users mentioned additional symptoms later, either in response to a question from the model or independently.

    Participants employed a broad range of strategies when interacting with LLMs. Several users primarily asked closed-ended questions (for example, ‘Could this be related to stress?’), which constrained the possible responses from LLMs. When asked to justify their choices, two users appeared to have made decisions by anthropomorphizing LLMs and considering them human-like (for example, ‘the AI seemed pretty confident’). On the other hand, one user appeared to have deliberately withheld information that they later used to test the correctness of the conditions suggested by the model.

    Part of what a doctor is able to do is recognize a patient’s blind spots and critically analyze the situation. The LLM, on the other hand, responds based on the information it is given, and does not do well when users provide partial or insufficient information, or when users mislead it by providing incorrect information (if a patient speculates about potential causes, a doctor would know to dismiss the incorrect guesses, whereas an LLM would constrain its responses based on those bad suggestions).

    • SocialMediaRefugee@lemmy.world · 2 points · 9 days ago

      Yes, LLMs are critically dependent on your input, and if you give them too little info they will enthusiastically respond with what can be incorrect information.

  • homes@piefed.world · 9 points · 10 days ago

    This is a major problem with studies like this: they approach it from a position of assuming that AI doctors would be competent, rather than from a position of demanding to know why AI should ever be involved with something so critical, and demanding a mountain of evidence to prove it is worthwhile before investing a penny or a second in it.

    “ChatGPT doesn’t require a wage,” and, before you know it, billions of people are out of work and everything costs 10000x your annual wage (when you were lucky enough to still have one).

    How long until the workers revolt? How long have you gone without food?