An MIT study finds that non-clinical information in patient messages, such as typos, extra whitespace, or colorful language, can reduce the accuracy of large language models deployed to make treatment recommendations. The models were also consistently less accurate for female patients, even when all gender markers were removed from the text.
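To give a rough sense of the kind of perturbation being described, here is a minimal sketch (my own illustration, not the study's code) that injects typos and stray whitespace into an otherwise identical patient message; `query_model` is a hypothetical wrapper around whatever LLM is being evaluated.

```python
import random
import re

def add_typos(text, rate=0.3, seed=0):
    # Swap two adjacent characters in a random fraction of longer words.
    rng = random.Random(seed)
    words = text.split(" ")
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def add_whitespace(text, rate=0.2, seed=0):
    # Double some of the spaces between words.
    rng = random.Random(seed)
    return re.sub(" ", lambda m: "  " if rng.random() < rate else " ", text)

baseline = "I have had a persistent cough and a mild fever for five days."
perturbed = add_whitespace(add_typos(baseline))

print(baseline)
print(perturbed)
# The study's point is that a model's treatment recommendation can differ
# between the two, even though the clinical content is identical:
# print(query_model(baseline), query_model(perturbed))  # query_model is hypothetical
```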
Why are they… why are they having autocomplete recommend medical treatment? There are specialized AI algorithms that already exist for this purpose and do it far better (though still not well enough to even assist real doctors, much less replace them).
Are there any studies (or benchmarks) that measure the accuracy of treatment recommendations given a medical history and a condition requiring treatment?