Large language models (LLMs) are artificial intelligence (AI) algorithms that are trained on vast amounts of data to learn patterns that enable them to generate human-like responses. Reasoning models are LLMs with the added capability of working through problems step by step before responding, thus mirroring structured thinking. Such AI systems have performed well in assessments of medical knowledge, but whether they can match physician-level clinical reasoning on authentic diagnostic tasks remains largely unknown. On page 524 of this issue, Brodeur et al. (1) demonstrate that AI can now seemingly match or exceed physician-level clinical diagnostic reasoning on text-based scenarios, as measured against physician performance on clinical vignettes and real-world emergency cases. The findings indicate an urgent need to understand how these tools can be safely integrated into clinical workflows, and a readiness for prospective evaluation alongside clinicians.
AI has the potential to support a broad range of health care applications, from clinical decision-making to medical education and the provision of patient-facing health information. LLMs have passed medical licensing examinations and performed well on structured clinical assessments, raising the prospect that they could help alleviate global health care workforce shortages. However, passing examinations is not the same as being a doctor, and demonstrating physician-level performance on authentic clinical tasks is a fundamentally harder challenge (2).
Brodeur et al. evaluated OpenAI’s first reasoning model, o1-preview (released in September 2024), across five experiments that assessed diagnostic performance on clinical case vignettes against physician and prior-model baselines. A sixth experiment compared o1 with prior models and with physicians across three diagnostic touchpoints on 76 actual emergency department cases. Across the experiments, the o1 models substantially outperformed prior-generation nonreasoning LLMs (e.g., GPT-4) and, in many cases, the physicians themselves.
......