Confidence that artificial intelligence (AI) technologies work in real-world clinical settings.

This section discusses the evidence required, and the processes used, to determine how well an artificial intelligence (AI) technology performs according to its intended use (referred to as efficacy). Interviewees for this research highlighted that both contribute to establishing the trustworthiness of, and hence increasing confidence in, AI technologies.

Formal requirements for the evidence supporting AI technologies are still being developed.

Evidence of an AI technology’s efficacy can be captured at several stages, which it can progress through during its development and deployment (a brief illustrative sketch of the first two stages follows the list below).

  1. Internal validation: The AI model is tested by its developer using a separate validation data set, often split from the same source as the training data set. It generally uses retrospective data sets (data that has been collected in the past).
  2. External validation: The AI model is tested with data from a different source to the training data set. It tests the generalisability of the model’s performance to clinically relevant scenarios, ensuring the performance that has been validated internally is maintained. External validation may be performed by the AI developer, or independently by a third party.
  3. Local validation: In some cases, limited further external validation may be needed as part of deploying AI in a local setting, to ensure its performance translates to the local data, patient populations and clinical scenarios. Procurement or commissioning entities may decide that the available evidence for an AI technology does not provide sufficient confidence that its performance will be acceptable in the local situation (local validation is discussed in section 4.3).
  4. Prospective clinical studies: The AI model is tested in a real-world clinical setting using data collected in real time. This evaluation determines whether the technology has benefits in terms of efficiency or patient outcomes. It involves testing the AI’s technical performance as well as its integration with clinical workflows.
  5. Ongoing monitoring: Healthcare settings should monitor the performance of AI algorithms in use to ensure there is no degradation due to population drift or technical factors elsewhere in the data pipeline.
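For illustration only, the minimal sketch below shows how the first two stages (internal and external validation) of a simple classification model might look in code. It assumes scikit-learn and two hypothetical retrospective tabular data sets, one from the development site and one from a different site; the file and column names are placeholders, not a prescribed method.

```python
# Minimal sketch of internal vs external validation for a binary classifier.
# Data set file names and the "outcome" column are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Internal validation: hold out a split from the SAME source as the training data.
development = pd.read_csv("site_a_retrospective.csv")   # hypothetical development data set
X = development.drop(columns=["outcome"])
y = development["outcome"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
internal_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# External validation: evaluate, without retraining, on data from a DIFFERENT source
# (for example another hospital) to test generalisability.
external = pd.read_csv("site_b_retrospective.csv")      # hypothetical external data set
X_ext = external.drop(columns=["outcome"])
y_ext = external["outcome"]
external_auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])

print(f"Internal validation AUC: {internal_auc:.3f}")
print(f"External validation AUC: {external_auc:.3f}")
# A marked drop from the internal to the external figure suggests limited generalisability.
```

A large gap between the two figures is the kind of generalisability problem described in the research discussed below.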

Research has shown that many AI models that perform well at the internal validation stage perform significantly worse at the external validation stage.28 This may be because the model is insufficiently ‘generalisable’; in other words, it does not replicate its success when given data from a source different to its training data.

Further, products that perform well on both internal and external validation can perform poorly in prospective clinical studies.29 This poor performance can have many causes, including human factors, technology infrastructure and wider systems considerations.

Current approaches

Interviewees for this research noted that uncertainty in the appropriate evidence required for AI technologies can have a negative impact on the confidence of those commissioning and using AI technologies.

Current Medicines and Healthcare products Regulatory Agency (MHRA) guidelines for UK Conformity Assessment (UKCA) regulatory approval stipulate the need for internal validation testing. Although the requirement for external validation (using data from a different source to the training data) may be inferred from the UKCA clinical evaluation requirements, there is currently no requirement for this validation to be conducted by an independent third party.30 Prospective clinical studies are not currently a requirement for regulatory approval for software as a medical device (SaMD).

Interviewees for this research, supported by literature, observed that although most AI technologies being considered by NHS settings have gone through internal validation, few have been externally validated by third parties or undergone prospective clinical studies.31,32 Interviewees observed that very few industry innovators are committed to long-term prospective clinical studies (which demand time and monetary resources). Instead, many present regulatory certifications (such as UKCA or CE certification) as evidence of their technology’s performance.

Procurers of AI technologies and academics interviewed for this research suggested that the use of regulatory certification as evidence of product efficacy is inadequate for algorithms used in clinical decision making. They recommended that external validation and prospective clinical studies be required for UKCA approval, with evidence made publicly available.

Additional evidence is already required for AI used in national screening programmes. The UK National Screening Committee stipulates that AI screening technology must undergo external validation, with further clinical evaluation required in certain circumstances.33

Interviewees, supported by research, highlighted that prospective clinical studies are a particularly important driver for establishing confidence in AI technology that directly impacts clinical decisions or patient outcomes.9,34 These prospective studies can identify system benefits, such as saving money, freeing up staff time for other tasks, or enabling a shift from inpatient to outpatient care.

Clinician interviewees for this research perceived that AI technologies used in patient care should be held to the same evidence standards as other medical interventions, such as pharmaceuticals. This includes systematic reviews, randomised controlled trials (RCTs) and peer-reviewed research. Revision and redesign of existing information technology (IT) and data governance infrastructures may be needed to support this level of evidence generation.

Conversely, industry innovators interviewed for this research expressed concerns that traditional prospective clinical study methods such as RCTs require significant financial investment and long timescales. They suggested that alternative evidence generation methods could be considered to ensure the healthcare system can benefit from new technologies sooner rather than later.

Several standards and tools have been, or are being, developed for medical devices and clinical research to address the issue of varying evidence standards for AI products. Box 3 provides a few examples.

Box 3: Evaluation and validation standards and tools

The National Institute for Health and Care Excellence (NICE) evidence standards framework

The NICE evidence standards framework35 for digital health technologies is being updated to include AI-specific evidence guidelines dependent on the level of clinical risk associated with the technology. These aim to demonstrate that AI technologies are clinically effective and offer economic value, and could become the benchmark of evidence for AI technologies used in the NHS.

META tool

The META (MedTech Early Technical Assessment) Tool36 has been developed by NICE in collaboration with Health Innovation Manchester. Developers of AI technologies can use it to identify gaps in the evidence generation required to demonstrate their value to the NHS. It can help products progress to a successful application to the NICE Medical Technologies Evaluation Programme or Diagnostics Assessment Programme.

HealthTech Connect

HealthTech Connect is a secure online system managed by NICE, with funding from NHS England and NHS Improvement, which is designed to identify and support health technologies as they move from inception through to adoption. A range of partner organisations contribute including the Academic Health Science Network (AHSN), Office for Life Sciences, the MHRA, industry associations (ABHI, AXREM, BIVDA), the National Institute for Health Research (NIHR) and NHS Clinical Commissioners. The system helps companies to understand what information is needed by decision makers in the UK health and care system (including levels of evidence) and clarifies possible routes to market access.

AHSN innovation exchange

The AHSN innovation exchange connects researchers, the life sciences industry and healthcare. One of its four structured elements relates to the validation of AI products in clinical settings. Through this process, AHSNs are developing a consistent approach to validate AI technologies.

AI evaluation and clinical trial reporting guidelines and tools

Several guidelines and tools have been developed for conducting and reporting external evaluation studies and prospective clinical trials for AI technologies. These are the result of an international collaborative effort to improve the transparency and completeness of reporting of evaluations and clinical trials of AI interventions. A few examples, some currently in development, include:

  • TRIPOD-AI37 (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis-Artificial Intelligence)
  • STARD-AI38 (Standards for Reporting of Diagnostic Accuracy Studies-AI)
  • PROBAST-AI37 (Prediction model Risk Of Bias ASsessment Tool-AI)
  • SPIRIT-AI39 (Standard Protocol Items: Recommendations for Interventional Trials-AI)
  • CONSORT-AI40 (Consolidated Standards of Reporting Trials-AI)
  • DECIDE-AI41 (Developmental and Exploratory Clinical Investigation of DEcision-support systems driven by Artificial Intelligence)
  • QUADAS-AI42 (QUality Assessment tool for artificial intelligence-centered Diagnostic test Accuracy Studies)
  • STANDING together43 (STANdards for data Diversity, INclusivity and Generalisability)

The limits of universal approaches to evaluating and validating AI

Interviewees for this research noted that while evidence standards set at the national or international levels can provide confidence in product efficacy, they will not negate the need for local validation of AI technologies.

The performance of AI systems can be strongly dependent on the details of the training data cohort and its similarity to the local cohort and situation.44 An AI technology can behave differently in a new context, in terms of either its technical performance or its clinical impact. Therefore, when an AI technology is being deployed in a particular location or for a particular use case, additional local validation may be expected.
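As a rough illustration of what local validation and ongoing monitoring might involve in practice, the sketch below re-evaluates a procured model on a local retrospective cohort and tracks the same metric over time. It is a minimal sketch under stated assumptions: the file names, column names, reported performance figure and tolerance are hypothetical, and real evaluations would use locally agreed metrics and governance processes.

```python
# Illustrative local validation and monitoring sketch for a procured AI model.
# File names, column names, the reported AUC and the tolerance are all hypothetical.
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

REPORTED_EXTERNAL_AUC = 0.88   # figure quoted in the supplier's external validation evidence (hypothetical)
ACCEPTABLE_SHORTFALL = 0.05    # locally agreed tolerance before escalation (hypothetical)

model = joblib.load("vendor_model.pkl")                  # hypothetical serialised model supplied for evaluation
local = pd.read_csv("local_retrospective_cohort.csv")    # hypothetical local retrospective cohort

X_local = local.drop(columns=["outcome", "encounter_date"])
y_local = local["outcome"]

# Local validation: check whether the claimed performance holds on local data.
local_auc = roc_auc_score(y_local, model.predict_proba(X_local)[:, 1])
print(f"Local AUC: {local_auc:.3f} (reported external AUC: {REPORTED_EXTERNAL_AUC:.2f})")
if local_auc < REPORTED_EXTERNAL_AUC - ACCEPTABLE_SHORTFALL:
    print("Performance shortfall on local data: review before clinical deployment.")

# Ongoing monitoring: track the same metric over time to spot degradation,
# for example from population drift or upstream changes in the data pipeline.
local["month"] = pd.to_datetime(local["encounter_date"]).dt.to_period("M")
for month, group in local.groupby("month"):
    if group["outcome"].nunique() < 2:
        continue  # AUC is undefined when only one outcome class is present
    X_month = group.drop(columns=["outcome", "encounter_date", "month"])
    month_auc = roc_auc_score(group["outcome"], model.predict_proba(X_month)[:, 1])
    print(f"{month}: AUC {month_auc:.3f}")
```

The tolerance used here is purely illustrative; what counts as acceptable performance would be defined locally by clinical, information governance and safety teams.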

This suggests the need for local skills and capabilities to assess whether local validation of an AI technology is desirable and appropriate, and to conduct this validation where necessary. Some NHS Trusts have already established specialised teams to coordinate local validation of AI technologies, as also discussed in section 4.3.


Evaluation and validation - Key confidence insights

  • The evidence and the methods used to determine how well an AI technology performs according to its intended use contribute to confidence in AI technologies. These include internal validation, external validation, local validation, and prospective clinical studies.
  • Prospective clinical studies are a particularly important driver for securing confidence in AI technologies that directly impact clinical decisions or patient outcomes.
  • Healthcare workers expect AI technologies to be held to the same evidence standards as other medical interventions such as pharmaceuticals.
  • Clear guidance is needed on the evidence required for AI technologies currently deployed in the NHS. The NICE evidence standards are being updated to include AI-specific guidance.
  • The use of regulatory certification as evidence of product efficacy is inadequate for algorithms used in clinical decision making.
  • To support evidence generation, IT and governance infrastructures within the NHS will need to support prospective research/clinical trials that involve AI technologies.

References

28 Hwang EJ, Park S, Jin KN, et al. Development and Validation of a Deep Learning-Based Automated Detection Algorithm for Major Thoracic Diseases on Chest Radiographs. JAMA Netw Open. 2019;2(3):e191095. doi:10.1001/jamanetworkopen.2019.1095

29 Beede E, Baylor E, Hersch F, et al. A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy. In: Conference on Human Factors in Computing Systems - Proceedings. 2020. doi:10.1145/3313831.3376718

30 HM Government. The medical devices regulations 2002. 2002;(618):1-40. https://www.legislation.gov.uk/uksi/2002/618/contents/made. Accessed March 7, 2022.

31 Nagendran M, Chen Y, Lovejoy CA, et al. Artificial intelligence versus clinicians: Systematic review of design, reporting standards, and claims of deep learning studies in medical imaging. BMJ. 2020;368. doi:10.1136/bmj.m689

32 Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1(6):e271-e297. doi:10.1016/S2589-7500(19)30123-2

33 Interim guidance on incorporating artificial intelligence into the NHS Breast Screening Programme. Gov.uk. https://www.gov.uk/government/publications/artificial-intelligence-in-the-nhs-breast-screening-programme/interim-guidance-on-incorporating-artificial-intelligence-into-the-nhs-breast-screening-programme. Published 2021. Accessed March 7, 2022.

9 Sinha S, Al Huraimel K. Transforming Healthcare with AI. In: Reimagining Businesses with AI; 2020:33-54. doi:10.1002/9781119709183.ch3

34 Gille F, Jobin A, Ienca M. What we talk about when we talk about trust: Theory of trust for AI in healthcare. Intell Med. 2020;1-2:100001. doi:10.1016/j.ibmed.2020.100001

35 NICE. Evidence standards framework for digital health technologies. 2019. https://www.nice.org.uk/about/what-we-do/our-programmes/evidence-standards-framework-for-digital-health-technologies. Accessed March 7, 2022.

36 NICE. NICE META Tool. https://meta.nice.org.uk/. Published 2021. Accessed March 7, 2022.

37 Collins GS, Dhiman P, Andaur Navarro CL, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11(7):e048008. doi:10.1136/bmjopen-2020-048008

38 Sounderajah V, Ashrafian H, Golub RM, et al. Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: The STARD-AI protocol. BMJ Open. 2021;11(6):e047709. doi:10.1136/bmjopen-2020-047709

39 Rivera SC, Liu X, Chan AW, Denniston AK, Calvert MJ. Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI Extension. BMJ. 2020;370. doi:10.1136/bmj.m3210

40 Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. 2020;26(9):1364-1374. doi:10.1038/s41591-020-1034-x

41 Vasey B, Clifton DA, Collins GS, et al. DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence. Nat Med. 2021;27(2):186-187. doi:10.1038/s41591-021-01229-5

42 Sounderajah V, Ashrafian H, Rose S, et al. A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI. Nat Med. 2021;27(10):1663-1665. doi:10.1038/s41591-021-01517-0

43 STANDING Together Working Group. STANDING together. 2021. https://www.datadiversity.org/. Accessed March 8, 2022.

44 González-Gonzalo C, Thee EF, Klaver CCW, et al. Trustworthy AI: Closing the gap between development and integration of AI systems in ophthalmic practice. Prog Retin Eye Res. December 2021:101034. doi:10.1016/j.preteyeres.2021.101034
