3.2 Evaluation and validation
Confidence that artificial intelligence (AI) technologies work in real-world clinical settings.
This section discusses the evidence required, and the processes used, to determine how well an AI technology performs according to its intended use (referred to as efficacy). Interviewees for this research highlighted that both contribute to establishing the trustworthiness of, and hence increasing confidence in, AI technologies.
Formal requirements for the evidence needed for AI technologies are still being developed.
Evidence of an AI technology’s efficacy can be captured at several stages as the technology progresses through development and deployment:
- Internal validation: The AI model is tested by its developer using a separate validation data set, often split from the same source as the training data set. It generally uses retrospective data sets (data that has been collected in the past).
- External validation: The AI model is tested with data from a different source to the training data set. This tests the generalisability of the model’s performance to clinically relevant scenarios, ensuring the performance that has been validated internally is maintained. External validation may be performed by the AI developer, or independently by a third party (a brief sketch contrasting internal and external validation follows this list).
- Local validation: In some cases, limited further external validation may be needed as part of deploying AI in a local setting, to ensure its performance translates to the local data, patient populations and clinical scenarios. Procurement or commissioning entities may decide that the available evidence for an AI technology does not provide sufficient confidence that its performance will be acceptable in the local situation (local validation is discussed in section 4.3).
- Prospective clinical studies: The AI model is tested in a real-world clinical setting using data collected in real time. This evaluation determines whether the technology has benefits in terms of efficiency or patient outcomes. It involves testing the AI’s technical performance as well as its integration with clinical workflows.
- Ongoing monitoring: Healthcare settings should monitor the performance of AI algorithms in use to ensure there is no degradation due to population drift or technical factors elsewhere in the data pipeline.
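To make the first two stages concrete, the sketch below trains a model on development data, evaluates it on a held-out split of the same data (internal validation), and then evaluates it on data collected at a different site (external validation). It is a minimal illustration under stated assumptions: the file names, the choice of logistic regression and the use of AUC as the performance metric are placeholders, not steps prescribed by this report or by any regulator.

```python
# Minimal sketch of internal versus external validation for a binary classifier.
# All file names, the model choice and the metric are illustrative assumptions,
# not a pipeline prescribed by this report.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Development data from the original source site (placeholder file name).
dev = pd.read_csv("development_site_data.csv")
X, y = dev.drop(columns="outcome"), dev["outcome"]

# Internal validation: hold out a split from the same source as the training data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
internal_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# External validation: data collected at a different site (placeholder file name).
ext = pd.read_csv("external_site_data.csv")
external_auc = roc_auc_score(
    ext["outcome"], model.predict_proba(ext.drop(columns="outcome"))[:, 1]
)

print(f"Internal AUC: {internal_auc:.3f}, External AUC: {external_auc:.3f}")
```

A materially lower external figure than the internal one is the kind of generalisability gap discussed below.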
Research has shown that many AI models that perform well at the internal validation stage perform significantly worse at the external validation stage.28 This may be due to the model being insufficiently ‘generalisable’; in other words, the model does not replicate its success when given data different to its original source.
Further, products that perform well on both internal and external validation can perform poorly in prospective clinical studies.29 There can be many reasons for this poor performance, including human factors, technology infrastructure, and systems considerations.
Current approaches
Interviewees for this research noted that uncertainty about the evidence required for AI technologies can undermine the confidence of those commissioning and using them.
Current Medicines and Healthcare products Regulatory Agency (MHRA) guidelines for UK Conformity Assessment (UKCA) regulatory approval stipulate the need for internal validation testing. Although the requirement for external validation (using data from a different source to the training data) may be inferred from the UKCA clinical evaluation requirements, there is currently no requirement for this validation to be conducted by an independent third party.30 Prospective clinical studies are not currently a requirement for regulatory approval for software as a medical device (SaMD).
Interviewees for this research, supported by literature, observed that although most AI technologies being considered by NHS settings have gone through internal validation, few have been externally validated by third parties or undergone prospective clinical studies.31,32 Interviewees observed that very few industry innovators are committed to long-term prospective clinical studies (which demand time and monetary resources). Instead, many present regulatory certifications (such as UKCA or CE certification) as evidence of their technology’s performance.
Procurers of AI technologies and academics interviewed for this research suggested that the use of regulatory certification as evidence of product efficacy is inadequate for algorithms used in clinical decision making. They recommended that external validation and prospective clinical studies be required for UKCA approval, with evidence made publicly available.
Additional evidence is already required for AI used in national screening programmes. The UK National Screening Committee stipulates that AI screening technology must undergo external validation with further clinical evaluation being required in certain circumstances.33
Interviewees, supported by research, highlighted that prospective clinical studies are a particularly important driver for establishing confidence in AI technology that directly impacts clinical decisions or patient outcomes.9,34 These prospective studies can identify the system benefits, such as saving money, freeing up staff time for other tasks, or enabling a shift from inpatient to outpatient care.
Clinician interviewees for this research perceived that AI technologies used in patient care should be held to the same evidence standards as other medical interventions, such as pharmaceuticals. This includes systematic reviews, randomised controlled trials (RCTs) and peer-reviewed research. Revision and redesign of existing information technology (IT) and data governance infrastructures may be needed to support this level of evidence generation.
In contrast, industry innovators interviewed for this research expressed concerns that traditional prospective clinical study methods such as RCTs require significant financial investment and long timescales. They suggested that alternative evidence generation methods could be considered to ensure the healthcare system can benefit from new technologies sooner.
Several standards and tools have been, or are being, developed for medical devices and clinical research to address variation in evidence standards for AI products. Box 3 provides a few examples.
The limits of universal approaches to evaluating and validating AI
Interviewees for this research noted that while evidence standards set at the national or international levels can provide confidence in product efficacy, they will not negate the need for local validation of AI technologies.
The performance of AI systems can be strongly dependent on the details of the training data cohort and its similarity to the local cohort and situation.44 An AI technology can behave differently in a new context either in terms of technical performance or clinical impact. Therefore, when an AI technology is being deployed in a particular location or for a particular use case, additional local validation may be expected.
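As an illustration of why cohort similarity matters, the sketch below compares the distribution of a single patient characteristic (age) between a hypothetical training cohort and a hypothetical local cohort using the population stability index. The synthetic data, the choice of metric and the threshold mentioned in the comments are illustrative conventions only; they are not drawn from this report or from any NHS guidance.

```python
# Illustrative sketch only: checking whether one patient characteristic (age)
# is distributed differently in a local cohort than in the cohort the model
# was developed on, using the population stability index (PSI). The synthetic
# data, the binning and the threshold are common conventions, not requirements
# from this report.
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Compare two samples of a continuous variable via binned proportions."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_prop = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_prop = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero or log(0) for empty bins.
    expected_prop = np.clip(expected_prop, 1e-6, None)
    actual_prop = np.clip(actual_prop, 1e-6, None)
    return float(np.sum((actual_prop - expected_prop) * np.log(actual_prop / expected_prop)))

# Synthetic placeholder cohorts: ages in the training data versus the local population.
training_ages = np.random.default_rng(0).normal(55, 12, 5000)
local_ages = np.random.default_rng(1).normal(67, 10, 800)

psi = population_stability_index(training_ages, local_ages)
# Values above roughly 0.25 are often read as a substantial shift, suggesting
# further local validation before relying on the model's reported performance.
print(f"PSI for age: {psi:.2f}")
```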
This suggests the need for local skills and capabilities to assess whether local validation of an AI technology is desirable and appropriate, and to conduct this validation where necessary. Some NHS Trusts have already established specialised teams to coordinate local validation of AI technologies, as also discussed in section 4.3.
Evaluation and validation - Key confidence insights
- The evidence and the methods used to determine how well an AI technology performs according to its intended use contribute to confidence in AI technologies. These include internal validation, external validation, local validation, and prospective clinical studies.
- Prospective clinical studies are a particularly important driver for securing confidence in AI technologies that directly impact clinical decisions or patient outcomes.
- Healthcare workers expect AI technologies to be held to the same evidence standards as other medical interventions such as pharmaceuticals.
- Clear guidance is needed on the evidence required for AI technologies currently deployed in the NHS. The NICE evidence standards are being updated to include AI-specific guidance.
- The use of regulatory certification as evidence of product efficacy is inadequate for algorithms used in clinical decision making.
- To support evidence generation, IT and governance infrastructures within the NHS will need to support prospective research/clinical trials that involve AI technologies.
References
28 Hwang EJ, Park S, Jin KN, et al. Development and Validation of a Deep Learning-Based Automated Detection Algorithm for Major Thoracic Diseases on Chest Radiographs. JAMA Netw Open. 2019;2(3):e191095. doi:10.1001/jamanetworkopen.2019.1095
29 Beede E, Baylor E, Hersch F, et al. A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy. In: Conference on Human Factors in Computing Systems - Proceedings. 2020. doi:10.1145/3313831.3376718
30 HM Government. The medical devices regulations 2002. 2002;(618):1-40. https://www.legislation.gov.uk/uksi/2002/618/contents/made. Accessed March 7, 2022.
31 Nagendran M, Chen Y, Lovejoy CA, et al. Artificial intelligence versus clinicians: Systematic review of design, reporting standards, and claims of deep learning studies in medical imaging. BMJ. 2020;368. doi:10.1136/bmj.m689
32 Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Heal. 2019;1(6):e271-e297. doi:10.1016/S2589-7500(19)30123-2
33 Interim guidance on incorporating artificial intelligence into the NHS Breast Screening Programme. Gov.uk. https://www.gov.uk/government/publications/artificial-intelligence-in-the-nhs-breast-screening-programme/interim-guidance-on-incorporating-artificial-intelligence-into-the-nhs-breast-screening-programme. Published 2021. Accessed March 7, 2022.
9 Sinha S, Al Huraimel K. Transforming Healthcare with AI. In: Reimagining Businesses with AI; 2020:33-54. doi:10.1002/9781119709183.ch3
34 Gille F, Jobin A, Ienca M. What we talk about when we talk about trust: Theory of trust for AI in healthcare. Intell Med. 2020;1-2:100001. doi:10.1016/j.ibmed.2020.100001
35 NICE. Evidence standards framework for digital health technologies. 2019. https://www.nice.org.uk/about/what-we-do/our-programmes/evidence-standards-framework-for-digital-health-technologies. Accessed March 7, 2022.
36 NICE. NICE META Tool. https://meta.nice.org.uk/. Published 2021. Accessed March 7, 2022.
37 Collins GS, Dhiman P, Andaur Navarro CL, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11(7):e048008. doi:10.1136/bmjopen-2020-048008
38 Sounderajah V, Ashrafian H, Golub RM, et al. Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: The STARD-AI protocol. BMJ Open. 2021;11(6):e047709. doi:10.1136/bmjopen-2020-047709
39 Rivera SC, Liu X, Chan AW, Denniston AK, Calvert MJ. Guidelines for clinical trial protocols for interventions involving artificial intelligence: The SPIRIT-AI Extension. BMJ. 2020;370. doi:10.1136/bmj.m3210
40 Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. 2020;26(9):1364-1374. doi:10.1038/s41591-020-1034-x
41 Vasey B, Clifton DA, Collins GS, et al. DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence. Nat Med. 2021;27(2):186-187. doi:10.1038/s41591-021-01229-5
42 Sounderajah V, Ashrafian H, Rose S, et al. A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI. Nat Med. 2021;27(10):1663-1665. doi:10.1038/s41591-021-01517-0
43 STANDING Together Working Group. STANDING together. 2021. https://www.datadiversity.org/. Accessed March 8, 2022.
44 González-Gonzalo C, Thee EF, Klaver CCW, et al. Trustworthy AI: Closing the gap between development and integration of AI systems in ophthalmic practice. Prog Retin Eye Res. December 2021:101034. doi:10.1016/j.preteyeres.2021.101034