This section outlines factors that can affect a clinician’s confidence in artificial intelligence (AI) during clinical reasoning and decision making (CRDM).

These factors involve complex and nuanced interactions between a clinician’s personal attitudes and experiences, the clinical context, and the characteristics of the AI models themselves.

5.2.1 Impact of individual attitudes and experiences

Interviewees for this research noted that personal experiences and attitudes to innovation and AI technologies can strongly affect a clinician’s confidence in using AI technologies.

These factors correlate well with those identified in previous research concerning trust and AI technologies:4,34,79

  • General digital literacy. Digital literacy is a key enabler in the adoption of new technologies. Literacy can vary across age groups, professional groups and places of work, with younger professionals in larger centres more likely to have good digital literacy in general.5
  • Familiarity with technology and computer systems in the workplace, and past experiences with AI or other innovations. Those more familiar with AI are generally more positive about its use in healthcare. However, a little familiarity can be dangerous, if it leads to an unquestioning preference for AI-derived information.2,34,79
  • Personal attitudes towards technological innovations at work, including the perceived risks to an individual’s role through automation. Early computerised clinical tools often performed poorly and generated more work for clinicians, leading to technology scepticism amongst some more experienced clinicians. Hesitance amongst healthcare workers may also be driven by media scepticism, reports of biased AI and data privacy concerns, as well as fears of losing their roles and responsibilities.79
  • Ethical concerns around fairness and bias in the development and use of AI in healthcare. Several interviewees noted that these concerns can also affect confidence in these technologies.

Research has shown that clinicians are more likely to trust AI-derived information that does not relate to their area of expertise.80 Conversely, experts (specialists in their area of care) tend to be more sceptical, questioning AI-derived information more than generalists or more junior colleagues.68

Some experts interviewed for this research suggested that they can perceive AI-derived suggestions as ‘equivalent at best’ to their judgement, and potentially inferior in many situations. They may also have different perspectives on the impact of different decisions, compared to non-specialists. For example, they may weigh the importance of the AI-derived information differently when judging the performance of the AI technology on clinically relevant endpoints (for example, impact on treatment or outcome), rather than on the numerical accuracy of the AI prediction itself.

However, it is important to note that the performance of human experts measured in trials or evaluation settings is rarely maintained with consistency in routine practice, due to quantity of work, fatigue and external distractions.81 As AI technologies do not suffer these human limitations, they may offer a benefit even in situations where experts currently perform these tasks.

Importantly, in many situations, not all clinical decisions are made by experts, and certainly not in the first instance. Supporting decision making in this context with AI-derived information could assist non-specialist clinicians in pressured scenarios and improve the quality and consistency of patient care. However, this may lead to non-specialists no longer having the opportunities to develop specialist skills, potentially resulting in de-skilling and a future skills shortage.

Interviewees for this research cautioned that non-specialist or junior clinicians who need to make decisions on treatments and conditions with which they have limited experience may tend to willingly accept and favour the support provided by an AI technology. This creates the potential for uncritical adoption of AI predictions and resulting bias, a finding in agreement with previously published research.82

These variations in clinician experience and attitudes suggest several challenges.

  • Experts may be sceptical of the benefits of AI to their work, leading to missed opportunities to improve consistency and quality in CRDM.
  • Less experienced clinicians are potentially susceptible to inappropriately high levels of confidence in AI-derived information and may require education to be able to critically appraise this information.
  • Non-expert groups may come to rely on technologies that minimise the stress and complexity of their decision making, in place of developing their skills and ongoing learning, which could lead to de-skilling of parts of the workforce.83

5.2.2 Impact of clinical context

The clinical context in which AI is used can influence confidence in the technologies and in the derived information, including in relation to the levels of clinical risk and human involvement.

Level of clinical risk

The level of involved clinical risk can influence confidence in AI technologies. Interviewees for this research predicted that healthcare workers will be more confident in using AI technologies that have lower clinical consequences, such as tools that prioritise certain cases for urgent review, but do not alter the clinical input that any patient ultimately receives. This is consistent with other research demonstrating that clinicians were more positive about using AI for administrative tasks rather than clinical tasks.84

This propensity suggests that AI technologies with a higher risk of clinical consequences may be adopted more cautiously, and that healthcare workers may expect a higher threshold for evidence of the safety and efficacy of these technologies. Interviewees cautioned that, in such high-risk instances, it is important to take a holistic view of the clinical situation and consider an AI prediction as one contributor amongst many to clinical decision making.

Level of clinician oversight

Feedback from healthcare workers and the public in other research suggests that AI-assisted CRDM technologies are more acceptable than technologies that act autonomously (referred to as autonomous AI), without human involvement in each decision. Tools that perform triage and result in patients not receiving review by a human clinician are examples of autonomous AI and should therefore be considered ‘high risk’.

These views are associated with a preference for a ‘human touch’ in the provision of healthcare services, both in relation to decision making and communication with patients.85,86

The preference for AI-assisted CRDM rather than autonomous AI may have contributed, along with commercial considerations, to a trend observed through the interviews conducted for this research: that industry innovators are following a ‘human oversight’ approach when developing AI technologies. For example, AI technologies used for screening tasks are designed to be deployed alongside human graders rather than to replace them.

The reluctance to develop and deploy fully autonomous AI may also be due in part to the limited available evidence supporting their clinical efficacy. Unresolved governance issues, including regulation and liability for autonomous AI technologies, may add to this uncertainty, making organisations and individuals understandably more risk-averse in this context. These external factors are further discussed in sections 3.1 and 3.4.

5.2.3 Impact of AI model design

Interviewees for this research described various characteristics of AI models that can influence confidence. These are described in detail in this section.

Interviewees noted that most healthcare workers do not have sufficient knowledge and understanding to ask meaningful questions about the features presented in this section. This suggests that part of the educational challenge around AI for CRDM is to equip clinicians to ask these questions, and to understand the significance of the answers across various clinical scenarios.

The nature and presentation of AI-derived information for CRDM

There are certain characteristics of AI-derived information that can make an assessment of appropriate confidence during CRDM challenging.

All AI systems use probabilistic methods to make predictions, regardless of whether they present that prediction categorically (for example, diagnosis A versus diagnosis B) or as a probability (for example, by providing the predicted probability of a particular diagnosis). While some conventional types of clinical information are also understood using probability (for example, a test result that has an associated sensitivity and specificity), AI technologies involve more complex and subtle statistical analyses that are harder to interpret and communicate effectively (described as the ‘black box’ problem of AI). This can lead to a reduced ability to understand the implications of a probabilistic output, and complicate determining the appropriate confidence in the information.87
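
To illustrate the distinction, the following is a minimal sketch (the function name and the 0.5 threshold are illustrative assumptions, not taken from any particular product) of how the same underlying probabilistic output can be presented either as a categorical label or as a probability:

```python
# A minimal sketch: the same probabilistic output shown categorically or as a
# probability. The 0.5 threshold and names are illustrative assumptions only.

def present_prediction(p_diagnosis_a: float, categorical: bool = True) -> str:
    """Return the AI-derived information as it would be shown to the clinician."""
    if categorical:
        # The probability is collapsed into a label; the underlying uncertainty
        # (for example, 0.51 versus 0.99) is no longer visible to the user.
        return "diagnosis A" if p_diagnosis_a >= 0.5 else "diagnosis B"
    # The probabilistic form preserves the model's estimated uncertainty.
    return f"predicted probability of diagnosis A: {p_diagnosis_a:.0%}"

print(present_prediction(0.51))                     # diagnosis A
print(present_prediction(0.51, categorical=False))  # predicted probability of diagnosis A: 51%
```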

AI technologies can be designed to include certain features with the potential to demystify how they make predictions and provide some degree of confidence in their reliability. These include uncertainty quantification, outlier detection and clarity about the scope of input data, as described in Box 5.

Box 5: Features that can improve confidence in how AI algorithms make predictions

Uncertainty quantification

AI technologies may quantify indicators for uncertainty in their predictions. This can be shown in various ways, including confidence statements, scales, scores or even raw probabilities. These different ways of presenting uncertainty quantification can influence how users perceive and act on that information.

Clinical example: An AI product is designed to diagnose pneumonia on a chest X-ray. When it provides its report, it states how confident it is that pneumonia is present on the X-ray. One user may see the output ‘diagnosis: pneumonia (high probability)’, whereas another may receive ‘diagnosis: pneumonia (low probability)’.

The value of showing uncertainty quantification and the form in which it is displayed will vary depending on the task and the clinical setting. More research is needed to investigate how displaying uncertainty quantification can impact clinical decision making, and product-specific guidance may be required to assist clinicians to interpret this information.
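
As one illustration of how such a statement could be produced, the following minimal sketch maps a raw predicted probability to a qualitative confidence band, as in the pneumonia example above; the band boundaries and wording are illustrative assumptions that would need to be chosen and validated for each product and task.

```python
# Minimal sketch: mapping a raw predicted probability to a qualitative
# confidence statement. The band boundaries (0.3 and 0.7) and the wording
# are illustrative assumptions only.

def confidence_statement(p_pneumonia: float) -> str:
    if p_pneumonia >= 0.7:
        band = "high probability"
    elif p_pneumonia >= 0.3:
        band = "intermediate probability"
    else:
        band = "low probability"
    return f"diagnosis: pneumonia ({band})"

print(confidence_statement(0.92))  # diagnosis: pneumonia (high probability)
print(confidence_statement(0.12))  # diagnosis: pneumonia (low probability)
```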

Data integrity

It is possible for an AI to be incorrect yet present high certainty in this incorrect output. This is more likely to occur when the input data lies further from the centre of the training data distribution, a situation known as ‘algorithmic brittleness’. Cases that are towards the limits of the distribution, known as edge cases, may occur because of continuously variable patient factors (for example, age, size, height or weight).

Cases that are extremely dissimilar (several standard deviations) from the training distribution are known as outliers and are at high risk of erroneous predictions. Technical factors (for example, different radiology equipment used to take images, image brightness, resolution etc), or categorical patient features (gender, ethnicity, surgery, co-morbidity) can commonly cause outliers. In the extreme case, this type of data could be considered invalid input for the algorithm.

Outlier detection can mitigate these situations by estimating the similarity of the clinical case to the training data distribution and alerting the user to dissimilar cases. AI technologies can be designed to include outlier detection, increasing user confidence by ensuring that the AI technology can highlight cases that may require additional clinical review and consideration.

Clinical example: An AI product has been designed to detect lung tumours in chest X-rays. A patient who has had a lobectomy is imaged. No lobectomy X-rays were included in the training data. The product identifies this image as unusual compared with its training data, and alerts the user that its prediction should be considered less reliable than usual for this case.
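
One simple way such outlier detection could work is sketched below. The use of a feature representation, the Mahalanobis distance and the 99th-percentile threshold are illustrative assumptions; deployed products may use more sophisticated out-of-distribution detection methods.

```python
# Minimal sketch of one possible outlier-detection approach: flag inputs whose
# feature representation lies far from the training data distribution, using
# the Mahalanobis distance. All design choices here are illustrative.

import numpy as np

class SimpleOutlierDetector:
    def fit(self, train_features: np.ndarray) -> "SimpleOutlierDetector":
        # Summarise the training distribution by its mean and covariance.
        self.mean = train_features.mean(axis=0)
        self.cov_inv = np.linalg.pinv(np.cov(train_features, rowvar=False))
        # Use a high percentile of the training distances as the alert threshold.
        self.threshold = np.percentile(self._distance(train_features), 99)
        return self

    def _distance(self, x: np.ndarray) -> np.ndarray:
        diff = x - self.mean
        return np.sqrt(np.einsum("ij,jk,ik->i", diff, self.cov_inv, diff))

    def is_outlier(self, x: np.ndarray) -> np.ndarray:
        # True where a case is dissimilar enough to warrant a warning, such as
        # the post-lobectomy chest X-ray in the example above.
        return self._distance(x) > self.threshold

# Usage: fit on features of the training cases, then check each new case.
rng = np.random.default_rng(0)
detector = SimpleOutlierDetector().fit(rng.normal(size=(1000, 8)))
print(detector.is_outlier(rng.normal(size=(3, 8)) + 6.0))  # expected: [ True  True  True]
```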

Scope of input data

Knowledge of the input data and the factors being considered by the AI’s computational methods can enhance user confidence. Without this information, healthcare workers may be uncertain about the value of AI-derived information for particular patients.

Clinical example: An AI product is designed to predict the risk of skin cancer based on images of skin lesions. A primary care physician uses the product to assess a lesion on a patient who has significant risk factors for skin cancer. The AI returns a ‘benign’ output. The clinician’s confidence in this output may be influenced by knowing which factors the AI has used to reach its prediction. For example, if the clinician does not know whether the AI has taken the patient’s risk factors into account, they may be more likely to disregard the AI output and refer the lesion for review by secondary care; if they know the AI has considered these factors, they may be more likely to reassure the patient about the ‘benign’ output.

Interviewees noted that the way AI-derived information is presented can influence a clinician’s confidence in that information. This suggests, as also supported by literature,88,89,90 that careful user interface design, and the way in which AI-derived information is presented, can support assessment of appropriate confidence during CRDM.

AI-assisted CRDM can present a categorically different type of information to other forms of input, such as a numerical test result or patient history.

An AI technology may present to the clinician a ‘risk score’, suggested diagnosis or course of action, which can be perceived as synthesised information, similar to an opinion or the end result of the CRDM process rather than a conventional piece of factual input information. A clinician can associate a test result with a diagnosis or risk level, weighing other patient factors in the process, whereas the AI-derived information may already incorporate some or all of this information. This can make assessing confidence in the AI-derived information more challenging, unless it is very clear what information has been considered by the algorithm, and what relative weighting it has been given.91

Example: A prostate cancer risk stratification AI, based on patient demographics and imaging, predicts a high risk that contradicts the clinician’s intuition. The patient is on an unusual medication that may affect the appearance of their prostate on imaging. The clinician, therefore, has reason to question the AI in this case and rightly has lower confidence in the AI risk score. They should default to making their own assessment of the imaging and associated patient factors.

Compared to those providing only categorical predictions, AI technologies that provide uncertainty quantifications can support confidence assessment for CRDM by giving users a sense of the algorithms’ internal certainty level. For example, if an algorithm is 80 per cent certain, then it should produce the correct prediction 80 per cent of the time. This is known as the probabilities being ‘well-calibrated’.

However, inaccurate calibration (for example, an algorithm estimating 80 per cent certainty, but when tested found to produce the correct prediction only 70 per cent of the time) has been shown to be common in AI, particularly with deep-neural networks.92 This is likely to lead to inappropriate levels of confidence, due to an impression that the probabilities will accurately indicate uncertainty for each case, when they may not. Therefore, before uncertainty quantification is used to support the assessment of appropriate confidence for AI-assisted CRDM, good calibration of these probabilities should be demonstrated in a validation cohort.
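
As an illustration, the following is a minimal sketch of how calibration might be checked on a validation cohort, by binning predicted probabilities and comparing the mean prediction in each bin with the observed frequency of the finding. The bin count and the input format are illustrative assumptions; `validation_probs` and `validation_labels` are hypothetical arrays.

```python
# Minimal sketch of a calibration check: group the model's predicted
# probabilities into bins and compare the mean predicted probability in each
# bin with the observed frequency of the finding. A well-calibrated model
# (for example, the finding present ~80% of the time where it predicts 80%)
# shows small gaps between the two.

import numpy as np

def calibration_report(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0  # expected calibration error (weighted mean gap across bins)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi)
        if not in_bin.any():
            continue
        predicted = probs[in_bin].mean()   # what the model claims, e.g. 0.80
        observed = labels[in_bin].mean()   # what actually happened, e.g. 0.70
        ece += in_bin.mean() * abs(predicted - observed)
        print(f"bin {lo:.1f}-{hi:.1f}: predicted {predicted:.2f}, observed {observed:.2f}")
    return ece

# Usage with a hypothetical validation cohort:
# ece = calibration_report(validation_probs, validation_labels)
```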

The impact of miscalibration in AI models has been shown to be exacerbated when non-expert users were involved in shared decision making, as the humans were unable to bring in enough unique knowledge to identify and correct the AI’s errors.93

Conversely, presenting AI predictions only categorically (for example, without uncertainty quantification) may lead to inappropriate assessments of this information. Interviewees were divided on whether categorical predictions were more likely to cause inappropriately high confidence (if the predictions are taken at face value) or inappropriately low confidence (due to a perceived lack of nuance). They hypothesised that individual attitudes towards AI may influence how categorical predictions are perceived, and suggested that decisions around uncertainty quantification and clarity of information must be balanced to suit the use case. Some interviewees noted that they would always value additional contextual information, whereas others preferred a decisive, categorical recommendation over a probability, and would rather the AI refuse to offer an opinion at all if its internal uncertainty was too high. It was suggested that clinicians under time pressure, and especially those at the boundaries of their expertise, may prefer a ‘decisive’ presentation of information.

Interviewees concluded that the ideal approach to presenting AI-derived information depends on the clinical context, and should be considered carefully by algorithm and software designers, as well as implementors in health and care settings. Users should also be involved in the AI’s evaluation and implementation processes to determine the best approach for the presentation of the AI-derived information. Additional research is needed to fully understand and optimise the nature and timing of presentation of AI-derived information for CRDM across the range of relevant clinical settings and scenarios.

Further, interviewees suggested that AI-derived information may best be considered as if provided by an external specialist (for example a radiologist or pathologist), who will also typically have access to some but not all of the relevant clinical information and will have a different locus of expertise to the clinician themselves. In conventional CRDM processes, clinicians understand when they have additional information that influences the diagnosis or treatment choice, above that which the reporting specialist had access to. This same approach could be followed to assess the appropriate level of confidence in AI-derived information.

Explainability

Research suggests that transparency in how AI computes and delivers an output, and the possibility of providing a ‘human-like explanation’ for the prediction (referred to as explainability), encourages higher levels of confidence in the technology.94 This conclusion was supported by interviewees, who highlighted a desire for explainability when designing and procuring AI technologies.

Some AI systems may be able to explicitly describe the importance being given to inputs or features of the data, enabling a direct explanation of their relative importance. However, many AI technologies, particularly those that use deep-learning, can be described as ‘black box’ as their algorithms do not inherently explain their ‘reasoning’. It has been argued that current machine-learning methods do not ‘reason’ like humans,95 but rather identify patterns in data and correlate them with previous human decisions or outcome labels, ‘learning’ the best predictive model for a human decision, but without the ’reasoning’ element. In machine learning, there is no guarantee that the features identified by the AI are the same ones the humans use in their reasoning processes, or that features that are shared with human observers are selected for the same reasons. There is a possibility that the AI may rely on features that are not appropriate to the task (such as laterality markers, or equipment characteristics in medical images),96 or detect features that are of completely unknown relevance to clinicians.

Explainable AI (XAI) involves approaches that attempt to ‘shine light’ into an AI ‘black-box’. XAI has been heralded as providing an intuitive picture of the features of the data and ‘reasoning’ that drive AI predictions, similar to asking a clinician what reason they have for a particular decision.97 

However, emerging evidence97-100 suggests that current XAI approaches to deep-learning models can be misleading for assessing individual predictions. The ‘explanations’ provided can bear little relation to whether an individual prediction was made for sound reasons.

XAI methods can only highlight salient parts of the input data and do not give clarity on the clinical relevance of the features to the individual prediction, or the ways in which they have been combined. XAI may therefore not correlate well with clinicians’ views of what an ‘explanation’ should be.98 There is some recent evidence99 that XAI techniques for image classification can actually be independent of the trained variables in the AI model (what the model has learned) and the data-label relationship (what the model tries to predict), instead merely highlighting ‘interesting’ parts of an image, such as edges and texturally rich areas which contain the most information. 

Further, research has found that labelling a product as XAI may make its predictions appear inappropriately trustworthy.100 XAI can potentially provide confidence that an algorithm is generally looking at the clinically relevant parts of an image (for example, lung boundaries for pneumothorax detection). However, XAI is not useful at the individual case level for determining the reliability of an AI prediction.100,101

Example: A chest X-ray diagnostic algorithm for pneumothorax with Grad-CAM XAI provides associated saliency maps, which highlight certain areas of the lung that are ‘important’ to the prediction. However, the same regions are identified for most patients, irrespective of whether pneumothorax is present or where it is located in the image. The saliency map is identifying the image regions that are important at a model level, but not for the individual patient in question.
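
For context, the following is a minimal sketch of how a Grad-CAM-style saliency map can be computed for a convolutional classifier, assuming a PyTorch model whose final convolutional layer is passed in by the caller; the function name and setup are illustrative, not a specific product’s implementation. As discussed above, such a map shows where the model’s attention falls, not whether the individual prediction is reliable.

```python
# Minimal sketch of Grad-CAM-style saliency for a convolutional classifier.
# Assumes a PyTorch model and that the caller passes in the final convolutional
# layer (for example, `model.features[-1]` in some architectures).

import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, conv_layer):
    activations, gradients = {}, {}

    def fwd_hook(module, inputs, output):
        activations["a"] = output
    def bwd_hook(module, grad_input, grad_output):
        gradients["g"] = grad_output[0]

    h1 = conv_layer.register_forward_hook(fwd_hook)
    h2 = conv_layer.register_full_backward_hook(bwd_hook)
    try:
        score = model(image.unsqueeze(0))[0, target_class]  # class score for this image
        model.zero_grad()
        score.backward()                                     # gradients w.r.t. the feature maps
    finally:
        h1.remove()
        h2.remove()

    # Weight each feature map by the spatial average of its gradient, combine,
    # and keep only positive contributions (ReLU), as in Grad-CAM.
    weights = gradients["g"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["a"]).sum(dim=1, keepdim=True))
    # Upsample to the input resolution so the map can be overlaid on the X-ray.
    return F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                         align_corners=False)[0, 0]
```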

Some progress has been made102 in creating AI for chest X-ray screening that can provide ‘human-like’ explanations. The aspects of an image that contribute to a prediction of malignancy are highlighted and described in terms of their similarity to previous known-malignant examples. This process mimics how a human observer might explain their reasoning. This approach fundamentally differs from the common approach of adding post hoc explainability to a deep-learning AI model through saliency maps. The approach required a complete re-imagining and re-engineering of the AI architecture and training, to ensure it inherently makes predictions using ‘human-like reasoning’.

Some interviewees for this research were more likely to be sceptical of a ‘black-box’ AI than of an XAI system. However, given the concerns around the suitability of XAI for individual case-level analysis, it may be inappropriate to consider XAI a panacea for clinical confidence assessment at the individual prediction level. For individual predictions, it may be safer to assume that no ‘explanation’ is possible and to treat the system as a ‘black box’. This approach would mitigate the risk of inappropriately high levels of confidence resulting from the ‘veneer of explainability’ provided by current XAI methods.100

Bias in AI models

Interviewees for this research were particularly concerned about the potential for bias in AI technologies, and how bias affects confidence in the performance of these technologies for different patients.

AI technologies are trained on data that reflect human decisions, which creates the possibility of reinforcing or perpetuating biased, unfair or unethical tendencies amongst human operators.103 For example, AI systems that triage patients or prioritise referrals based on historical human decisions need to be carefully designed and validated to mitigate biases arising from underlying access or communication challenges in certain patient demographics or cohorts.

Bias in AI algorithms has various sources,104 from bias in the demographic profile of the training data set, to the quality of the labelled data, the technical design of the model, and the team designing the algorithm.105 Box 6 provides some examples of these types of bias in AI models.

Box 6: Examples of AI biases

Data bias

Data bias is the root of much algorithmic bias in AI and can take several related forms.106 The most obvious is a simple imbalance of populations, or absence of a particular demographic, in a data set, leading to a lower performance for the minority group. However, the problems may be more subtle and complex. 

In an example107 of racial bias in a widely used risk-scoring algorithm, Black patients assigned the same level of risk by the algorithm were found to be sicker than White patients, reducing the number of Black patients identified for extra care by more than half. Bias occurred because the algorithm used health costs as a proxy for health needs. As less money is spent on Black patients who have the same level of need, due to health inequalities, the algorithm thus falsely concluded that Black patients are healthier than equally sick White patients and require less care. This is an example of poor AI design reinforcing and concretising health inequalities that are pre-existing in the current system.

A recent study108 examined the performance of a convolutional neural network classifier for skin lesion identification in a sub-Saharan African population. The researchers found overall very poor performance (accuracy of 17%) and determined this to be driven by data bias in two ways: firstly, the number of images of darker skin types included in training was low; and secondly, the disease burden in the African population differed significantly from that of the western Caucasian population used in model development. Hence, many conditions present in that location were underrepresented in, or absent from, the model.

Racial inequality is not the only risk deriving from data bias. Care must be taken around inequalities related to gender, age, (dis)ability, socio-economics, faith, geography, and any other demographic factors that could potentially disadvantage certain patients.

Some data sets used for healthcare AI model development may be completely inappropriate, due to confounding factors and a lack of appropriate data collection methods, leading to serious generalisability concerns. A recent meta-analysis of coronavirus (COVID-19) chest X-ray detection algorithms109 determined that many had been trained on publicly available ‘Frankenstein data sets’ where the positives were all from one healthcare site and the negatives from another. Most algorithms were found to be learning to identify positive cases based on non-clinical ‘shortcuts’ (for example, style of laterality marker or site-specific scanner characteristics) rather than clinical signs of COVID-19 pneumonia, leading to an almost complete lack of clinical usability when taken outside the development data set. This extreme situation goes beyond ‘bias’, but less severe examples could easily lead to uneven performance across sub-populations and hence strong biases.

A more subtle, but related form of data bias is in the application of well-validated AI technologies to a clinical situation that lies towards the edge of the algorithm’s experience. In such a case, the input data (patient cohort) is more likely to contain cases that are ‘outliers’ to the algorithm. This can lead to a dramatic fall in the performance of the AI technology, even following extensive external and local validation.110 In this context, a generally trustworthy AI algorithm may exhibit ‘brittleness’, by making untrustworthy predictions in some specific cases (the subset of patients who lie towards the edge or outside the training data distribution). This is a challenge that requires approaches such as outlier detection from the product point of view, as well as clinical vigilance from a user perspective (as discussed also in Box 5).

Team bias

The makeup of AI development teams can impact issues of fairness and bias in AI. The lack of gender and ethnic diversity in AI teams can contribute to perpetuating unconscious biases in research design and outcomes.105 This suggests that, ideally, teams should be diverse in their make-up and skillsets.

Other studies have recommended diversity and bias education for AI developers and embedding ethicists within AI development teams from the design phase onwards.111

Bias against user groups

Bias may occur in ways that affect the users of AI tools (healthcare workers) rather than patients directly. For example, speech recognition algorithms have been shown to be biased on gender112 and race,113 leading to reduced accuracy. In a healthcare setting, this increases the risk of error and hence has the potential to disadvantage or harm patients based not on their own protected characteristics, but on those of their clinician.

Reinforcement of outmoded practice

This bias can occur due to shifts in the landscape of evidence-based best practice. It is specific to recommendation engines (data filtering tools that suggest treatment strategies or care choices). Leveraging retrospective data for AI model training is attractive for practical and economic reasons, but treatment pathways and decisions which were made historically may no longer represent best practice. In this case, models may learn to recommend the out-of-date practice, and even up-to-date algorithms will not adapt to future changes in best practice. Model retraining and continuous evaluation against up-to-date standards is the only effective mitigation for this type of risk, but could be hampered by the lack of available, relevant data in the wake of a change in practice.114 

Potential biases in AI models and their implications for patient safety and fairness are a key reason that external validation is critical to building confidence and safety in AI for healthcare (as discussed in section 3.2). The concept of bias towards or against certain sub-populations in the clinical environment is closely related to the ‘generalisability’ of an algorithm,73 but it is important to dig deeper than the whole-cohort evaluation to determine whether certain groups may be disadvantaged or experience lower algorithm performance.
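
A minimal sketch of such a subgroup (‘stratified’) evaluation is shown below; the column names, the grouping variable and the choice of sensitivity as the metric are illustrative assumptions, and a real evaluation would examine several metrics and groupings.

```python
# Minimal sketch of a subgroup evaluation: instead of one whole-cohort metric,
# compare performance across demographic groups to surface sub-populations that
# may experience lower algorithm performance. Column names ('label', 'pred'),
# the grouping variable and the use of sensitivity are illustrative assumptions.

import pandas as pd

def subgroup_sensitivity(results: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """`results` needs columns: 'label' (true 0/1), 'pred' (AI 0/1) and `group_col`."""
    rows = []
    for group, df in results.groupby(group_col):
        positives = df[df["label"] == 1]
        sensitivity = (positives["pred"] == 1).mean() if len(positives) else float("nan")
        rows.append({group_col: group, "n": len(df), "sensitivity": sensitivity})
    return pd.DataFrame(rows)

# Usage with a hypothetical external validation cohort:
# print(subgroup_sensitivity(validation_results, "ethnicity"))
# print(subgroup_sensitivity(validation_results, "age_band"))
```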

Creating well balanced, labelled data sets of sufficient size and quality can be extremely difficult (due to data unavailability) in healthcare, and the potential to embed bias into AI models is therefore high. For example, data collection can be imbalanced by the inclusion of rare diseases that are overrepresented in retrospective data sets.

The NHS AI Lab and the Health Foundation are currently supporting several projects to address these issues.115 The projects aim to identify and explore opportunities for using AI to mitigate or overcome existing healthcare inequalities, as well as address the data and model bias issues discussed in Box 6. Specifically, the STANdards for Data INclusivity and Generalisability (STANDING together) project aims to address these challenges in part by developing standards for the data sets that AI systems use, to ensure these systems are diverse, inclusive and work across all demographic groups.43

Interviewees for this research noted that, although bias in AI models is a significant concern, there is no universally accepted method for evaluating healthcare AI products for bias. Established protocols for investigating and mitigating bias during product evaluation and for reporting the potential for bias during deployment may assist with this. 

A recent review116 highlights the availability of AI-specific extensions to clinical trial reporting guidelines. These AI-specific guidelines (some listed in Box 6) are essential for the systematic assessment of the risk of bias in AI algorithms. Although they are not a measure of bias, they can estimate the risk of undetected bias, given a model and study design. Studies that score highly against these guidelines could be considered the gold standard for evidence in healthcare AI research and evaluation.117

Looking beyond bias in AI technologies, human cognitive biases relating to clinical reasoning and decision making are also a challenge, as discussed in detail in section 5.3.

Key confidence insights: factors affecting confidence in AI during CRDM

  • Personal experiences and attitudes to innovation and AI technologies can significantly impact a clinician’s confidence in using AI technologies.
  • Clinicians are more likely to trust AI-derived information that does not relate to their area of expertise. This suggests that less experienced clinicians are susceptible to inappropriately high levels of confidence in AI-derived information developed to support their decision making and may require education to critically appraise these technologies.
  • Conversely, experts (specialists in their area of care) tend to be more sceptical and question AI-derived information despite AI technologies having the potential to enhance their decision making performance in some situations.
  • The levels of clinical risk and human involvement associated with specific AI-assisted CRDM scenarios can influence confidence assessment for these technologies.
  • Clinicians need to be equipped to assess appropriate levels of confidence for each piece of AI-derived information. This assessment should be conducted per patient and with the necessary understanding of the strengths and limitations of the AI technologies involved.
  • Assessing appropriate levels of confidence in AI-derived information requires an understanding of the information that has been considered by the algorithm, and ideally what relative weighting it has been given. However, the technical features of ‘black-box’ AI technologies can limit such understanding. Further, AI-derived information for CRDM can be qualitatively different to conventional clinical information, both in complexity and the degree of synthesis in the predictions.
  • More research is needed to investigate how certain AI features influence confidence. For example, research is needed to investigate how uncertainty quantification can impact clinical decision making.
  • Explainable AI approaches (XAI) do not currently offer a panacea for assessing confidence in individual AI predictions, and may provide false reassurance to clinicians during CRDM.
  • The potential for bias in AI technologies is a major cause of concern and can affect confidence in these technologies. Established protocols for investigating and mitigating bias during product evaluation and for reporting the potential for bias during deployment may be needed. 

References

4 Spiegelhalter D. Should We Trust Algorithms? Harvard Data Sci Rev. January 2020:1-12. doi:10.1162/99608f92.cb91a35a

34 Gille F, Jobin A, Ienca M. What we talk about when we talk about trust: Theory of trust for AI in healthcare. Intell Med. 2020;1-2:100001. doi:10.1016/j.ibmed.2020.100001

79 Asan O, Bayrak AE, Choudhury A. Artificial Intelligence and Human Trust in Healthcare: Focus on Clinicians. J Med Internet Res. 2020;22(6):e15154. doi:10.2196/15154

5 Topol E. The Topol Review: Preparing the Healthcare Workforce to Deliver the Digital Future. 2019. https://topol.hee.nhs.uk/the-topol-review/. Accessed February 28, 2022.

2 Hardie T, Horton T, Willis M, Warburton W. Switched on. How Do We Get the Best out of Automation and AI in Health Care? 2021. doi:10.37829/HF-2021-I03

80 Petkus H, Hoogewerf J, Wyatt JC. What do senior physicians think about AI and clinical decision support systems: Quantitative and qualitative analysis of data from specialty societies. Clin Med J R Coll Physicians London. 2020;20(3):324-328. doi:10.7861/clinmed.2019-0317

68 Gaube S, Suresh H, Raue M, et al. Do as AI say: susceptibility in deployment of clinical decision-aids. npj Digit Med. 2021;4(1):1-8. doi:10.1038/s41746-021-00385-9

81 Westbrook JI, Raban MZ, Walter SR, Douglas H. Task errors by emergency physicians are associated with interruptions, multitasking, fatigue and working memory capacity: A prospective, direct observation study. BMJ Qual Saf. 2018;27(8):655-663. doi:10.1136/bmjqs-2017-007333

82 Larasati R, Liddo A De, Motta E. AI Healthcare System Interface: Explanation Design for Non-Expert User Trust. CEUR Workshop Proc. 2021;2903.

83 Macrae C. Governing the safety of artificial intelligence in healthcare. BMJ Qual Saf. 2019;28(6):495-498. doi:10.1136/bmjqs-2019-009484

84 Blease C, Bernstein MH, Gaab J, et al. Computerization and the future of primary care: A survey of general practitioners in the UK. PLoS One. 2018;13(12):e0207418. doi:10.1371/journal.pone.0207418

85 PwC. What doctor? Why AI and robotics will define New Health. 2017;(June):1-50. http://medicalfuturist.com/. Accessed February 28, 2022.

86 Ipsos MORI. Public views of Machine Learning: Findings from public research and engagement. 2017;(April). http://www.ipsos-mori.com/terms. Accessed February 28, 2022.

87 Holm S. Handle with care: Assessing performance measures of medical AI for shared clinical decision-making. Bioethics. 2022;36(2):178-186. doi:10.1111/bioe.12930

88 Bond RR, Mulvenna M, Wang H. Human centered artificial intelligence: Weaving UX into algorithmic decision making. RoCHI 2019 Int Conf Human-Computer Interact. 2019:2-9. https://hai.stanford.edu. Accessed March 8, 2022.

89 Buçinca Z, Malaya MB, Gajos KZ. To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-assisted Decision-making. 2021;5(April). doi:10.1145/3449287

90 Cai CJ, Winter S, Steiner D, Wilcox L, Terry M. “Hello AI”: Uncovering the onboarding needs of medical practitioners for human–AI collaborative decision-making. Proc ACM Human-Computer Interact. 2019;3(CSCW). doi:10.1145/3359206

91 Chari S, Seneviratne O, Gruen DM, Foreman MA, Das AK, McGuinness DL. Explanation Ontology: A Model of Explanations for User-Centered AI. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Vol 12507 LNCS. Springer Science and Business Media Deutschland GmbH; 2020:228-243. doi:10.1007/978-3-030-62466-8_15

92 Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. In: 34th International Conference on Machine Learning, ICML 2017. Vol 3. ; 2017:2130-2143.

93 Zhang Y, Vera Liao Q, Bellamy RKE. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. In: FAT* 2020 - Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency; 2020:295-305. doi:10.1145/3351095.3372852

94 Cutillo CM, Sharma KR, Foschini L, et al. Machine intelligence in healthcare—perspectives on trustworthiness, explainability, usability, and transparency. npj Digit Med. 2020;3(1):1-5. doi:10.1038/s41746-020-0254-2

95 Watson D. The Rhetoric and Reality of Anthropomorphism in Artificial Intelligence. Minds Mach. 2019;29(3):417-440. doi:10.1007/s11023-019-09506-6

96 Winkler JK, Fink C, Toberer F, et al. Association between Surgical Skin Markings in Dermoscopic Images and Diagnostic Performance of a Deep Learning Convolutional Neural Network for Melanoma Recognition. JAMA Dermatology. 2019;155(10):1135-1141. doi:10.1001/jamadermatol.2019.1735

97 Tjoa E, Guan C. A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI. IEEE Trans Neural Networks Learn Syst. 2021;32(11):4793-4813. doi:10.1109/TNNLS.2020.3027314

98 Jin W, Li X, Hamarneh G. One Map Does Not Fit All: Evaluating Saliency Map Explanation on Multi-Modal Medical Images. July 2021. https://arxiv.org/abs/2107.05047v1. Accessed February 28, 2022.

99 Adebayo J, Gilmer J, Muelly M, Goodfellow I, Hardt M, Kim B. Sanity checks for saliency maps. In: Advances in Neural Information Processing Systems. Vol 2018-Decem. Neural information processing systems foundation; 2018:9505-9515. https://arxiv.org/abs/1810.03292v3. Accessed February 28, 2022.

100 Ghassemi M, Oakden-Rayner L, Beam AL. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit Heal. 2021;3(11):e745-e750. doi:10.1016/s2589-7500(21)00208-9

101 Babic BB, Gerke S, Evgeniou T, Glenn Cohen I. Beware explanations from AI in health care: the benefits of explainable artificial intelligence are not what they appear. Science. 2021;373(6552):284-286. doi:10.1126/science.abg1834

102 Chen C, Li O, Tao C, Barnett AJ, Su J, Rudin C. This looks like that: Deep learning for interpretable image recognition. In: Advances in Neural Information Processing Systems. Vol 32. Neural information processing systems foundation; 2019. https://arxiv.org/abs/1806.10574v5. Accessed February 28, 2022.

103 Yu KH, Kohane IS. Framing the challenges of artificial intelligence in medicine. BMJ Qual Saf. 2019;28(3):238-241. doi:10.1136/bmjqs-2018-008551

104 Cho MK. Rising to the challenge of bias in health care AI. Nat Med. 2021;27(12):2079-2081. doi:10.1038/s41591-021-01577-2

105 Zou J, Schiebinger L. Ensuring that biomedical AI benefits diverse populations. EBioMedicine. 2021;67. doi:10.1016/j.ebiom.2021.103358

106 Centre for Data Ethics and Innovation. Review into bias in algorithmic decision-making. 2020.

107 Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342

108 Henry Kamulegeya L, Okello M, Mark Bwanika J, et al. Using artificial intelligence on dermatology conditions in Uganda: A case for diversity in training data sets for machine learning. bioRxiv. October 2019:826057. doi:10.1101/826057

109 Wynants L, Van Calster B, Collins GS, et al. Prediction models for diagnosis and prognosis of covid-19: Systematic review and critical appraisal. BMJ. 2020;369:26. doi:10.1136/bmj.m1328

110 Subbaswamy A, Adams R, Saria S. Evaluating Model Robustness and Stability to Dataset Shift. 2020;130. http://arxiv.org/abs/2010.15100. Accessed February 28, 2022.

111 McLennan S, Fiske A, Celi LA, et al. An embedded ethics approach for AI development. Nat Mach Intell. 2020;2(9):488-490. doi:10.1038/s42256-020-0214-1

112 Tatman R. Gender and Dialect Bias in YouTube’s Automatic Captions. In: EACL 2017 - Ethics in Natural Language Processing, Proceedings of the 1st ACL Workshop. 2017:53-59. doi:10.18653/v1/w17-1606

113 Koenecke A, Nam A, Lake E, et al. Racial disparities in automated speech recognition. Proc Natl Acad Sci U S A. 2020;117(14):7684-7689. doi:10.1073/pnas.1915768117

114 Challen R, Denny J, Pitt M, Gompels L, Edwards T, Tsaneva-Atanasova K. Artificial intelligence, bias and clinical safety. BMJ Qual Saf. 2019;28(3):231-237. doi:10.1136/bmjqs-2018-008370

73 Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17(1):1-9. doi:10.1186/s12916-019-1426-2

115 The AI Ethics Initiative - NHS AI Lab programmes - NHS Transformation Directorate. https://www.nhsx.nhs.uk/ai-lab/ai-lab-programmes/ethics/. Accessed March 8, 2022.

43 STANDING Together Working Group. STANDING together. 2021. https://www.datadiversity.org/. Accessed March 8, 2022.

116 Ibrahim H, Liu X, Denniston AK. Reporting guidelines for artificial intelligence in healthcare research. Clin Exp Ophthalmol. 2021;49(5):470-476. doi:10.1111/ceo.13943

117 McCradden MD, Joshi S, Anderson JA, Mazwi M, Goldenberg A, Shaul RZ. Patient safety and quality improvement: Ethical principles for a regulatory approach to bias in healthcare machine learning. J Am Med Informatics Assoc. 2020;27(12):2024-2027. doi:10.1093/jamia/ocaa085
