The What of Explainable AI
Bahador Khaleghi
September 3 · 10 min


In the previous entry, we explained why Explainable AI is important. Now, we turn to what it actually means to explain an AI model. That turns out to be a complicated question: there is no commonly agreed-upon definition of Explainable AI, in either theory or practice. This is our attempt to round up most of the competing views and explain the “What” of Explainable AI.

What does it really mean to explain AI?

What it means to explain a model can vary widely depending on the end user’s role and level of sophistication, as well as the specific constraints of a given application environment. For example, a developer creating an AI model might prefer explanations that describe the inner workings of the model to facilitate debugging. On the other hand, a lay user examining the fairness of an AI model might prefer explanations that relate its input to its output.

Tomsett et al. describe a model that is intended to identify various agent roles in a machine learning ecosystem. They define six different roles, namely, creators, data-subjects, operators, executors, examiners, and decision-subjects. Creators are agents that either implement or own the machine learning system. Data-subjects are agents whose personal data has been used to train the machine learning models. Operators are agents who directly interact with the machine learning system whereas executors are agents that make decisions while informed by the operators. Agents who audit the machine learning system are examiners. Finally, agents who are affected by the decision(s) made by the executor(s) are decision-subjects. Relying on this categorization, the authors present several example scenarios and discuss specific explainability needs of each of these six roles for each case.

The XAI community has yet to reach a consensus on the definition of AI explainability. XAI can be defined as developing models that are inherently easier to understand for their (human) users, namely, model explainability. Alternatively, XAI can be defined as the process of extracting some form of explanations from complex pre-developed models that are otherwise difficult (if not impossible) to understand for their users, namely, post-hoc explainability.

Through model or post-hoc explainability, users might be able to understand how a model makes its predictions. Yet they might still be unable to understand the AI system as a whole if the data representation used by its underlying model is not explainable to them; for example, if the input features used by the model are computed using complex mathematical formulas with no intuitive meaning. In other words, for an AI system to be explainable, its underlying data processing pipeline should be somewhat explainable.

The aforementioned lack of definition for XAI is a consequence of it being a collection of interrelated problems rather than a single problem. What these problems all have in common is the objective of making some component of the data processing pipeline understandable to users.

An illustration of various roles in a machine learning ecosystem, each with a potentially different explainability need. For instance, model examiners often require explanations that relate model input to its output, whereas model creators might require explanations that describe the inner workings of the model.

Model explainability

In his seminal article on model interpretability, Zachary Lipton proposes a taxonomy of model explainability according to the three levels at which transparency (to the user) is achieved: simulatability, decomposability, and algorithmic transparency.

A model is simulatable if a human user can comprehend the entire model at once. This typically means a model is simple enough either in terms of its size, e.g. sparse linear models, or computation required to perform predictions, e.g. shallow decision trees.

A model is decomposable if each of its parts, e.g. input, parameters, and calculations, admit an intuitive explanation. For instance, the coefficient parameters of a linear model can be explained as representing association strength between each input feature and the model output.

Lastly, algorithmic transparency means that the training process used to develop a model is well understood. For example, training linear models is known to always result in convergence to a unique solution. The heuristic optimization procedures used to train modern deep models, however, lack such algorithmic transparency.
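
As a minimal sketch of these three levels, consider a small linear regression fit to toy data (the feature names and data below are hypothetical, purely for illustration):

```python
# A toy, hypothetical example: three named features and synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["age", "income", "num_purchases"]
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Simulatability: with only three coefficients and an intercept, a user can
# step through an entire prediction by hand.
# Algorithmic transparency: ordinary least squares on full-rank data has a
# unique, well-understood solution.
model = LinearRegression().fit(X, y)

# Decomposability: each coefficient reads as the strength and direction of the
# association between one input feature and the model output.
for name, coef in zip(feature_names, model.coef_):
    print(f"{name:>15}: {coef:+.3f}")
print(f"{'intercept':>15}: {model.intercept_:+.3f}")
```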

Three levels of model transparency as proposed by Z. Lipton: simulatability, decomposability, and algorithmic transparency.

Complete vs interpretable explanations

Gilpin et al. describe some of the foundational concepts of AI explainability and use them to classify the literature. In particular, they point out a trade-off between the completeness and interpretability of explanations. Ideally, we want explanations of a model to describe its operation as accurately as possible (completeness) while also being understandable to humans (interpretability). These two objectives are often at odds: higher interpretability typically requires simpler explanations that, by definition, are less capable of faithfully describing complex models.

The authors caution against relying too much on a human-based evaluation of explanations, as it could lead to preferring interpretable explanations over complete ones. In other words, simply because an explanation provided for a model prediction seems persuasive to a user doesn’t necessarily mean it reflects what the model is doing.

Interpretability vs. completeness trade-off of explanations: often, the more human-interpretable an explanation, the lower the fidelity of its description of a model’s operation.

In fact, an extensive study of explainability methods for deep neural networks corroborates the idea that explainability methods generate incomplete explanations. The authors show the feasibility of performing what they call “dual adversarial” attacks. These are special adversarial attacks in which the model input is manipulated to yield invalid predictions while keeping the explanations generated for those predictions intact. The authors speculate that such attacks are possible because the explanations generated for model predictions only partially describe how the model works. This lack of explanation completeness can, in turn, be caused by “the over-reliance on visual assessment (by humans)” to assess the quality of explanations.

One way to mitigate the issue of over-reliance on a human-based evaluation of explanations could be to separate the tasks of extracting descriptive and persuasive explanations. For instance, one might first use a decision tree of depth 40 as a descriptive explanation of a deep neural network. Next, the decision tree can be truncated, by removing nodes deeper than 10, to elicit a persuasive explanation.
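
The sketch below illustrates this two-step idea, with a random forest standing in for the complex model and scikit-learn decision trees as surrogates. The models, depths, and data are illustrative assumptions, and refitting a shallower surrogate stands in for truncating the deep tree:

```python
# Sketch: fit a deep surrogate tree to mimic a complex model (descriptive
# explanation), then a shallow one for presentation (persuasive explanation).
# Models, depths, and data here are illustrative choices only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Stand-in for the complex, hard-to-explain model.
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
y_bb = black_box.predict(X)  # the behaviour we want to describe

# Descriptive surrogate: deep tree, higher fidelity to the black box.
descriptive = DecisionTreeClassifier(max_depth=40, random_state=0).fit(X, y_bb)

# Persuasive surrogate: shallow tree, easier to read but less faithful.
persuasive = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X, y_bb)

for name, tree in [("descriptive", descriptive), ("persuasive", persuasive)]:
    fidelity = (tree.predict(X) == y_bb).mean()
    print(f"{name}: depth = {tree.get_depth()}, fidelity to black box = {fidelity:.3f}")
```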

An example of benign, adversarial, and dually adversarial inputs generated for the ResNet image classifier and the CAM explainability method. Dually adversarial attacks generate inputs that manipulate a model’s prediction while keeping its explanation intact.

A possible formal definition of explainability

The framework proposed by Dhurandhar et al. offers a formal treatment of explainability. The authors propose to define explainability relative to a target model applied to a given task, rather than as an absolute concept. In particular, explainability is defined as a process in which some information is extracted from a complex model and communicated to a target model, often a human, to improve its performance. More specifically, a procedure P is defined to be δ-explainable if it derives and communicates information I from a complex (unexplainable) model to a target model such that the target model’s expected error (for a given task) improves by a factor of δ. In other words, the smaller the δ, the more explainable the procedure P. Interestingly, this definition does not require the target model to be a human. In practice, it can also be any model that is considered explainable by humans, e.g. a linear model or a decision tree. Another advantage of this framework is that it makes it straightforward to compare different explainability methods based on the relative performance gain of the target model. Finally, the authors show the flexibility of their framework by extending it to account for model robustness.
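
Roughly following the description above (the notation below is ours and may differ from the paper’s exact formalism), the definition can be written as: a procedure P is δ-explainable with respect to a target model T and a given task if deriving information I from the complex model and communicating it to T yields

$$\mathbb{E}\left[\,\mathrm{err}_T \mid I\,\right] \;\le\; \delta \cdot \mathbb{E}\left[\,\mathrm{err}_T\,\right], \qquad 0 < \delta \le 1,$$

so a smaller δ corresponds to a larger performance gain and hence a more explainable procedure.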

A depiction of a δ-explainable procedure where information I is derived from a complex model and communicated to a target model (often a human) to improve its performance by a factor of δ.

Explainability studies beyond the AI community

Alan Cooper, one of the pioneers of software interaction design, argues in his book The Inmates Are Running the Asylum that the main reason for poor user experience in software is programmers designing it for themselves rather than for their target audience. This phenomenon gives the book its title, and Cooper calls for designing with the needs of the end user in mind.

Drawing analogies with Cooper’s work, Tim Miller et al. warn the XAI community that their field risks the same fate as the software development industry. They point out that most of the work on XAI, including its definition and objectives, is driven only by AI researchers, despite the fact that the notion of explainability has been studied extensively by other research communities, including the social and behavioural sciences. The authors argue that this ignorance has led to the development of explanatory agents for AI researchers, rather than the intended human users. Accordingly, they urge the XAI community to draw more on the vast body of research in the social sciences to better understand how humans’ cognitive biases and social expectations colour their perception of explanations.

For example, one of the major findings of social science studies of human explanations is that they are often contrastive: they describe why a given outcome, known as the fact, occurred instead of another (expected) outcome, often called the foil. This is both a challenge and an opportunity for XAI. It is a challenge because the foil is often only implied and thus must be determined first. It is also an opportunity because, arguably, providing a contrastive explanation, rather than a full causal explanation, might be computationally easier. The notion of counterfactual explanations is closely related to the contrastive nature of explanations. Counterfactual explanations describe the smallest change to the feature values required to change the prediction to a predefined, alternative output, as sketched below.
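
As a toy sketch of a counterfactual explanation (the data, model, and single-feature search below are illustrative simplifications, not a specific published method):

```python
# Toy counterfactual search: find a small change to a single feature that
# flips the model's prediction from the fact to the foil. Everything here
# (data, model, search strategy) is an illustrative simplification.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

x = X[0].copy()
fact = model.predict([x])[0]   # the "fact": the current prediction
foil = 1 - fact                # the "foil": the alternative outcome

best_change = None
# Try increasingly large perturbations of each single feature, in both
# directions, and stop at the first one that flips the prediction.
for step in np.linspace(0.1, 5.0, 50):
    for j in range(x.shape[0]):
        for sign in (+1.0, -1.0):
            x_cf = x.copy()
            x_cf[j] += sign * step
            if model.predict([x_cf])[0] == foil:
                best_change = (j, sign * step)
                break
        if best_change:
            break
    if best_change:
        break

if best_change:
    j, delta = best_change
    print(f"Prediction flips from {fact} to {foil} if feature {j} changes by {delta:+.2f}")
else:
    print("No single-feature counterfactual found in the searched range.")
```

A practical counterfactual method would additionally constrain the change to be plausible and actionable, for example by restricting which features may move and by how much.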

What’s next?

For the XAI field to make meaningful progress, more work is needed to form an agreed-upon definition of its fundamental notions. Our review of the literature shows that what it means to explain an AI model is context and application dependent. That being said, it might be possible to define explanations more formally relative to a given application. Finally, the AI community should broaden its understanding of explainability by borrowing from the vast related literature in the social and behavioural sciences.

We invite the more technical audience, and anyone interested in gaining a deeper understanding of XAI, to check out the author’s blog for a three-part series on the How of XAI. The series presents a systematic review of some of the most influential XAI methodologies that can be applied before, during, and after the modelling stage of the AI development pipeline.


Special thanks to Xavier Snelgrove, Elnaz Barshan, Lindsay Brin, Santiago Salcido, Manon Gruaz, Genevieve Merat, Simon Hudson, and Jean-Philippe Reid for valuable comments and illustrations. Edited by Peter Henderson.