class: center, middle, title-slide

# Inference for model-agnostic variable importance

### Brian D. Williamson, PhD
Kaiser Permanente Washington Health Research Institute
### 21 June, 2022
https://bdwilliamson.github.io/talks

---

<style type="text/css">
.remark-slide-content {
  font-size: 20px;
}
.remark-slide-content h2 {
  font-size: 1.75rem;
}
</style>

## Acknowledgments

This work was done in collaboration with:

<img src="img/people1.PNG" width="65%" style="display: block; margin: auto;" />
<img src="img/people3.png" width="50%" style="display: block; margin: auto;" />

---

## Variable importance: what and why

**What is variable importance?**

--

* .blue1[Quantification of "contributions" of a variable] (or a set of variables)

--

Traditionally: contribution to .blue2[predictions]

--

* Useful to distinguish between contributions to predictions...

--

* (.blue1[extrinsic importance]) ... .blue1[by a given (possibly black-box) algorithm] .small[ [e.g., Breiman (2001)] ]

--

* (.blue1[intrinsic importance]) ... .blue1[by the best possible (i.e., oracle) algorithm] .small[ [e.g., van der Laan (2006)] ]

--

* Our work focuses on .blue1[interpretable, model-agnostic intrinsic importance]

--

Example uses of .blue2[intrinsic] variable importance:

* is it worth extracting text from notes in the EHR for the sake of predicting hospital readmission?

--

* is it worth collecting a given covariate for the sake of predicting neutralization sensitivity?

---

## Case study: ANOVA importance

Data unit `\((X, Y) \sim P_0\)` with:

* outcome `\(Y\)`
* covariate vector `\(X := (X_1, X_2, \ldots, X_p)\)`

--

**Goals:**

* .green[estimate]
* .blue1[and do inference on]

the importance of `\((X_j: j \in s)\)` in predicting `\(Y\)`

--

How do we typically do this in **linear regression**?

---

## Case study: ANOVA importance

How do we typically do this in **linear regression**?

* Fit a linear regression of `\(Y\)` on `\(X\)` `\(\rightarrow \color{magenta}{\mu_n(X)}\)`

--

* Fit a linear regression of `\(Y\)` on `\(X_{-s}\)` `\(\rightarrow \color{magenta}{\mu_{n,-s}(X)}\)`

--

* .green[Compare the fitted values] `\([\mu_n(X_i), \mu_{n,-s}(X_i)]\)`

--

Many ways to compare fitted values, including:

* ANOVA decomposition
* Difference in `\(R^2\)`

---

## Case study: ANOVA importance

Difference in `\(R^2\)`:

`$$\left[1 - \frac{n^{-1}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{n^{-1}\sum_{i=1}^n\{Y_i - \overline{Y}_n\}^2}\right] - \left[1 - \frac{n^{-1}\sum_{i=1}^n\{Y_i - \mu_{n,-s}(X_i)\}^2}{n^{-1}\sum_{i=1}^n\{Y_i - \overline{Y}_n\}^2}\right]$$`

--

Inference:

* Test of the difference
* Valid confidence interval

---

## Case study: ANOVA importance

Consider the .blue1[population parameter]

`$$\psi_{0,s} = \frac{E_0\{\mu_0(X) - \mu_{0,-s}(X)\}^2}{var_0(Y)}$$`

* `\(\mu_0(x) := E_0(Y \mid X = x)\)` .blue1[(true conditional mean)]
* `\(\mu_{0,-s}(x) := E_0(Y \mid X_{-s} = x_{-s})\)` [for a vector `\(z\)`, `\(z_{-s}\)` represents `\((z_j: j \notin s)\)`]

--

* .blue2[nonparametric extension] of the linear regression-based ANOVA parameter

--

* Can be expressed as a `\(\color{magenta}{\text{difference in population } R^2}\)` values, since

`$$\color{magenta}{\psi_{0,s} = \left[1 - \frac{E_0\{Y - \mu_0(X)\}^2}{var_0(Y)}\right] - \left[1 - \frac{E_0\{Y - \mu_{0,-s}(X)\}^2}{var_0(Y)}\right]}$$`

---

## Case study: ANOVA importance

How should we make inference on `\(\psi_{0,s}\)`?

--

1. construct estimators `\(\mu_n\)`, `\(\mu_{n,-s}\)` of `\(\mu_0\)` and `\(\mu_{0,-s}\)` (e.g., with machine learning)

--

2. plug in:
`$$\psi_{n,s} := \frac{\frac{1}{n}\sum_{i=1}^n \{\mu_n(X_i) - \mu_{n,-s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}$$`

--

but this estimator has .red[asymptotic bias]

--

3. using influence function-based debiasing [e.g., Pfanzagl (1982)], we get the estimator
`$$\color{magenta}{\psi_{n,s}^* := \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right] - \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_{n,-s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right]}$$`
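
---

## Case study: ANOVA importance

As a concrete illustration, a minimal R sketch of this estimator via its difference-in- `\(R^2\)` representation; linear regression stands in for the (possibly machine-learned) fits `\(\mu_n\)` and `\(\mu_{n,-s}\)`, and the simulated data and variable names are purely illustrative:

```r
# simulated data, purely illustrative
set.seed(1234)
n <- 1000
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- 1 + 2 * x$x1 + 0.5 * x$x2 + rnorm(n)

# estimate mu_n (all covariates) and mu_{n,-s} (dropping x1, i.e., s = {1});
# any regression procedure, including machine learning, could replace lm()
mu_n    <- lm(y ~ x1 + x2, data = x)
mu_n_ms <- lm(y ~ x2, data = x)

# plug the fitted values into the difference-in-R^2 representation
var_y    <- mean((y - mean(y))^2)
psi_star <- (1 - mean((y - fitted(mu_n))^2) / var_y) -
  (1 - mean((y - fitted(mu_n_ms))^2) / var_y)
psi_star
```

Linear regression is used only to keep the sketch self-contained; the same plug-in computation applies to any fitted `\(\mu_n\)` and `\(\mu_{n,-s}\)`.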

---

## Case study: ANOVA importance

`$$\color{magenta}{\psi_{n,s}^* := \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right] - \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_{n,-s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right]}$$`

Key observations:

* `\(\psi_{n,s}^* =\)` plug-in estimator of `\(\psi_{0,s}\)` based on the difference-in- `\(R^2\)` representation

--

* .blue1[No need to debias] the difference-in- `\(R^2\)` estimator!

--

* Why does this happen? .blue2[Estimation of] `\(\mu_{0}\)` .blue2[and] `\(\mu_{0,-s}\)` .blue2[yields only second-order terms, so the estimator behaves as if they were **known**]

--

Under regularity conditions, `\(\psi_{n,s}^*\)` is consistent and nonparametrically efficient.

--

In particular, `\(\sqrt{n}(\psi_{n,s}^* - \psi_{0,s})\)` has a mean-zero normal limit with estimable variance.

[Details in Williamson et al. (2020)]

---

## Preparing for AMP

<img src="img/amp.png" width="200px" style="display: block; margin: auto;" />

* 611 HIV-1 pseudoviruses
* Outcome: neutralization sensitivity/resistance to antibody

--

**Goal:** pre-screen features for inclusion in secondary analysis

* 800 individual features, 13 groups of interest

--

Procedure:

1. Estimate `\(\mu_n\)`, `\(\mu_{n,-s}\)` using Super Learner [van der Laan et al. (2007)]
2. Estimate and do inference on variable importance `\(\psi_{n,s}^*\)`

.small[ [Details in Magaret et al. (2019) and Williamson et al. (2021b)] ]

---

## Preparing for AMP: R-squared

<img src="img/vim_ic50.censored_pres_r2_conditional_simple.png" height="480px" style="display: block; margin: auto;" />

---

## Generalization to arbitrary measures

The ANOVA example suggests a natural generalization:

--

* Choose a relevant measure of .blue1[predictiveness] for the task at hand

--

* `\(V(f, P) =\)` .blue1[predictiveness] of function `\(f\)` under sampling from `\(P\)`
* `\(\mathcal{F} =\)` rich class of candidate prediction functions
* `\(\mathcal{F}_{-s} =\)` {all functions in `\(\mathcal{F}\)` that ignore components with index in `\(s\)`} `\(\subset \mathcal{F}\)`

--

* Define the oracle prediction functions `\(f_0:=\)` maximizer of `\(V(f, P_0)\)` over `\(\mathcal{F}\)` & `\(f_{0,-s}:=\)` maximizer of `\(V(f, P_0)\)` over `\(\mathcal{F}_{-s}\)`

--

Define the importance of `\((X_j: j \in s)\)` relative to `\(X\)` as

`$$\color{magenta}{\psi_{0,s} := V(f_0, P_0) - V(f_{0,-s}, P_0) \geq 0}$$`

---

## Generalization to arbitrary measures

Some examples of predictiveness measures:

(arbitrary outcomes) `\(R^2\)`: `\(V(f, P) = 1 - E_P\{Y - f(X)\}^2 / var_P(Y)\)`

--

(binary outcomes)

Classification accuracy: `\(V(f, P) = P\{Y = f(X)\}\)`

AUC: `\(V(f, P) = P\{f(X_1) < f(X_2) \mid Y_1 = 0, Y_2 = 1\}\)` for `\((X_1, Y_1) \perp (X_2, Y_2)\)`

Pseudo- `\(R^2\)`: `\(1 - \frac{E_P[Y \log f(X) + (1 - Y)\log \{1 - f(X)\}]}{P(Y = 1)\log P(Y = 1) + P(Y = 0)\log P(Y = 0)}\)`
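
---

## Generalization to arbitrary measures

A minimal R sketch of empirical analogues `\(V(f, P_n)\)` of these measures; the function names are illustrative, `f` denotes fitted values, predicted labels, or predicted probabilities as appropriate, and `y` the observed outcomes (coded 0/1 for the binary measures):

```r
# R^2: one minus the mean squared error, scaled by the variance of y
r_squared <- function(f, y) 1 - mean((y - f)^2) / mean((y - mean(y))^2)

# classification accuracy: f contains predicted class labels (binary y)
accuracy <- function(f, y) mean(f == y)

# AUC: probability that a randomly chosen case is ranked above a control
# (ties ignored); f contains predicted scores or probabilities (binary y)
auc <- function(f, y) mean(outer(f[y == 1], f[y == 0], FUN = ">"))

# pseudo-R^2: one minus the ratio of the model log-likelihood to the
# null-model log-likelihood; f contains predicted probabilities P(Y = 1 | X)
pseudo_r_squared <- function(f, y) {
  p1 <- mean(y)
  1 - mean(y * log(f) + (1 - y) * log(1 - f)) /
    (p1 * log(p1) + (1 - p1) * log(1 - p1))
}
```

Plugging the fitted values from the full and reduced regressions into any one of these functions yields the corresponding plug-in importance estimate `\(V(f_n, P_n) - V(f_{n,-s}, P_n)\)` discussed on the next slide.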

---

## Generalization to arbitrary measures

How should we make inference on `\(\psi_{0,s}\)`?

--

1. construct estimators `\(f_n\)`, `\(f_{n,-s}\)` of `\(f_0\)` and `\(f_{0,-s}\)` (e.g., with machine learning)

--

2. plug in:
`$$\psi_{n,s}^* := V(f_n, P_n) - V(f_{n,-s}, P_n)$$`
where `\(P_n\)` is the empirical distribution based on the available data

--

3. Inference can be carried out using influence functions.

--

Why? We can write `\(V(f_n, P_n) - V(f_{0}, P_0) \approx \color{green}{V(f_0, P_n) - V(f_0, P_0)} + \color{blue}{V(f_n, P_0) - V(f_0, P_0)}\)`

--

* the `\(\color{green}{\text{green term}}\)` can be studied using the functional delta method
* the `\(\color{blue}{\text{blue term}}\)` is second-order because `\(f_0\)` maximizes `\(V\)` over `\(\mathcal{F}\)`

--

In other words: `\(f_0\)` and `\(f_{0,-s}\)` **can be treated as known** in studying the behavior of `\(\psi_{n,s}^*\)`!

[Details in Williamson et al. (2021b)]

---

## Preparing for AMP: the full picture

<img src="img/vim_ic50.censored_pres_r2_acc_auc_conditional_simple.png" height="480px" style="display: block; margin: auto;" />

---

## Importance of cross-fitting

A key regularity condition: `\(f_n\)` and `\(f_{n,-s}\)` .red[not too complex]

--

This condition can be removed using .blue1[cross-fitting]

--

Experiments:

* 2 important features, `\(p - 2\)` noise features, binary outcome
* Importance measured via accuracy
* Scenario 1: `\(p = 4\)`, covariates independent
* Scenario 2: `\(p \in \{50, 100, 200\}\)`, some covariates correlated

---

## Cross-fitting: Scenario 1

<img src="img/main_null_accuracy_3_p_4_boot_0.png" height="360px" style="display: block; margin: auto;" />

---

## Cross-fitting: Scenario 2

<img src="img/supplement_highdim_accuracy_1.png" width="600px" height="450px" style="display: block; margin: auto;" />

---

## Extension: correlated features

So far: importance of `\((X_j: j \in s)\)` relative to `\(X\)`

--

`\(\color{red}{\text{Potential issue}}\)`: correlated features

Example: two highly correlated features, age and foot size; predicting toddlers' reading ability

--

* True importance of age = 0 (since foot size is in the model)
* True importance of foot size = 0 (since age is in the model)

--

Idea: average the contribution of a feature over all subsets!

--

True importance of age = average(.blue1[increase in predictiveness from adding age to foot size] & .green[increase in predictiveness from using age over nothing])

--

We borrowed ideas from game theory to develop a subset-averaged (Shapley value-based) framework.

Details in Williamson and Feng (2020)

---

## Extension: longitudinal VIMs

So far: cross-sectional variable importance

Can we do inference on variable importance longitudinally?

<img src="index_files/figure-html/longitudinal-vim-1.png" width="480px" height="360px" style="display: block; margin: auto;" />

---

## Extension: longitudinal VIMs

<img src="index_files/figure-html/longitudinal-vim-2-1.png" width="360px" height="240px" style="display: block; margin: auto;" />

.blue2[Yes!] And we can do inference on summaries of the time series:

* .blue1[average importance] over a contiguous subset of time points
* .green[linear trend] over a contiguous subset of time points
* .blue2[area under the trajectory] over a contiguous subset of time points

Inference relies on the functional delta method. Manuscript with details forthcoming.
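
---

## Estimating VIMs in practice

A minimal usage sketch with the `vimp` R package (see the closing slide for links): a cross-fitted `\(R^2\)`-based importance estimate via Super Learner, with an influence function-based confidence interval. The simulated data are purely illustrative, and the `vimp_rsquared()` call follows the package README, so argument names may differ across package versions:

```r
library("vimp")          # R package implementing these VIM estimators
library("SuperLearner")  # used for the nuisance regressions

# simulated data, purely illustrative
set.seed(5678)
n <- 1000
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- 1 + 2 * x$x1 + 0.5 * x$x2 + rnorm(n)

# cross-fitted R^2-based importance of x1 (indx = 1), with a point estimate,
# confidence interval, and p-value; add more flexible learners
# (e.g., "SL.ranger") in a real analysis
est <- vimp_rsquared(Y = y, X = x, indx = 1, run_regression = TRUE,
                     SL.library = c("SL.glm", "SL.mean"), V = 2)
est
```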

---

## Closing thoughts

.blue1[Population-based] variable importance:

* wide variety of meaningful measures
* simple estimators
* machine learning okay
* valid inference, testing

Check out the software:

* [R package `vimp`](https://github.com/bdwilliamson/vimp)
* [Python package `vimpy`](https://github.com/bdwilliamson/vimpy)

https://github.com/bdwilliamson | https://bdwilliamson.github.io

---

## References

* .small[ Breiman L. 2001. Random forests. _Machine Learning_.]
* .small[ Breiman L. 2001. Statistical modeling: the two cultures. _Statistical Science_.]
* .small[ Chernozhukov V et al. 2018. Double/debiased machine learning for treatment and structural parameters. _The Econometrics Journal_.]
* .small[ Magaret CA, Benkeser DC, Williamson BD, et al. 2019. Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features. _PLoS Computational Biology_.]
* .small[ van der Laan MJ. 2006. Statistical inference for variable importance. _The International Journal of Biostatistics_.]
* .small[ van der Laan MJ, Polley EC, and Hubbard AE. 2007. Super Learner. _Statistical Applications in Genetics and Molecular Biology_.]
* .small[ van der Laan MJ and Rose S. 2011. Targeted learning: causal inference for observational and experimental data. _Springer_.]

---

## References

* .small[ Williamson BD, Magaret CA, Gilbert PB, Nizam S, Simmons C, and Benkeser DC. 2021a. Super LeArner Prediction of NAb Panels (SLAPNAP): a containerized tool for predicting combination monoclonal broadly neutralizing antibody sensitivity. _Bioinformatics_.]
* .small[ Williamson BD, Gilbert P, Carone M, and Simon N. 2020. Nonparametric variable importance assessment using machine learning techniques (+ rejoinder to discussion). _Biometrics_.]
* .small[ Williamson BD, Gilbert P, Simon N, and Carone M. 2021b. A general framework for inference on algorithm-agnostic variable importance. _Journal of the American Statistical Association_.]
* .small[ Williamson BD and Feng J. 2020. Efficient nonparametric statistical inference on population feature importance using Shapley values. _ICML_.]