class: center, middle, title-slide .title[ # Inference for model-agnostic variable importance ] .author[ ### Brian D. Williamson, PhD
Kaiser Permanente Washington Health Research Institute
] .date[ ### 25 January, 2024
https://bdwilliamson.github.io/#talks
] ---

<style type="text/css">
.remark-slide-content {
  font-size: 20px;
}
.remark-slide-content h2 {
  font-size: 1.75rem;
}
</style>

## Acknowledgments

This work was done in collaboration with:

<img src="img/people1.PNG" width="60%" height="30%" style="display: block; margin: auto;" />
<img src="img/people2.PNG" width="60%" height="30%" style="display: block; margin: auto;" />

---

## Variable importance: what and why

**What is variable importance?**

--

* .blue1[Quantification of "contributions" of a variable] (or a set of variables)

--

Traditionally: contribution to .blue2[predictions]

--

* Useful to distinguish between contributions to predictions...

--

* (.blue1[extrinsic importance]) ... .blue1[by a given (possibly black-box) algorithm] .small[ [e.g., Breiman (2001)] ]

--

* (.blue1[intrinsic importance]) ... .blue1[by the best possible (i.e., oracle) algorithm] .small[ [e.g., van der Laan (2006)] ]

--

* Our work focuses on .blue1[interpretable, model-agnostic intrinsic importance]

--

Example uses of .blue2[intrinsic] variable importance:
* is it worth extracting text from notes in the EHR for the sake of predicting hospital readmission?

--

* does the importance of item 9 on the Patient Health Questionnaire in predicting risk of suicide attempt change over time?

---

## Case study: ANOVA importance

Data unit `\((X, Y) \sim P_0\)` with:
* outcome `\(Y\)`
* covariate `\(X := (X_1, X_2, \ldots, X_p)\)`

--

**Goals:**
* .green[estimate]
* .blue1[and do inference on] the importance of `\((X_j: j \in s)\)` in predicting `\(Y\)`

--

How do we typically do this in **linear regression**?

---

## Case study: ANOVA importance

How do we typically do this in **linear regression**?

* Fit a linear regression of `\(Y\)` on `\(X\)` `\(\rightarrow \color{magenta}{\mu_n(X)}\)`

--

* Fit a linear regression of `\(Y\)` on `\(X_{-s}\)` `\(\rightarrow \color{magenta}{\mu_{n,-s}(X)}\)`

--

* .green[Compare the fitted values] `\([\mu_n(X_i), \mu_{n,-s}(X_i)]\)`

--

Many ways to compare fitted values, including:
* ANOVA decomposition
* Difference in `\(R^2\)`

---

## Case study: ANOVA importance

Difference in `\(R^2\)`:

`$$\left[1 - \frac{n^{-1}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{n^{-1}\sum_{i=1}^n\{Y_i - \overline{Y}_n\}^2}\right] - \left[1 - \frac{n^{-1}\sum_{i=1}^n\{Y_i - \mu_{n,-s}(X_i)\}^2}{n^{-1}\sum_{i=1}^n\{Y_i - \overline{Y}_n\}^2}\right]$$`

--

Inference:
* test whether the difference is zero
* construct a valid confidence interval for the difference

---

## Case study: ANOVA importance

Consider the .blue1[population parameter]

`$$\psi_{0,s} = \frac{E_0\{\mu_0(X) - \mu_{0,-s}(X)\}^2}{var_0(Y)}$$`

* `\(\mu_0(x) := E_0(Y \mid X = x)\)` .blue1[(true conditional mean)]
* `\(\mu_{0,-s}(x) := E_0(Y \mid X_{-s} = x_{-s})\)` [for a vector `\(z\)`, `\(z_{-s}\)` represents `\((z_j: j \notin s)\)`]

--

* .blue2[nonparametric extension] of the linear regression-based ANOVA parameter

--

* Can be expressed as a `\(\color{magenta}{\text{difference in population } R^2}\)` values, since

`$$\color{magenta}{\psi_{0,s} = \left[1 - \frac{E_0\{Y - \mu_0(X)\}^2}{var_0(Y)}\right] - \left[1 - \frac{E_0\{Y - \mu_{0,-s}(X)\}^2}{var_0(Y)}\right]}$$`

---

## Case study: ANOVA importance

How should we make inference on `\(\psi_{0,s}\)`?

--

1. construct estimators `\(\mu_n\)`, `\(\mu_{n,-s}\)` of `\(\mu_0\)` and `\(\mu_{0,-s}\)` (e.g., with machine learning)

--

2. plug in:
`$$\psi_{n,s} := \frac{\frac{1}{n}\sum_{i=1}^n \{\mu_n(X_i) - \mu_{n,-s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}$$`

--

but this estimator has .red[asymptotic bias]

--

3. using influence function-based debiasing [e.g., Pfanzagl (1982)], we get the estimator
`$$\color{magenta}{\psi_{n,s}^* := \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right] - \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_{n,-s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right]}$$`

--

Under regularity conditions, `\(\psi_{n,s}^*\)` is consistent and nonparametrically efficient. In particular, `\(\sqrt{n}(\psi_{n,s}^* - \psi_{0,s})\)` has a mean-zero normal limit with estimable variance.

[Details in Williamson et al. (2021)]
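---

## Case study: a sketch in R

To make the estimator concrete, here is a minimal sketch of `\(\psi_{n,s}^*\)` in R. The simulated data and the use of plain linear regressions (standing in for flexible machine learning) are illustrative assumptions only:

```r
# Minimal sketch of the debiased difference-in-R^2 estimator; in practice
# the two regressions would be fit with flexible machine learning
set.seed(20240125)
n <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y <- x1 + 0.5 * x2 + rnorm(n)

mu_n <- fitted(lm(y ~ x1 + x2))    # estimate of mu_0, regressing Y on all of X
mu_n_red <- fitted(lm(y ~ x2))     # estimate of mu_{0,-s}, omitting X_s = X_1

var_n <- mean((y - mean(y))^2)     # empirical variance of Y
r2_full <- 1 - mean((y - mu_n)^2) / var_n
r2_reduced <- 1 - mean((y - mu_n_red)^2) / var_n
(psi_star <- r2_full - r2_reduced) # debiased VIM estimate psi*_{n,s}
```

In practice, the influence function-based variance estimate of Williamson et al. (2021) yields Wald-type confidence intervals around `\(\psi_{n,s}^*\)`.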
---

## Preparing for AMP

<img src="img/amp.png" width="200px" style="display: block; margin: auto;" />

* 611 HIV-1 pseudoviruses
* Outcome: neutralization sensitivity/resistance to antibody

--

**Goal:** pre-screen features for inclusion in secondary analysis
* 800 individual features, 13 groups of interest

--

Procedure:
1. Estimate `\(\mu_n\)`, `\(\mu_{n,-s}\)` using Super Learner [van der Laan et al. (2007)]
2. Estimate and do inference on variable importance `\(\psi_{n,s}^*\)`

.small[ [Details in Magaret et al. (2019) and Williamson et al. (2021)] ]

---

## Preparing for AMP: SL performance

.pull-left[
<img src="img/sl_perf_ic50.censored.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="img/sl_roc_ic50.censored.png" width="600px" style="display: block; margin: auto;" />
]

---

## Preparing for AMP: R-squared

<img src="img/vim_ic50.censored_pres_r2_conditional_simple.png" width="100%" height="480px" style="display: block; margin: auto;" />

---

## Preparing for AMP: R-squared

<img src="img/ROC_curve_with_Env_inset_v2.png" width="60%" style="display: block; margin: auto;" />

.small[ Magaret et al. (2019) ]

---

## Generalization to arbitrary measures

The ANOVA example suggests a natural generalization:

--

* Choose a relevant measure of .blue1[predictiveness] for the task at hand

--

* `\(V(f, P) =\)` .blue1[predictiveness] of function `\(f\)` under sampling from `\(P\)`
* `\(\mathcal{F} =\)` rich class of candidate prediction functions
* `\(\mathcal{F}_{-s} =\)` {all functions in `\(\mathcal{F}\)` that ignore components with index in `\(s\)`} `\(\subset \mathcal{F}\)`

--

* Define the oracle prediction functions `\(f_0:=\)` maximizer of `\(V(f, P_0)\)` over `\(\mathcal{F}\)` & `\(f_{0,-s}:=\)` maximizer of `\(V(f, P_0)\)` over `\(\mathcal{F}_{-s}\)`

--

Define the importance of `\((X_j: j \in s)\)` relative to `\(X\)` as

`$$\color{magenta}{\psi_{0,s} := V(f_0, P_0) - V(f_{0,-s}, P_0) \geq 0}$$`

---

## Generalization to arbitrary measures

Some examples of predictiveness measures:

(arbitrary outcomes) `\(R^2\)`: `\(V(f, P) = 1 - E_P\{Y - f(X)\}^2 / var_P(Y)\)`

--

(binary outcomes) Classification accuracy: `\(V(f, P) = P\{Y = f(X)\}\)`

AUC: `\(V(f, P) = P\{f(X_1) < f(X_2) \mid Y_1 = 0, Y_2 = 1\}\)` for `\((X_1, Y_1) \perp (X_2, Y_2)\)`

Pseudo-`\(R^2\)`: `\(1 - \frac{E_P[Y \log f(X) + (1 - Y)\log \{1 - f(X)\}]}{P(Y = 1)\log P(Y = 1) + P(Y = 0)\log P(Y = 0)}\)`
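---

## Generalization: measures in R

The plug-in versions of these measures are straightforward to compute. A minimal sketch follows; the function names are ours for illustration, not from any package:

```r
# Empirical (plug-in) predictiveness measures V(f, P_n); for binary outcomes,
# preds are predicted probabilities or risk scores from a fitted function f
measure_r2 <- function(y, preds) {
  1 - mean((y - preds)^2) / mean((y - mean(y))^2)
}
measure_accuracy <- function(y, preds) {  # preds already classified to {0, 1}
  mean(y == preds)
}
measure_auc <- function(y, preds) {
  cases <- preds[y == 1]; controls <- preds[y == 0]
  # P{f(X_1) < f(X_2) | Y_1 = 0, Y_2 = 1}, ignoring ties
  mean(outer(controls, cases, "<"))
}
```

The plug-in VIM estimate is then a difference of two such quantities, e.g., `measure_auc(y, preds_full) - measure_auc(y, preds_reduced)` for hypothetical fitted values from the full and reduced prediction functions.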
---

## Generalization to arbitrary measures

How should we make inference on `\(\psi_{0,s}\)`?

--

1. construct estimators `\(f_n\)`, `\(f_{n,-s}\)` of `\(f_0\)` and `\(f_{0,-s}\)` (e.g., with machine learning)

--

2. plug in:
`$$\psi_{n,s}^* := V(f_n, P_n) - V(f_{n,-s}, P_n)$$`
where `\(P_n\)` is the empirical distribution based on the available data

--

3. Inference can be carried out using influence functions.

--

Why? We can write `\(V(f_n, P_n) - V(f_{0}, P_0) \approx \color{green}{V(f_0, P_n) - V(f_0, P_0)} + \color{blue}{V(f_n, P_0) - V(f_0, P_0)}\)`

--

* the `\(\color{green}{\text{green term}}\)` can be studied using the functional delta method
* the `\(\color{blue}{\text{blue term}}\)` is second-order because `\(f_0\)` maximizes `\(V\)` over `\(\mathcal{F}\)`

--

In other words: `\(f_0\)` and `\(f_{0,-s}\)` **can be treated as known** in studying the behavior of `\(\psi_{n,s}^*\)`!

[Details in Williamson et al. (2022)]

---

## Preparing for AMP: the full picture

<img src="img/vim_ic50.censored_pres_r2_acc_auc_conditional_simple.png" width="100%" height="480px" style="display: block; margin: auto;" />

[AMP results in Juraska et al. (2023)]

---

## Survival analysis

.pull-left[
Survival analysis requires special care:
* `\(T\)`: event time of interest
* `\(C\)`: censoring time
* Ideal data unit: `\((X,T,C)\)`
* Observed data unit: `\((X,Y,\Delta)\)`, where
  * `\(Y = \min\{T,C\}\)`
  * `\(\Delta = I(T \leq C)\)`
]

.pull-right[
<img src="img/charlie.png" width="30%" style="display: block; margin: auto;" />
]

---

## Survival analysis

Example predictiveness measures (all rely on a landmark time `\(\tau\)`):

AUC at `\(\tau\)`: `\(V(f, P) = P\{f(X_1) > f(X_2) \mid T_1 \leq \tau, T_2 > \tau\}\)`
* `\(f\)` is a risk score

Brier score: `\(V(f, P) = -E[\{f(X) - I(T > \tau)\}^2]\)`
* `\(f\)` predicts the binary outcome `\(I(T > \tau)\)`

Survival time MSE: `\(V(f, P) = -E[\{f(X) - \min(T, \tau)\}^2]\)`
* `\(f\)` predicts a participant's `\(\tau\)`-restricted survival time

C-index: `\(V(f, P) = P\{f(X_1) > f(X_2) \mid T_1 < T_2, T_1 \leq \tau\}\)`

---

## Survival analysis

Further required conditions:
* `\(T\)` and `\(C\)` are conditionally independent given `\(X\)`
* there exists some `\(\tau_0 \in (0, \infty)\)` such that `\(P(C > \tau_0 \mid X) > 0\)` almost surely

--

Also required: estimators of
* the oracle prediction function (often the conditional survival function)
* the time-to-event distribution function
* the conditional censoring survival function

--

Estimating the cumulative hazard requires additional debiasing

[Details in Wolock et al. (2023)]

---

## Extension: correlated features

So far: importance of `\((X_j: j \in s)\)` relative to `\(X\)`

--

`\(\color{red}{\text{Potential issue}}\)`: correlated features

Example: two highly correlated features, age and foot size; predicting toddlers' reading ability

--

* True importance of age = 0 (since foot size is in the model)
* True importance of foot size = 0 (since age is in the model)

--

Idea: average the contribution of a feature over all subsets!

--

True importance of age = average(.blue1[increase in predictiveness from adding age to foot size] & .green[increase in predictiveness from using age over nothing])

--

We borrowed ideas from game theory (Shapley values) to develop a subset-averaged framework; a sketch follows.

[Details in Williamson and Feng (2020)]
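---

## Extension: correlated features in R

A toy illustration of subset averaging with two features, mirroring the age/foot size example. The data-generating setup is a made-up assumption, and plain linear models stand in for machine learning:

```r
# Toy Shapley-style subset averaging with p = 2 features, using R^2 as the
# predictiveness measure; with two features, each ordering gets weight 1/2
set.seed(1)
n <- 1000
age <- rnorm(n)
foot <- age + rnorm(n, sd = 0.1)   # foot size, highly correlated with age
reading <- age + rnorm(n)          # reading ability depends on age only

r2 <- function(fit, y) 1 - mean(residuals(fit)^2) / mean((y - mean(y))^2)
v_null <- 0                        # R^2 of the no-covariate (mean-only) model
v_age  <- r2(lm(reading ~ age), reading)
v_foot <- r2(lm(reading ~ foot), reading)
v_both <- r2(lm(reading ~ age + foot), reading)

# Subset-averaged importance of age: average of its marginal contributions
# (adding age to nothing, and adding age to foot size)
(shapley_age <- 0.5 * (v_age - v_null) + 0.5 * (v_both - v_foot))
```

Williamson and Feng (2020) develop efficient inference for the population version of these subset-averaged (Shapley) values.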
<img src="img/lvim_example.png" width="60%" style="display: block; margin: auto;" /> --- ## Summarizing a VIM trajectory ‍Define: * contiguous set of timepoints `\(\tau := [t_0, t_1]\)` * variable importance at each time point `\(\psi_{0,s,t}\)`, `\(t \in \tau\)` -- Examples of summary measures over `\(\tau\)`, `\(m(\psi_{0,s,\tau})\)`: ‍Mean: `\(\lVert \tau \rVert^{-1} \sum_{t \in \tau} \psi_{0,s,t}\)` -- Linear trend: `\((\beta_0, \beta_1) = \text{arg min}_{(\alpha_1, \alpha_2) \in \mathbb{R}^2} \lVert \psi_{0,s,t} - \alpha_1 - t \alpha_2\rVert_2^2\)` --- ## Summarizing a VIM trajectory <img src="img/lvim_example.png" width="40%" style="display: block; margin: auto;" /> | Summary | VIM 1 | VIM 2 | VIM 3 | |:-------:--------:-------:-------| | Mean | 0.3 | 0.8 | 0.06 | | Slope | 0.1 | 0 | -0.05 | --- ## Summarizing a VIM trajectory How should we make inference on `\(m(\psi_{0,s,\tau})\)`? -- 1. construct estimators `\(f_{n,t}\)`, `\(f_{n,-s,t}\)` of `\(f_{0,t}\)`, `\(f_{0,-s,t}\)` (e.g., with machine learning) -- 2. plug in: `\(\psi_{n,s,t}^* = V(f_{n,s,t}, P_{n,s,t}) - V(f_{n,-s,t}, P_{n,s,t})\)` -- 3. plug in: `$$m_{n,s}^* := m(\psi_{n,s,\tau}^*)$$` -- 4. Inference can be carried out using influence functions. [Details in Williamson et al. (2023)] --- ## VIMs for predictors of suicide risk Data gathered from electronic health record on sample of 343,950 visits made by 184,782 people Key variables: .small[ Jacobs et al. (2010) ] * Patient Health Questionnaire (PHQ) * PHQ-8 total score (depressive symptoms) * PHQi9 (suicidal ideation) * Prior recorded self-harm * Age * Sex (sex assigned at birth) Outcome: suicide attempt in 90 days following mental health visit Sampled one visit per person at six possible measurement times over 18 months: * 99,991 people had only one visit * 4,093 people had six visits * Rate of suicide attempt approximately 0.5% at all time points --- ## VIMs for predictors of suicide risk ‍Goal: estimate VIMs for PHQi9 and prior recorded self-harm Variable sets considered: 1. no variables 2. PHQi9 alone 3. age and sex (base set) 4. age, sex, and PHQi9 5. age, sex, and prior self-harm 6. age, sex, prior self-harm, and PHQi9 -- Estimate prediction functions at each time point using the Super Learner [ van der Laan et al. 
---

## VIMs for predictors of suicide risk

Data gathered from the electronic health record on a sample of 343,950 visits made by 184,782 people

Key variables: .small[ Jacobs et al. (2010) ]
* Patient Health Questionnaire (PHQ)
  * PHQ-8 total score (depressive symptoms)
  * PHQi9 (suicidal ideation)
* Prior recorded self-harm
* Age
* Sex (sex assigned at birth)

Outcome: suicide attempt in the 90 days following a mental health visit

Sampled one visit per person at six possible measurement times over 18 months:
* 99,991 people had only one visit
* 4,093 people had six visits
* Rate of suicide attempt approximately 0.5% at all time points

---

## VIMs for predictors of suicide risk

Goal: estimate VIMs for PHQi9 and prior recorded self-harm

Variable sets considered:
1. no variables
2. PHQi9 alone
3. age and sex (base set)
4. age, sex, and PHQi9
5. age, sex, and prior self-harm
6. age, sex, prior self-harm, and PHQi9

--

Estimate prediction functions at each time point using the Super Learner [van der Laan et al. (2007)]

---

## VIMs for predictors of suicide risk

<img src="img/vims_of_interest_over_time_presentation.png" width="100%" style="display: block; margin: auto;" />

---

## VIMs for predictors of suicide risk

<table class="table" style="font-size: 14px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Summary </th> <th style="text-align:left;"> Comparison </th> <th style="text-align:left;"> Estimate </th> <th style="text-align:left;"> SE </th> <th style="text-align:left;"> 95% CI </th> <th style="text-align:left;"> p-value </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Mean </td> <td style="text-align:left;"> PHQi9 versus no variables </td> <td style="text-align:left;"> 0.132 </td> <td style="text-align:left;"> 0.019 </td> <td style="text-align:left;"> [0.096, 0.169] </td> <td style="text-align:left;"> &lt; 0.001 </td> </tr>
<tr> <td style="text-align:left;"> Mean </td> <td style="text-align:left;"> PHQi9 versus age and sex </td> <td style="text-align:left;"> 0.040 </td> <td style="text-align:left;"> 0.012 </td> <td style="text-align:left;"> [0.016, 0.064] </td> <td style="text-align:left;"> &lt; 0.001 </td> </tr>
<tr> <td style="text-align:left;"> Mean </td> <td style="text-align:left;"> PHQi9 versus age, sex, and prior self-harm variables </td> <td style="text-align:left;"> 0.033 </td> <td style="text-align:left;"> 0.012 </td> <td style="text-align:left;"> [0.010, 0.056] </td> <td style="text-align:left;"> 0.002 </td> </tr>
<tr> <td style="text-align:left;"> Trend: slope </td> <td style="text-align:left;"> PHQi9 versus no variables </td> <td style="text-align:left;"> 0.007 </td> <td style="text-align:left;"> 0.011 </td> <td style="text-align:left;"> [-0.014, 0.028] </td> <td style="text-align:left;"> 0.507 </td> </tr>
<tr> <td style="text-align:left;"> Trend: slope </td> <td style="text-align:left;"> PHQi9 versus age and sex </td> <td style="text-align:left;"> -0.005 </td> <td style="text-align:left;"> 0.007 </td> <td style="text-align:left;"> [-0.019, 0.009] </td> <td style="text-align:left;"> 0.486 </td> </tr>
<tr> <td style="text-align:left;"> Trend: slope </td> <td style="text-align:left;"> PHQi9 versus age, sex, and prior self-harm variables </td> <td style="text-align:left;"> -0.005 </td> <td style="text-align:left;"> 0.007 </td> <td style="text-align:left;"> [-0.018, 0.008] </td> <td style="text-align:left;"> 0.468 </td> </tr>
</tbody>
</table>

---

## Closing thoughts

.blue1[Population-based] variable importance:
* wide variety of meaningful measures
* simple estimators
* machine learning okay
* valid inference and testing
* extension to survival analysis
* extension to longitudinal VIMs
* extension to correlated features

Check out the software (a usage sketch follows):
* R packages [`vimp`](https://github.com/bdwilliamson/vimp), [`lvimp`](https://github.com/bdwilliamson/lvimp)
* [Python package `vimpy`](https://github.com/bdwilliamson/vimpy)
* Survival analysis: R package [`survML`](https://github.com/cwolock/survML) (will be incorporated into `vimp` soon)

https://github.com/bdwilliamson | https://bdwilliamson.github.io
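---

## Using `vimp`: a sketch

A hedged example of calling the `vimp` package for an `\(R^2\)`-based VIM. The argument names follow the package vignette as best we recall; check the current documentation, since the interface may differ across versions:

```r
# Sketch only: see the vimp vignette for the current interface
library(vimp)          # also requires the SuperLearner package
set.seed(1234)
x <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
y <- x$x1 + rnorm(200)

# R^2-based importance of the first column of x, with Super Learner fits
est <- vimp_rsquared(
  Y = y, X = x, indx = 1,
  run_regression = TRUE,
  SL.library = c("SL.glm", "SL.mean"),
  V = 2
)
est  # the returned object includes the point estimate, CI, and p-value
```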
---

## References

* .small[ Breiman L. 2001. Random forests. _Machine Learning_. ]
* .small[ Magaret CA, Benkeser DC, Williamson BD, et al. 2019. Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features. _PLOS Computational Biology_. ]
* .small[ Juraska M, et al. 2023. Prevention efficacy of the broadly neutralizing antibody VRC01 depends on HIV-1 envelope sequence features. _Proceedings of the National Academy of Sciences_. ]
* .small[ van der Laan MJ. 2006. Statistical inference for variable importance. _The International Journal of Biostatistics_. ]
* .small[ van der Laan MJ, Polley EC, and Hubbard AE. 2007. Super Learner. _Statistical Applications in Genetics and Molecular Biology_. ]
* .small[ Williamson BD, Gilbert P, Carone M, and Simon N. 2021. Nonparametric variable importance assessment using machine learning techniques (+ rejoinder to discussion). _Biometrics_. ]
* .small[ Williamson BD, Gilbert P, Simon N, and Carone M. 2022. A general framework for inference on algorithm-agnostic variable importance. _Journal of the American Statistical Association_. ]
* .small[ Williamson BD, Moodie EEM, and Shortreed SM. 2023. Inference on summaries of a model-agnostic longitudinal variable importance trajectory. _arXiv preprint_, https://arxiv.org/abs/2311.01638. ]
* .small[ Williamson BD and Feng J. 2020. Efficient nonparametric statistical inference on population feature importance using Shapley values. _ICML_. ]
* .small[ Wolock CJ, Gilbert PB, Simon N, and Carone M. 2023. Nonparametric variable importance for time-to-event outcomes with application to prediction of HIV infection. _arXiv preprint_, https://arxiv.org/abs/2311.12726. ]