class: center, middle, title-slide

# Inference for model-agnostic variable importance

### Brian D. Williamson, PhD
Fred Hutchinson Cancer Research Center
### 19 January, 2021
https://bdwilliamson.github.io/#talks
---

<style type="text/css">
.remark-slide-content {
  font-size: 20px;
}
.remark-slide-content h2 {
  font-size: 1.75rem;
}
</style>

## Acknowledgments

This work was done in collaboration with:

<img src="img/people1.PNG" width="65%" style="display: block; margin: auto;" />
<img src="img/people2.PNG" width="55%" style="display: block; margin: auto;" />

---

## Motivation

<img src="img/examples.png" height="450px" style="display: block; margin: auto;" />

---

## Motivation: HIV envelope and antibody targets

<img src="img/hiv_env.png" width="90%" style="display: block; margin: auto;" />

.small[Source: Koff and Berkley (2010)]

---

## Motivation: AMP

AMP overall objective: assess .blue2[VRC01] .blue1[prevention efficacy] (PE) against HIV-1
* VRC01: broadly neutralizing antibody (bnAb) isolated from a donor

--

Key secondary question: .green[Which genetic mutations] make HIV-1 .purple[susceptible] to neutralization?

--

Challenges:
* How should we measure susceptibility?

--

* How do we determine if a mutation has a real effect?

--

  at .red[many] positions?

--

* Can we use .mutedred[machine learning]?

---

## Data on viral neutralization sensitivity

CATNAP: publicly available database .small[ [Yoon et al. (2015)] ]:
* `\(\text{IC}_{50}\)` and `\(\text{IC}_{80}\)` neutralization values from TZM-bl assay
  - `\(\text{IC}_{x}\)`: concentration that neutralizes `\(x\)` percent of pseudoviruses

--

<img src="index_files/figure-html/ic50-example-1.png" width="37%" style="display: block; margin: auto;" />

---

## Data on viral neutralization sensitivity

CATNAP: publicly available database .small[ [Yoon et al. (2015)] ]:
* `\(\text{IC}_{50}\)` and `\(\text{IC}_{80}\)` neutralization values from TZM-bl assay
  - `\(\text{IC}_{x}\)`: concentration that neutralizes `\(x\)` percent of pseudoviruses

<img src="index_files/figure-html/ic50-example-2-1.png" width="37%" style="display: block; margin: auto;" />

---

## Data on viral neutralization sensitivity

CATNAP: publicly available database .small[ [Yoon et al. (2015)] ]:
* `\(\text{IC}_{50}\)` and `\(\text{IC}_{80}\)` neutralization values from TZM-bl assay
  - `\(\text{IC}_{x}\)`: concentration that neutralizes `\(x\)` percent of pseudoviruses

<img src="index_files/figure-html/ic50-example-3-1.png" width="37%" style="display: block; margin: auto;" />

---

## Data on viral neutralization sensitivity

CATNAP: publicly available database .small[ [Yoon et al. (2015)] ]:
* `\(\text{IC}_{50}\)` and `\(\text{IC}_{80}\)` neutralization values from TZM-bl assay
  - `\(\text{IC}_{x}\)`: concentration that neutralizes `\(x\)` percent of pseudoviruses

<img src="index_files/figure-html/ic50-example-4-1.png" width="37%" style="display: block; margin: auto;" />

--

Define .blue1[sensitivity] = `\(\text{IC}_{50} < 1\)` µg/mL (83% sensitive in CATNAP)

---

## Data on viral neutralization sensitivity

<img src="img/vrc01_features.png" width="100%" style="display: block; margin: auto;" />

--

For VRC01: 611 observations; 800 individual features .small[ [Details in Magaret et al. (2019)] ]

---

## Variable importance: what and why

**What is variable importance?**
* .blue1[Quantification of "contributions" of a variable] (or a set of variables)

--

Traditionally: contribution to .blue2[predictions]

--

* Useful to distinguish between contributions to predictions...

--

* (.blue1[extrinsic importance]) ... .blue1[by a given (possibly black-box) algorithm] .small[ [e.g., Breiman (2001)] ]

--

* (.blue1[intrinsic importance]) ... .blue1[by the best possible (i.e., oracle) algorithm] .small[ [e.g., van der Laan (2006)] ]
--

* Our work focuses on .blue1[interpretable, model-agnostic intrinsic importance]

--

Example uses of .blue2[intrinsic] variable importance:
* is it worth extracting text from notes in the EHR for the sake of predicting hospital readmission?

--

* is it worth collecting a given covariate for the sake of predicting neutralization sensitivity?

---

## Case study: ANOVA importance

Data unit `\((X, Y) \sim P_0\)` with:
* outcome `\(Y\)`
* covariate `\(X := (X_1, X_2, \ldots, X_p)\)`

--

**Goals:**
* .green[estimate]
* .blue1[and do inference on] the importance of `\((X_j: j \in s)\)` in predicting `\(Y\)`

--

How do we typically do this in **linear regression**?

---

## Case study: ANOVA importance

How do we typically do this in **linear regression**?
* Fit a linear regression of `\(Y\)` on `\(X\)` `\(\rightarrow \color{magenta}{\hat{\mu}(X)}\)`

--

* Fit a linear regression of `\(Y\)` on `\(X_{-s}\)` `\(\rightarrow \color{magenta}{\hat{\mu}_s(X)}\)`

--

* .green[Compare the fitted values] `\([\hat{\mu}(X_i), \hat{\mu}_s(X_i)]\)`

--

Many ways to compare fitted values, including:
* ANOVA decomposition
* Difference in `\(R^2\)`

---

## Case study: ANOVA importance

Difference in `\(R^2\)`:

`$$\left[1 - \frac{n^{-1}\sum_{i=1}^n\{Y_i - \hat{\mu}(X_i)\}^2}{n^{-1}\sum_{i=1}^n\{Y_i - \overline{Y}_n\}^2}\right] - \left[1 - \frac{n^{-1}\sum_{i=1}^n\{Y_i - \hat{\mu}_s(X_i)\}^2}{n^{-1}\sum_{i=1}^n\{Y_i - \overline{Y}_n\}^2}\right]$$`

--

Inference:
* test whether the difference is zero
* construct a valid confidence interval for the difference

---

## Case study: ANOVA importance

Consider the .blue1[population parameter]

`$$\psi_{0,s} = \frac{E_0\{\mu_0(X) - \mu_{0,s}(X)\}^2}{var_0(Y)}$$`

* `\(\mu_0(x) := E_0(Y \mid X = x)\)` .blue1[(true conditional mean)]
* `\(\mu_{0,s}(x) := E_0(Y \mid X_{-s} = x_{-s})\)` [for a vector `\(z\)`, `\(z_{-s}\)` represents `\((z_j: j \notin s)\)`]

--

* .blue2[nonparametric extension] of the linear regression-based ANOVA parameter

--

* Can be expressed as a `\(\color{magenta}{\text{difference in population } R^2}\)` values, since

`$$\color{magenta}{\psi_{0,s} = \left[1 - \frac{E_0\{Y - \mu_0(X)\}^2}{var_0(Y)}\right] - \left[1 - \frac{E_0\{Y - \mu_{0,s}(X)\}^2}{var_0(Y)}\right]}$$`

---

## Case study: ANOVA importance

How should we make inference on `\(\psi_{0,s}\)`?

--

1. construct estimators `\(\mu_n\)`, `\(\mu_{n,s}\)` of `\(\mu_0\)` and `\(\mu_{0,s}\)` (e.g., with machine learning)

--

2. plug in:
`$$\psi_{n,s} := \frac{\frac{1}{n}\sum_{i=1}^n \{\mu_n(X_i) - \mu_{n,s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}$$`

--

   but this estimator has .red[asymptotic bias]

--
3. using influence function-based debiasing [e.g., Pfanzagl (1982)], we get the estimator

`$$\color{magenta}{\psi_{n,s}^* := \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right] - \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_{n,s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right]}$$`

---

## Case study: ANOVA importance

`$$\color{magenta}{\psi_{n,s}^* := \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right] - \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_{n,s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right]}$$`

Key observations:
* `\(\psi_{n,s}^* =\)` plug-in estimator of `\(\psi_{0,s}\)` based on the difference-in-`\(R^2\)` representation

--

* .blue1[No need to debias] the difference-in-`\(R^2\)` estimator!

--

* Why does this happen? .blue2[Estimation of] `\(\mu_{0}\)` .blue2[and] `\(\mu_{0,s}\)` .blue2[yields only second-order terms, so the estimator behaves as if they are **known**]

--

Under regularity conditions, `\(\psi_{n,s}^*\)` is consistent and nonparametrically efficient.

--

In particular, `\(\sqrt{n}(\psi_{n,s}^* - \psi_{0,s})\)` has a mean-zero normal limit with estimable variance.

[Details in Williamson et al. (2020a)]

---

## Preparing for AMP

<img src="img/amp.png" width="200px" style="display: block; margin: auto;" />

* 611 HIV-1 pseudoviruses
* Outcome: neutralization sensitivity/resistance to antibody

--

**Goal:** pre-screen features for inclusion in the secondary analysis
* 800 individual features, 13 groups of interest

--

Procedure:
1. Estimate `\(\mu_n\)`, `\(\mu_{n,s}\)` using Super Learner [van der Laan et al. (2007)]
2. Estimate and do inference on variable importance `\(\psi_{n,s}^*\)`

.small[ [Details in Magaret et al. (2019) and Williamson et al. (2020b)] ]

---

## Preparing for AMP: SL performance

.pull-left[
<img src="img/sl_perf_ic50.censored.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="img/sl_roc_ic50.censored.png" width="600px" style="display: block; margin: auto;" />
]

---

## Preparing for AMP: R-squared

<img src="img/vim_ic50.censored_pres_r2_conditional_simple.png" height="480px" style="display: block; margin: auto;" />

---

## Preparing for AMP: R-squared

<img src="img/ROC_curve_with_Env_inset_v2.png" width="60%" style="display: block; margin: auto;" />

.small[ Magaret et al. (2019) ]
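---

## Case study: a worked sketch of `\(\psi_{n,s}^*\)`

A minimal sketch (not from the talk) of the difference-in-`\(R^2\)` plug-in estimator from this case study, with simple linear regressions standing in for the machine-learning estimators of `\(\mu_0\)` and `\(\mu_{0,s}\)`; the simulated data are hypothetical:

```r
# Sketch: plug-in difference-in-R^2 estimator psi*_{n,s} for s = {1},
# with linear regressions as stand-ins for machine-learning estimators
set.seed(20210119)
n <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + 0.25 * x2 + rnorm(n)

mu_n  <- fitted(lm(y ~ x1 + x2))  # estimates mu_0(X) = E(Y | X1, X2)
mu_ns <- fitted(lm(y ~ x2))       # estimates mu_{0,s}(X) = E(Y | X2)

var_n <- mean((y - mean(y))^2)    # empirical variance of Y
r2_full    <- 1 - mean((y - mu_n)^2) / var_n
r2_reduced <- 1 - mean((y - mu_ns)^2) / var_n

r2_full - r2_reduced              # estimated importance of X1
```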
---

## Generalization to arbitrary measures

The ANOVA example suggests a natural generalization:

--

* Choose a relevant measure of .blue1[predictiveness] for the task at hand

--

* `\(V(f, P) =\)` .blue1[predictiveness] of function `\(f\)` under sampling from `\(P\)`
* `\(\mathcal{F} =\)` rich class of candidate prediction functions
* `\(\mathcal{F}_{-s} =\)` {all functions in `\(\mathcal{F}\)` that ignore components with index in `\(s\)`} `\(\subset \mathcal{F}\)`

--

* Define the oracle prediction functions `\(f_0:=\)` maximizer of `\(V(f, P_0)\)` over `\(\mathcal{F}\)` & `\(f_{0,s}:=\)` maximizer of `\(V(f, P_0)\)` over `\(\mathcal{F}_{-s}\)`

--

Define the importance of `\((X_j: j \in s)\)` relative to `\(X\)` as

`$$\color{magenta}{\psi_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0) \geq 0}$$`

---

## Generalization to arbitrary measures

Some examples of predictiveness measures:

(arbitrary outcomes) `\(R^2\)`: `\(V(f, P) = 1 - E_P\{Y - f(X)\}^2 / var_P(Y)\)`

--

(binary outcomes) Classification accuracy: `\(V(f, P) = P\{Y = f(X)\}\)`

AUC: `\(V(f, P) = P\{f(X_1) < f(X_2) \mid Y_1 = 0, Y_2 = 1\}\)` for `\((X_1, Y_1) \perp (X_2, Y_2)\)`

Pseudo-`\(R^2\)`: `\(1 - \frac{E_P[Y \log f(X) + (1 - Y)\log \{1 - f(X)\}]}{P(Y = 1)\log P(Y = 1) + P(Y = 0)\log P(Y = 0)}\)`

---

## Generalization to arbitrary measures

How should we make inference on `\(\psi_{0,s}\)`?

--

1. construct estimators `\(f_n\)`, `\(f_{n,s}\)` of `\(f_0\)` and `\(f_{0,s}\)` (e.g., with machine learning)

--

2. plug in:
`$$\psi_{n,s}^* := V(f_n, P_n) - V(f_{n,s}, P_n)$$`
where `\(P_n\)` is the empirical distribution based on the available data

--

3. Inference can be carried out using influence functions.

--

Why? We can write `\(V(f_n, P_n) - V(f_0, P_0) \approx \color{green}{V(f_0, P_n) - V(f_0, P_0)} + \color{blue}{V(f_n, P_0) - V(f_0, P_0)}\)`

--

* the `\(\color{green}{\text{green term}}\)` can be studied using the functional delta method
* the `\(\color{blue}{\text{blue term}}\)` is second-order because `\(f_0\)` maximizes `\(V\)` over `\(\mathcal{F}\)`

--

In other words: `\(f_0\)` and `\(f_{0,s}\)` **can be treated as known** in studying the behavior of `\(\psi_{n,s}^*\)`!

[Details in Williamson et al. (2020b)]

---

## Preparing for AMP: the full picture

<img src="img/vim_ic50.censored_pres_r2_acc_auc_conditional_simple.png" height="480px" style="display: block; margin: auto;" />

---

## Preparing for AMP: the full picture

Implications:

--

* All sites in the VRC01 binding footprint and the CD4 binding sites appear important

--

* Results may differ based on the chosen measure

--

Current work:

--

* New analysis based on the updated CATNAP database
* Incorporating results from the AMP primary analysis (to appear)

---

## Summary (so far)

.blue1[Intrinsic importance = predictiveness potential] of a variable (or set of variables).

--

Simple estimators of importance are
* .blue1[unbiased] and
* .blue2[efficient]

--

even if machine learning techniques are used.
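Implemented in packages `vimp` ([CRAN](https://cran.r-project.org/web/packages/vimp/), [GitHub](https://github.com/bdwilliamson/vimp)) and `vimpy` ([PyPI](https://pypi.org/project/vimpy/)).

A minimal sketch of the R interface (not from the talk): the simulated data and learner library are hypothetical, and the call assumes the `vimp_rsquared()` arguments documented in the package:

```r
# Minimal sketch (not from the talk): R^2-based importance of one feature.
# The data-generating process and Super Learner library are hypothetical.
library("vimp")
library("SuperLearner")

set.seed(4747)
n <- 500
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- 1 + 0.5 * x$x1 + rnorm(n)  # x2 is intrinsically unimportant here

# estimate the R^2-based importance of x2 (column index 2);
# the full and reduced regressions are fit internally via Super Learner
est <- vimp_rsquared(Y = y, X = x, indx = 2, run_regression = TRUE,
                     SL.library = c("SL.glm", "SL.mean"), V = 2)
est  # point estimate, confidence interval, and p-value
```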
--

Several limitations:
* current approach relies on .green[sample-splitting] for hypothesis testing
* .red[highly correlated features] require additional care
* .red[a single bnAb is not likely to confer full protection against HIV-1]

---

## Extension: correlated features

So far: importance of `\((X_j: j \in s)\)` relative to `\(X\)`

--

`\(\color{red}{\text{Potential issue}}\)`: correlated features

Example: two highly correlated features, age and foot size; predicting toddlers' reading ability

--

* True importance of age = 0 (since foot size is in the model)
* True importance of foot size = 0 (since age is in the model)

--

Idea: average the contribution of a feature over all subsets!

--

True importance of age = average(.blue1[increase in predictiveness from adding age to foot size] & .green[increase in predictiveness from using age over nothing])

--

We borrowed ideas from game theory (Shapley values) to develop a subset-averaged framework; a sketch appears in the appendix.

[Details in Williamson and Feng (2020)]

---

## Extension: combination regimens

AMP: testing the PE of a single bnAb. Current regimens involve `\(>1\)` bnAb

--

Key goal of trial networks: prioritizing combination bnAb regimens

--

Developed software `SLAPNAP` ([GitHub](https://github.com/benkeser/slapnap), [DockerHub](https://hub.docker.com/r/slapnap/slapnap)) to aid this effort:
* Can help guide ranking of regimens by predicted PE
* Can help guide secondary analyses in efficacy trials

[Details in Benkeser et al. (2020)]

---

## Current and future work

Collaborative science:
* Longitudinal correlates of risk of hospitalized dengue disease in dengue vaccine trials
* Prioritizing bnAb regimens for further clinical testing
* Identifying metabolomic biomarkers as predictors of breast and colorectal cancer
* Correlates of risk of COVID-19 in vaccine efficacy trials

---

## Current and future work

Statistical methodology:

--

* Variable selection:

--

  * Procedures that control the family-wise error rate and false discovery rate

--

  * Handling complex missing data using imputation and weighting methods

--

* Variable importance:

--

  * Studying and proposing further clinically useful measures

--

  * Improving power in hypothesis testing

--

  * Immune correlates of protection

--

* Incorporating machine learning into phase 4 vaccine safety and effectiveness studies

---

## Closing thoughts

.blue1[Population-based] variable importance:
* wide variety of meaningful measures
* simple estimators
* machine learning okay
* valid inference, testing

Two interpretations:
* conditional
* subset-averaged

.center[
https://github.com/bdwilliamson |
https://bdwilliamson.github.io
]

---

## References

* .small[ Benkeser DC, Williamson BD, Magaret CA, Nizam S, and Gilbert PB. 2020. Super LeArner Prediction of NAb Panels (SLAPNAP): a containerized tool for predicting combination monoclonal broadly neutralizing antibody sensitivity. _bioRxiv technical report_. ]
* .small[ Breiman L. 2001. Random forests. _Machine Learning_. ]
* .small[ Koff WC and Berkley SF. 2010. The renaissance in HIV vaccine development -- future directions. _The New England Journal of Medicine_. ]
* .small[ Magaret CA, Benkeser DC, Williamson BD, et al. 2019. Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features. _PLoS Computational Biology_. ]
* .small[ Pfanzagl J. 1982. Contributions to a General Asymptotic Statistical Theory. _Springer Lecture Notes in Statistics_. ]
* .small[ van der Laan MJ. 2006. Statistical inference for variable importance. _The International Journal of Biostatistics_. ]
* .small[ van der Laan MJ, Polley EC, and Hubbard AE. 2007. Super Learner. _Statistical Applications in Genetics and Molecular Biology_. ]

---

## References

* .small[ Williamson BD, Gilbert PB, Carone M, and Simon N. 2020a. Nonparametric variable importance assessment using machine learning techniques (+ rejoinder to discussion). _Biometrics_. ]
* .small[ Williamson BD, Gilbert PB, Simon N, and Carone M. 2020b. A unified approach for inference on algorithm-agnostic variable importance. _arXiv technical report_. ]
* .small[ Williamson BD and Feng J. 2020. Efficient nonparametric statistical inference on population feature importance using Shapley values. _ICML_. ]
* .small[ Yoon H, Macke J, West AP, et al. 2015. CATNAP: a tool to compile, analyze and tally neutralizing antibody panels. _Nucleic Acids Research_. ]
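---

## Appendix: subset-averaged importance sketch

A back-of-the-envelope sketch (not from the talk) of the subset-averaging idea from the correlated-features extension: with two features, the importance of age is the average of its marginal contributions to the two subsets it can join. The simulated data and the use of linear regressions are hypothetical stand-ins:

```r
# Sketch: subset-averaged (Shapley-style) importance of "age" when age and
# foot size are highly correlated; average age's contribution over the
# subsets it can be added to
set.seed(1)
n <- 1000
age <- rnorm(n)
foot_size <- age + rnorm(n, sd = 0.1)  # highly correlated with age
reading <- 0.8 * age + rnorm(n)

# R^2-type predictiveness of a fitted regression
r2 <- function(fit, y) 1 - mean(residuals(fit)^2) / mean((y - mean(y))^2)

v_none <- 0                                           # null model predicts the mean
v_age  <- r2(lm(reading ~ age), reading)              # age alone
v_foot <- r2(lm(reading ~ foot_size), reading)        # foot size alone
v_both <- r2(lm(reading ~ age + foot_size), reading)  # both features

# average of: adding age to foot size, and using age over nothing
mean(c(v_both - v_foot, v_age - v_none))
```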