class: center, middle, title-slide

# Inference for model-agnostic variable importance

### Brian D. Williamson, PhD
Fred Hutchinson Cancer Research Center
### 19 January, 2021
https://bdwilliamson.github.io/#talks
---

<style type="text/css">
.remark-slide-content {
  font-size: 20px;
}
.remark-slide-content h2 {
  font-size: 1.75rem;
}
</style>

## Acknowledgments

This work was done in collaboration with:

<img src="img/people1.PNG" width="65%" style="display: block; margin: auto;" />
<img src="img/people2.PNG" width="55%" style="display: block; margin: auto;" />

---

## Motivation

<img src="img/examples.png" height="450px" style="display: block; margin: auto;" />

---

## Motivation: HIV envelope and antibody targets

<img src="img/hiv_env.png" width="90%" style="display: block; margin: auto;" />

.small[Source: Koff and Berkley (2010)]

---

## Motivation: AMP

AMP overall objective: assess .blue2[VRC01] .blue1[prevention efficacy] (PE) against HIV-1
* VRC01: broadly neutralizing antibody (bnAb) isolated from a donor

--

Key secondary question: .green[Which genetic mutations] make HIV-1 .purple[susceptible] to neutralization?

--

Challenges:
* How should we measure susceptibility?

--

* How do we determine if a mutation has a real effect?

--

  at .red[many] positions?

--

* Can we use .mutedred[machine learning]?

---

## Data on viral neutralization sensitivity

CATNAP: publicly available database .small[ [Yoon et al. (2015)] ]:
* `\(\text{IC}_{50}\)` and `\(\text{IC}_{80}\)` neutralization values from TZM-bl assay
  - `\(\text{IC}_{x}\)`: concentration that neutralizes `\(x\)` percent of pseudoviruses

--

<img src="index_files/figure-html/ic50-example-1.png" width="37%" style="display: block; margin: auto;" />

---

## Data on viral neutralization sensitivity

CATNAP: publicly available database .small[ [Yoon et al. (2015)] ]:
* `\(\text{IC}_{50}\)` and `\(\text{IC}_{80}\)` neutralization values from TZM-bl assay
  - `\(\text{IC}_{x}\)`: concentration that neutralizes `\(x\)` percent of pseudoviruses

<img src="index_files/figure-html/ic50-example-2-1.png" width="37%" style="display: block; margin: auto;" />

---

## Data on viral neutralization sensitivity

CATNAP: publicly available database .small[ [Yoon et al. (2015)] ]:
* `\(\text{IC}_{50}\)` and `\(\text{IC}_{80}\)` neutralization values from TZM-bl assay
  - `\(\text{IC}_{x}\)`: concentration that neutralizes `\(x\)` percent of pseudoviruses

<img src="index_files/figure-html/ic50-example-3-1.png" width="37%" style="display: block; margin: auto;" />

---

## Data on viral neutralization sensitivity

CATNAP: publicly available database .small[ [Yoon et al. (2015)] ]:
* `\(\text{IC}_{50}\)` and `\(\text{IC}_{80}\)` neutralization values from TZM-bl assay
  - `\(\text{IC}_{x}\)`: concentration that neutralizes `\(x\)` percent of pseudoviruses

<img src="index_files/figure-html/ic50-example-4-1.png" width="37%" style="display: block; margin: auto;" />

--

Define .blue1[sensitivity] = `\(\text{IC}_{50} < 1\)` µg/mL (83% sensitive in CATNAP)

---

## Data on viral neutralization sensitivity

<img src="img/vrc01_features.png" width="100%" style="display: block; margin: auto;" />

--

For VRC01: 611 observations; 800 individual features .small[ [Details in Magaret et al. (2019)] ]

---

## Variable importance: what and why

**What is variable importance?**
* .blue1[Quantification of "contributions" of a variable] (or a set of variables)

--

Traditionally: contribution to .blue2[predictions]

--

* Useful to distinguish between contributions to predictions...

--

* (.blue1[extrinsic importance]) ... .blue1[by a given (possibly black-box) algorithm] .small[ [e.g., Breiman (2001)] ]

--

* (.blue1[intrinsic importance]) ... .blue1[by the best possible (i.e., oracle) algorithm] .small[ [e.g., van der Laan (2006)] ]
--

* Our work focuses on .blue1[interpretable, model-agnostic intrinsic importance]

--

Example uses of .blue2[intrinsic] variable importance:
* is it worth extracting text from notes in the EHR for the sake of predicting hospital readmission?

--

* is it worth collecting a given covariate for the sake of predicting neutralization sensitivity?

---

## Case study: ANOVA importance

Data unit `\((X, Y) \sim P_0\)` with:
* outcome `\(Y\)`
* covariate `\(X := (X_1, X_2, \ldots, X_p)\)`

--

**Goals:**
* .green[estimate]
* .blue1[and do inference on] the importance of `\((X_j: j \in s)\)` in predicting `\(Y\)`

--

How do we typically do this in **linear regression**?

---

## Case study: ANOVA importance

How do we typically do this in **linear regression**?
* Fit a linear regression of `\(Y\)` on `\(X\)` `\(\rightarrow \color{magenta}{\hat{\mu}(X)}\)`

--

* Fit a linear regression of `\(Y\)` on `\(X_{-s}\)` `\(\rightarrow \color{magenta}{\hat{\mu}_s(X)}\)`

--

* .green[Compare the fitted values] `\([\hat{\mu}(X_i), \hat{\mu}_s(X_i)]\)`

--

Many ways to compare fitted values, including:
* ANOVA decomposition
* Difference in `\(R^2\)`

---

## Case study: ANOVA importance

Difference in `\(R^2\)`:

`$$\left[1 - \frac{n^{-1}\sum_{i=1}^n\{Y_i - \hat{\mu}(X_i)\}^2}{n^{-1}\sum_{i=1}^n\{Y_i - \overline{Y}_n\}^2}\right] - \left[1 - \frac{n^{-1}\sum_{i=1}^n\{Y_i - \hat{\mu}_s(X_i)\}^2}{n^{-1}\sum_{i=1}^n\{Y_i - \overline{Y}_n\}^2}\right]$$`

--

Inference:
* test whether the difference is zero
* construct a valid confidence interval for the difference

---

## Case study: ANOVA importance

Consider the .blue1[population parameter]

`$$\psi_{0,s} = \frac{E_0\{\mu_0(X) - \mu_{0,s}(X)\}^2}{var_0(Y)}$$`

* `\(\mu_0(x) := E_0(Y \mid X = x)\)` .blue1[(true conditional mean)]
* `\(\mu_{0,s}(x) := E_0(Y \mid X_{-s} = x_{-s})\)` [for a vector `\(z\)`, `\(z_{-s}\)` represents `\((z_j: j \notin s)\)`]

--

* .blue2[nonparametric extension] of the linear regression-based ANOVA parameter

--

* Can be expressed as a `\(\color{magenta}{\text{difference in population } R^2}\)` values, since

`$$\color{magenta}{\psi_{0,s} = \left[1 - \frac{E_0\{Y - \mu_0(X)\}^2}{var_0(Y)}\right] - \left[1 - \frac{E_0\{Y - \mu_{0,s}(X)\}^2}{var_0(Y)}\right]}$$`

---

## Case study: ANOVA importance

How should we make inference on `\(\psi_{0,s}\)`?

--

1. construct estimators `\(\mu_n\)`, `\(\mu_{n,s}\)` of `\(\mu_0\)` and `\(\mu_{0,s}\)` (e.g., with machine learning)

--

2. plug in:
`$$\psi_{n,s} := \frac{\frac{1}{n}\sum_{i=1}^n \{\mu_n(X_i) - \mu_{n,s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}$$`

--

   but this estimator has .red[asymptotic bias]

--
3. using influence function-based debiasing [e.g., Pfanzagl (1982)], we get the estimator

`$$\color{magenta}{\psi_{n,s}^* := \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right] - \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_{n,s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right]}$$`

---

## Case study: ANOVA importance

`$$\color{magenta}{\psi_{n,s}^* := \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right] - \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_{n,s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right]}$$`

Key observations:
* `\(\psi_{n,s}^* =\)` plug-in estimator of `\(\psi_{0,s}\)` based on the difference-in-`\(R^2\)` representation

--

* .blue1[No need to debias] the difference-in-`\(R^2\)` estimator!

--

* Why does this happen? .blue2[Estimation of] `\(\mu_{0}\)` .blue2[and] `\(\mu_{0,s}\)` .blue2[yields only second-order terms, so the estimator behaves as if they are **known**]

--

Under regularity conditions, `\(\psi_{n,s}^*\)` is consistent and nonparametrically efficient.

--

In particular, `\(\sqrt{n}(\psi_{n,s}^* - \psi_{0,s})\)` has a mean-zero normal limit with estimable variance.

[Details in Williamson et al. (2020a)]

---

## Preparing for AMP

<img src="img/amp.png" width="200px" style="display: block; margin: auto;" />

* 611 HIV-1 pseudoviruses
* Outcome: neutralization sensitivity/resistance to antibody

--

**Goal:** pre-screen features for inclusion in the secondary analysis
* 800 individual features, 13 groups of interest

--

Procedure:
1. Estimate `\(\mu_n\)`, `\(\mu_{n,s}\)` using Super Learner [van der Laan et al. (2007)]
2. Estimate and do inference on variable importance `\(\psi_{n,s}^*\)`

.small[ [Details in Magaret et al. (2019) and Williamson et al. (2020b)] ]

---

## Preparing for AMP: SL performance

.pull-left[
<img src="img/sl_perf_ic50.censored.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="img/sl_roc_ic50.censored.png" width="600px" style="display: block; margin: auto;" />
]

---

## Preparing for AMP: R-squared

<img src="img/vim_ic50.censored_pres_r2_conditional_simple.png" height="480px" style="display: block; margin: auto;" />

---

## Preparing for AMP: R-squared

<img src="img/ROC_curve_with_Env_inset_v2.png" width="60%" style="display: block; margin: auto;" />

.small[ Magaret et al. (2019) ]
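---

## Case study: a worked sketch of `\(\psi_{n,s}^*\)`

A minimal sketch (not from the talk) of the difference-in-`\(R^2\)` plug-in estimator from this case study, with simple linear regressions standing in for the machine-learning estimators of `\(\mu_0\)` and `\(\mu_{0,s}\)`; the simulated data are hypothetical:

```r
# Sketch: plug-in difference-in-R^2 estimator psi*_{n,s} for s = {1},
# with linear regressions as stand-ins for machine-learning estimators
set.seed(20210119)
n <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + 0.25 * x2 + rnorm(n)

mu_n  <- fitted(lm(y ~ x1 + x2))  # estimates mu_0(X) = E(Y | X1, X2)
mu_ns <- fitted(lm(y ~ x2))       # estimates mu_{0,s}(X) = E(Y | X2)

var_n <- mean((y - mean(y))^2)    # empirical variance of Y
r2_full    <- 1 - mean((y - mu_n)^2) / var_n
r2_reduced <- 1 - mean((y - mu_ns)^2) / var_n

r2_full - r2_reduced              # estimated importance of X1
```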
---

## Generalization to arbitrary measures

The ANOVA example suggests a natural generalization:

--

* Choose a relevant measure of .blue1[predictiveness] for the task at hand

--

* `\(V(f, P) =\)` .blue1[predictiveness] of function `\(f\)` under sampling from `\(P\)`
* `\(\mathcal{F} =\)` rich class of candidate prediction functions
* `\(\mathcal{F}_{-s} =\)` {all functions in `\(\mathcal{F}\)` that ignore components with index in `\(s\)`} `\(\subset \mathcal{F}\)`

--

* Define the oracle prediction functions `\(f_0:=\)` maximizer of `\(V(f, P_0)\)` over `\(\mathcal{F}\)` & `\(f_{0,s}:=\)` maximizer of `\(V(f, P_0)\)` over `\(\mathcal{F}_{-s}\)`

--

Define the importance of `\((X_j: j \in s)\)` relative to `\(X\)` as

`$$\color{magenta}{\psi_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0) \geq 0}$$`

---

## Generalization to arbitrary measures

Some examples of predictiveness measures:

(arbitrary outcomes) `\(R^2\)`: `\(V(f, P) = 1 - E_P\{Y - f(X)\}^2 / var_P(Y)\)`

--

(binary outcomes) Classification accuracy: `\(V(f, P) = P\{Y = f(X)\}\)`

AUC: `\(V(f, P) = P\{f(X_1) < f(X_2) \mid Y_1 = 0, Y_2 = 1\}\)` for `\((X_1, Y_1) \perp (X_2, Y_2)\)`

Pseudo-`\(R^2\)`: `\(1 - \frac{E_P[Y \log f(X) + (1 - Y)\log \{1 - f(X)\}]}{P(Y = 1)\log P(Y = 1) + P(Y = 0)\log P(Y = 0)}\)`

---

## Generalization to arbitrary measures

How should we make inference on `\(\psi_{0,s}\)`?

--

1. construct estimators `\(f_n\)`, `\(f_{n,s}\)` of `\(f_0\)` and `\(f_{0,s}\)` (e.g., with machine learning)

--

2. plug in:
`$$\psi_{n,s}^* := V(f_n, P_n) - V(f_{n,s}, P_n)$$`
where `\(P_n\)` is the empirical distribution based on the available data

--

3. Inference can be carried out using influence functions.

--

Why? We can write `\(V(f_n, P_n) - V(f_0, P_0) \approx \color{green}{V(f_0, P_n) - V(f_0, P_0)} + \color{blue}{V(f_n, P_0) - V(f_0, P_0)}\)`

--

* the `\(\color{green}{\text{green term}}\)` can be studied using the functional delta method
* the `\(\color{blue}{\text{blue term}}\)` is second-order because `\(f_0\)` maximizes `\(V\)` over `\(\mathcal{F}\)`

--

In other words: `\(f_0\)` and `\(f_{0,s}\)` **can be treated as known** in studying the behavior of `\(\psi_{n,s}^*\)`!

[Details in Williamson et al. (2020b)]

---

## Preparing for AMP: the full picture

<img src="img/vim_ic50.censored_pres_r2_acc_auc_conditional_simple.png" height="480px" style="display: block; margin: auto;" />

---

## Preparing for AMP: the full picture

Implications:

--

* All sites in the VRC01 binding footprint and the CD4 binding sites appear important

--

* Results may differ based on the chosen measure

--

Current work:

--

* New analysis based on the updated CATNAP database
* Incorporating results from the AMP primary analysis (to appear)

---

## Summary (so far)

.blue1[Intrinsic importance = predictiveness potential] of a variable (or set of variables).

--

Simple estimators of importance are
* .blue1[unbiased] and
* .blue2[efficient]

--

even if machine learning techniques are used.
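Implemented in packages `vimp` ([CRAN](https://cran.r-project.org/web/packages/vimp/), [GitHub](https://github.com/bdwilliamson/vimp)) and `vimpy` ([PyPI](https://pypi.org/project/vimpy/)).

A minimal sketch of the R interface (not from the talk): the simulated data and learner library are hypothetical, and the call assumes the `vimp_rsquared()` arguments documented in the package:

```r
# Minimal sketch (not from the talk): R^2-based importance of one feature.
# The data-generating process and Super Learner library are hypothetical.
library("vimp")
library("SuperLearner")

set.seed(4747)
n <- 500
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- 1 + 0.5 * x$x1 + rnorm(n)  # x2 is intrinsically unimportant here

# estimate the R^2-based importance of x2 (column index 2);
# the full and reduced regressions are fit internally via Super Learner
est <- vimp_rsquared(Y = y, X = x, indx = 2, run_regression = TRUE,
                     SL.library = c("SL.glm", "SL.mean"), V = 2)
est  # point estimate, confidence interval, and p-value
```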
--

Several limitations:
* current approach relies on .green[sample-splitting] for hypothesis testing
* .red[highly correlated features] require additional care
* .red[a single bnAb is not likely to confer full protection against HIV-1]

---

## Extension: correlated features

So far: importance of `\((X_j: j \in s)\)` relative to `\(X\)`

--

`\(\color{red}{\text{Potential issue}}\)`: correlated features

Example: two highly correlated features, age and foot size; predicting toddlers' reading ability

--

* True importance of age = 0 (since foot size is in the model)
* True importance of foot size = 0 (since age is in the model)

--

Idea: average the contribution of a feature over all subsets!

--

True importance of age = average(.blue1[increase in predictiveness from adding age to foot size] & .green[increase in predictiveness from using age over nothing])

--

We borrowed ideas from game theory (Shapley values) to develop a subset-averaged framework; a sketch appears in the appendix.

[Details in Williamson and Feng (2020)]

---

## Extension: combination regimens

AMP: testing the PE of a single bnAb. Current regimens involve `\(>1\)` bnAb

--

Key goal of trial networks: prioritizing combination bnAb regimens

--

Developed software `SLAPNAP` ([GitHub](https://github.com/benkeser/slapnap), [DockerHub](https://hub.docker.com/r/slapnap/slapnap)) to aid this effort:
* Can help guide ranking of regimens by predicted PE
* Can help guide secondary analyses in efficacy trials

[Details in Benkeser et al. (2020)]

---

## Current and future work

Collaborative science:
* Longitudinal correlates of risk of hospitalized dengue disease in dengue vaccine trials
* Prioritizing bnAb regimens for further clinical testing
* Identifying metabolomic biomarkers as predictors of breast and colorectal cancer
* Correlates of risk of COVID-19 in vaccine efficacy trials

---

## Current and future work

Statistical methodology:

--

* Variable selection:

--

  * Procedures that control the family-wise error rate and false discovery rate

--

  * Handling complex missing data using imputation and weighting methods

--

* Variable importance:

--

  * Studying and proposing further clinically useful measures

--

  * Improving power in hypothesis testing

--

  * Immune correlates of protection

--

* Incorporating machine learning into phase 4 vaccine safety and effectiveness studies

---

## Closing thoughts

.blue1[Population-based] variable importance:
* wide variety of meaningful measures
* simple estimators
* machine learning okay
* valid inference, testing

Two interpretations:
* conditional
* subset-averaged

.center[
https://github.com/bdwilliamson |
https://bdwilliamson.github.io
]

---

## References

* .small[ Benkeser DC, Williamson BD, Magaret CA, Nizam S, and Gilbert PB. 2020. Super LeArner Prediction of NAb Panels (SLAPNAP): a containerized tool for predicting combination monoclonal broadly neutralizing antibody sensitivity. _bioRxiv technical report_. ]
* .small[ Breiman L. 2001. Random forests. _Machine Learning_. ]
* .small[ Koff WC and Berkley SF. 2010. The renaissance in HIV vaccine development -- future directions. _The New England Journal of Medicine_. ]
* .small[ Magaret CA, Benkeser DC, Williamson BD, et al. 2019. Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features. _PLoS Computational Biology_. ]
* .small[ Pfanzagl J. 1982. Contributions to a General Asymptotic Statistical Theory. _Springer Lecture Notes in Statistics_. ]
* .small[ van der Laan MJ. 2006. Statistical inference for variable importance. _The International Journal of Biostatistics_. ]
* .small[ van der Laan MJ, Polley EC, and Hubbard AE. 2007. Super Learner. _Statistical Applications in Genetics and Molecular Biology_. ]

---

## References

* .small[ Williamson BD, Gilbert PB, Carone M, and Simon N. 2020a. Nonparametric variable importance assessment using machine learning techniques (+ rejoinder to discussion). _Biometrics_. ]
* .small[ Williamson BD, Gilbert PB, Simon N, and Carone M. 2020b. A unified approach for inference on algorithm-agnostic variable importance. _arXiv technical report_. ]
* .small[ Williamson BD and Feng J. 2020. Efficient nonparametric statistical inference on population feature importance using Shapley values. _ICML_. ]
* .small[ Yoon H, Macke J, West AP, et al. 2015. CATNAP: a tool to compile, analyze and tally neutralizing antibody panels. _Nucleic Acids Research_. ]
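---

## Appendix: subset-averaged importance sketch

A back-of-the-envelope sketch (not from the talk) of the subset-averaging idea from the correlated-features extension: with two features, the importance of age is the average of its marginal contributions to the two subsets it can join. The simulated data and the use of linear regressions are hypothetical stand-ins:

```r
# Sketch: subset-averaged (Shapley-style) importance of "age" when age and
# foot size are highly correlated; average age's contribution over the
# subsets it can be added to
set.seed(1)
n <- 1000
age <- rnorm(n)
foot_size <- age + rnorm(n, sd = 0.1)  # highly correlated with age
reading <- 0.8 * age + rnorm(n)

# R^2-type predictiveness of a fitted regression
r2 <- function(fit, y) 1 - mean(residuals(fit)^2) / mean((y - mean(y))^2)

v_none <- 0                                           # null model predicts the mean
v_age  <- r2(lm(reading ~ age), reading)              # age alone
v_foot <- r2(lm(reading ~ foot_size), reading)        # foot size alone
v_both <- r2(lm(reading ~ age + foot_size), reading)  # both features

# average of: adding age to foot size, and using age over nothing
mean(c(v_both - v_foot, v_age - v_none))
```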