class: center, middle, title-slide

# Inference for model-agnostic variable importance

### Brian D. Williamson, PhD
Kaiser Permanente Washington Health Research Institute
### 21 June, 2022
https://bdwilliamson.github.io/talks

---

<style type="text/css">
.remark-slide-content {
  font-size: 20px;
}
.remark-slide-content h2 {
  font-size: 1.75rem;
}
</style>

## Acknowledgments

This work was done in collaboration with:

<img src="img/people1.PNG" width="65%" style="display: block; margin: auto;" />
<img src="img/people3.png" width="50%" style="display: block; margin: auto;" />

---

## Variable importance: what and why

**What is variable importance?**

--

* .blue1[Quantification of "contributions" of a variable] (or a set of variables)

--

Traditionally: contribution to .blue2[predictions]

--

* Useful to distinguish between contributions to predictions...

--

* (.blue1[extrinsic importance]) ... .blue1[by a given (possibly black-box) algorithm] .small[ [e.g., Breiman (2001)] ]

--

* (.blue1[intrinsic importance]) ... .blue1[by the best possible (i.e., oracle) algorithm] .small[ [e.g., van der Laan (2006)] ]

--

* Our work focuses on .blue1[interpretable, model-agnostic intrinsic importance]

--

Example uses of .blue2[intrinsic] variable importance:

* is it worth extracting text from notes in the EHR for the sake of predicting hospital readmission?

--

* is it worth collecting a given covariate for the sake of predicting neutralization sensitivity?

---

## Case study: ANOVA importance

Data unit `\((X, Y) \sim P_0\)` with:

* outcome `\(Y\)`
* covariate vector `\(X := (X_1, X_2, \ldots, X_p)\)`

--

**Goals:**

* .green[estimate]
* .blue1[and do inference on]

the importance of `\((X_j: j \in s)\)` in predicting `\(Y\)`

--

How do we typically do this in **linear regression**?

---

## Case study: ANOVA importance

How do we typically do this in **linear regression**?

* Fit a linear regression of `\(Y\)` on `\(X\)` `\(\rightarrow \color{magenta}{\mu_n(X)}\)`

--

* Fit a linear regression of `\(Y\)` on `\(X_{-s}\)` `\(\rightarrow \color{magenta}{\mu_{n,-s}(X)}\)`

--

* .green[Compare the fitted values] `\([\mu_n(X_i), \mu_{n,-s}(X_i)]\)`

--

Many ways to compare fitted values, including:

* ANOVA decomposition
* Difference in `\(R^2\)`

---

## Case study: ANOVA importance

Difference in `\(R^2\)`:

`$$\left[1 - \frac{n^{-1}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{n^{-1}\sum_{i=1}^n\{Y_i - \overline{Y}_n\}^2}\right] - \left[1 - \frac{n^{-1}\sum_{i=1}^n\{Y_i - \mu_{n,-s}(X_i)\}^2}{n^{-1}\sum_{i=1}^n\{Y_i - \overline{Y}_n\}^2}\right]$$`

--

Inference:

* Test of the difference
* Valid confidence interval

---

## Case study: ANOVA importance

Consider the .blue1[population parameter]

`$$\psi_{0,s} = \frac{E_0\{\mu_0(X) - \mu_{0,-s}(X)\}^2}{var_0(Y)}$$`

* `\(\mu_0(x) := E_0(Y \mid X = x)\)` .blue1[(true conditional mean)]
* `\(\mu_{0,-s}(x) := E_0(Y \mid X_{-s} = x_{-s})\)` [for a vector `\(z\)`, `\(z_{-s}\)` represents `\((z_j: j \notin s)\)`]

--

* .blue2[nonparametric extension] of the linear regression-based ANOVA parameter

--

* Can be expressed as a `\(\color{magenta}{\text{difference in population } R^2}\)` values, since

`$$\color{magenta}{\psi_{0,s} = \left[1 - \frac{E_0\{Y - \mu_0(X)\}^2}{var_0(Y)}\right] - \left[1 - \frac{E_0\{Y - \mu_{0,-s}(X)\}^2}{var_0(Y)}\right]}$$`

---

## Case study: ANOVA importance

How should we make inference on `\(\psi_{0,s}\)`?

--

1. construct estimators `\(\mu_n\)`, `\(\mu_{n,-s}\)` of `\(\mu_0\)` and `\(\mu_{0,-s}\)` (e.g., with machine learning)

--

2. plug in:
`$$\psi_{n,s} := \frac{\frac{1}{n}\sum_{i=1}^n \{\mu_n(X_i) - \mu_{n,-s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}$$`

--

but this estimator has .red[asymptotic bias]

--

3. using influence function-based debiasing [e.g., Pfanzagl (1982)], we get the estimator
`$$\color{magenta}{\psi_{n,s}^* := \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right] - \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_{n,-s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right]}$$`
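
---

## Case study: ANOVA importance

As a concrete illustration, a minimal R sketch of this estimator via its difference-in- `\(R^2\)` representation; linear regression stands in for the (possibly machine-learned) fits `\(\mu_n\)` and `\(\mu_{n,-s}\)`, and the simulated data and variable names are purely illustrative:

```r
# simulated data, purely illustrative
set.seed(1234)
n <- 1000
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- 1 + 2 * x$x1 + 0.5 * x$x2 + rnorm(n)

# estimate mu_n (all covariates) and mu_{n,-s} (dropping x1, i.e., s = {1});
# any regression procedure, including machine learning, could replace lm()
mu_n    <- lm(y ~ x1 + x2, data = x)
mu_n_ms <- lm(y ~ x2, data = x)

# plug the fitted values into the difference-in-R^2 representation
var_y    <- mean((y - mean(y))^2)
psi_star <- (1 - mean((y - fitted(mu_n))^2) / var_y) -
  (1 - mean((y - fitted(mu_n_ms))^2) / var_y)
psi_star
```

Linear regression is used only to keep the sketch self-contained; the same plug-in computation applies to any fitted `\(\mu_n\)` and `\(\mu_{n,-s}\)`.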

---

## Case study: ANOVA importance

`$$\color{magenta}{\psi_{n,s}^* := \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_n(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right] - \left[1 - \frac{\frac{1}{n}\sum_{i=1}^n\{Y_i - \mu_{n,-s}(X_i)\}^2}{\frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y}_n)^2}\right]}$$`

Key observations:

* `\(\psi_{n,s}^* =\)` plug-in estimator of `\(\psi_{0,s}\)` based on the difference-in- `\(R^2\)` representation

--

* .blue1[No need to debias] the difference-in- `\(R^2\)` estimator!

--

* Why does this happen? .blue2[Estimation of] `\(\mu_{0}\)` .blue2[and] `\(\mu_{0,-s}\)` .blue2[yields only second-order terms, so the estimator behaves as if they were **known**]

--

Under regularity conditions, `\(\psi_{n,s}^*\)` is consistent and nonparametrically efficient.

--

In particular, `\(\sqrt{n}(\psi_{n,s}^* - \psi_{0,s})\)` has a mean-zero normal limit with estimable variance.

[Details in Williamson et al. (2020)]

---

## Preparing for AMP

<img src="img/amp.png" width="200px" style="display: block; margin: auto;" />

* 611 HIV-1 pseudoviruses
* Outcome: neutralization sensitivity/resistance to antibody

--

**Goal:** pre-screen features for inclusion in secondary analysis

* 800 individual features, 13 groups of interest

--

Procedure:

1. Estimate `\(\mu_n\)`, `\(\mu_{n,-s}\)` using Super Learner [van der Laan et al. (2007)]
2. Estimate and do inference on variable importance `\(\psi_{n,s}^*\)`

.small[ [Details in Magaret et al. (2019) and Williamson et al. (2021b)] ]

---

## Preparing for AMP: R-squared

<img src="img/vim_ic50.censored_pres_r2_conditional_simple.png" height="480px" style="display: block; margin: auto;" />

---

## Generalization to arbitrary measures

The ANOVA example suggests a natural generalization:

--

* Choose a relevant measure of .blue1[predictiveness] for the task at hand

--

* `\(V(f, P) =\)` .blue1[predictiveness] of function `\(f\)` under sampling from `\(P\)`
* `\(\mathcal{F} =\)` rich class of candidate prediction functions
* `\(\mathcal{F}_{-s} =\)` {all functions in `\(\mathcal{F}\)` that ignore components with index in `\(s\)`} `\(\subset \mathcal{F}\)`

--

* Define the oracle prediction functions `\(f_0:=\)` maximizer of `\(V(f, P_0)\)` over `\(\mathcal{F}\)` & `\(f_{0,-s}:=\)` maximizer of `\(V(f, P_0)\)` over `\(\mathcal{F}_{-s}\)`

--

Define the importance of `\((X_j: j \in s)\)` relative to `\(X\)` as

`$$\color{magenta}{\psi_{0,s} := V(f_0, P_0) - V(f_{0,-s}, P_0) \geq 0}$$`

---

## Generalization to arbitrary measures

Some examples of predictiveness measures:

(arbitrary outcomes) `\(R^2\)`: `\(V(f, P) = 1 - E_P\{Y - f(X)\}^2 / var_P(Y)\)`

--

(binary outcomes)

Classification accuracy: `\(V(f, P) = P\{Y = f(X)\}\)`

AUC: `\(V(f, P) = P\{f(X_1) < f(X_2) \mid Y_1 = 0, Y_2 = 1\}\)` for `\((X_1, Y_1) \perp (X_2, Y_2)\)`

Pseudo- `\(R^2\)`: `\(1 - \frac{E_P[Y \log f(X) + (1 - Y)\log \{1 - f(X)\}]}{P(Y = 1)\log P(Y = 1) + P(Y = 0)\log P(Y = 0)}\)`
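
---

## Generalization to arbitrary measures

A minimal R sketch of empirical analogues `\(V(f, P_n)\)` of these measures; the function names are illustrative, `f` denotes fitted values, predicted labels, or predicted probabilities as appropriate, and `y` the observed outcomes (coded 0/1 for the binary measures):

```r
# R^2: one minus the mean squared error, scaled by the variance of y
r_squared <- function(f, y) 1 - mean((y - f)^2) / mean((y - mean(y))^2)

# classification accuracy: f contains predicted class labels (binary y)
accuracy <- function(f, y) mean(f == y)

# AUC: probability that a randomly chosen case is ranked above a control
# (ties ignored); f contains predicted scores or probabilities (binary y)
auc <- function(f, y) mean(outer(f[y == 1], f[y == 0], FUN = ">"))

# pseudo-R^2: one minus the ratio of the model log-likelihood to the
# null-model log-likelihood; f contains predicted probabilities P(Y = 1 | X)
pseudo_r_squared <- function(f, y) {
  p1 <- mean(y)
  1 - mean(y * log(f) + (1 - y) * log(1 - f)) /
    (p1 * log(p1) + (1 - p1) * log(1 - p1))
}
```

Plugging the fitted values from the full and reduced regressions into any one of these functions yields the corresponding plug-in importance estimate `\(V(f_n, P_n) - V(f_{n,-s}, P_n)\)` discussed on the next slide.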

---

## Generalization to arbitrary measures

How should we make inference on `\(\psi_{0,s}\)`?

--

1. construct estimators `\(f_n\)`, `\(f_{n,-s}\)` of `\(f_0\)` and `\(f_{0,-s}\)` (e.g., with machine learning)

--

2. plug in:
`$$\psi_{n,s}^* := V(f_n, P_n) - V(f_{n,-s}, P_n)$$`
where `\(P_n\)` is the empirical distribution based on the available data

--

3. Inference can be carried out using influence functions.

--

Why? We can write `\(V(f_n, P_n) - V(f_{0}, P_0) \approx \color{green}{V(f_0, P_n) - V(f_0, P_0)} + \color{blue}{V(f_n, P_0) - V(f_0, P_0)}\)`

--

* the `\(\color{green}{\text{green term}}\)` can be studied using the functional delta method
* the `\(\color{blue}{\text{blue term}}\)` is second-order because `\(f_0\)` maximizes `\(V\)` over `\(\mathcal{F}\)`

--

In other words: `\(f_0\)` and `\(f_{0,-s}\)` **can be treated as known** in studying the behavior of `\(\psi_{n,s}^*\)`!

[Details in Williamson et al. (2021b)]

---

## Preparing for AMP: the full picture

<img src="img/vim_ic50.censored_pres_r2_acc_auc_conditional_simple.png" height="480px" style="display: block; margin: auto;" />

---

## Importance of cross-fitting

A key regularity condition: `\(f_n\)` and `\(f_{n,-s}\)` .red[not too complex]

--

This condition can be removed using .blue1[cross-fitting]

--

Experiments:

* 2 important features, `\(p - 2\)` noise features, binary outcome
* Importance measured via accuracy
* Scenario 1: `\(p = 4\)`, covariates independent
* Scenario 2: `\(p \in \{50, 100, 200\}\)`, some covariates correlated

---

## Cross-fitting: Scenario 1

<img src="img/main_null_accuracy_3_p_4_boot_0.png" height="360px" style="display: block; margin: auto;" />

---

## Cross-fitting: Scenario 2

<img src="img/supplement_highdim_accuracy_1.png" width="600px" height="450px" style="display: block; margin: auto;" />

---

## Extension: correlated features

So far: importance of `\((X_j: j \in s)\)` relative to `\(X\)`

--

`\(\color{red}{\text{Potential issue}}\)`: correlated features

Example: two highly correlated features, age and foot size; predicting toddlers' reading ability

--

* True importance of age = 0 (since foot size is in the model)
* True importance of foot size = 0 (since age is in the model)

--

Idea: average the contribution of a feature over all subsets!

--

True importance of age = average(.blue1[increase in predictiveness from adding age to foot size] & .green[increase in predictiveness from using age over nothing])

--

We borrowed ideas from game theory to develop a subset-averaged (Shapley value-based) framework.

Details in Williamson and Feng (2020)

---

## Extension: longitudinal VIMs

So far: cross-sectional variable importance

Can we do inference on variable importance longitudinally?

<img src="index_files/figure-html/longitudinal-vim-1.png" width="480px" height="360px" style="display: block; margin: auto;" />

---

## Extension: longitudinal VIMs

<img src="index_files/figure-html/longitudinal-vim-2-1.png" width="360px" height="240px" style="display: block; margin: auto;" />

.blue2[Yes!] And we can do inference on summaries of the time series:

* .blue1[average importance] over a contiguous subset of time points
* .green[linear trend] over a contiguous subset of time points
* .blue2[area under the trajectory] over a contiguous subset of time points

Inference relies on the functional delta method. Manuscript with details forthcoming.
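
---

## Estimating VIMs in practice

A minimal usage sketch with the `vimp` R package (see the closing slide for links): a cross-fitted `\(R^2\)`-based importance estimate via Super Learner, with an influence function-based confidence interval. The simulated data are purely illustrative, and the `vimp_rsquared()` call follows the package README, so argument names may differ across package versions:

```r
library("vimp")          # R package implementing these VIM estimators
library("SuperLearner")  # used for the nuisance regressions

# simulated data, purely illustrative
set.seed(5678)
n <- 1000
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- 1 + 2 * x$x1 + 0.5 * x$x2 + rnorm(n)

# cross-fitted R^2-based importance of x1 (indx = 1), with a point estimate,
# confidence interval, and p-value; add more flexible learners
# (e.g., "SL.ranger") in a real analysis
est <- vimp_rsquared(Y = y, X = x, indx = 1, run_regression = TRUE,
                     SL.library = c("SL.glm", "SL.mean"), V = 2)
est
```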

---

## Closing thoughts

.blue1[Population-based] variable importance:

* wide variety of meaningful measures
* simple estimators
* machine learning okay
* valid inference, testing

Check out the software:

* [R package `vimp`](https://github.com/bdwilliamson/vimp)
* [Python package `vimpy`](https://github.com/bdwilliamson/vimpy)

https://github.com/bdwilliamson | https://bdwilliamson.github.io

---

## References

* .small[ Breiman L. 2001. Random forests. _Machine Learning_.]
* .small[ Breiman L. 2001. Statistical modeling: the two cultures. _Statistical Science_.]
* .small[ Chernozhukov V et al. 2018. Double/debiased machine learning for treatment and structural parameters. _The Econometrics Journal_.]
* .small[ Magaret CA, Benkeser DC, Williamson BD, et al. 2019. Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features. _PLoS Computational Biology_.]
* .small[ van der Laan MJ. 2006. Statistical inference for variable importance. _The International Journal of Biostatistics_.]
* .small[ van der Laan MJ, Polley EC, and Hubbard AE. 2007. Super Learner. _Statistical Applications in Genetics and Molecular Biology_.]
* .small[ van der Laan MJ and Rose S. 2011. Targeted learning: causal inference for observational and experimental data. _Springer_.]

---

## References

* .small[ Williamson BD, Magaret CA, Gilbert PB, Nizam S, Simmons C, and Benkeser DC. 2021a. Super LeArner Prediction of NAb Panels (SLAPNAP): a containerized tool for predicting combination monoclonal broadly neutralizing antibody sensitivity. _Bioinformatics_.]
* .small[ Williamson BD, Gilbert P, Carone M, and Simon N. 2020. Nonparametric variable importance assessment using machine learning techniques (+ rejoinder to discussion). _Biometrics_.]
* .small[ Williamson BD, Gilbert P, Simon N, and Carone M. 2021b. A general framework for inference on algorithm-agnostic variable importance. _Journal of the American Statistical Association_.]
* .small[ Williamson BD and Feng J. 2020. Efficient nonparametric statistical inference on population feature importance using Shapley values. _ICML_.]