class: center, middle, title-slide

# Flexible variable selection in the presence of missing data

### Brian D. Williamson, PhD
Kaiser Permanente Washington Health Research Institute
### 13 June, 2022
https://bdwilliamson.github.io/talks
---

<style type="text/css">
.remark-slide-content {
  font-size: 20px;
}
.remark-slide-content h2 {
  font-size: 1.75rem;
}
</style>

## Acknowledgments

This work was done in collaboration with:

<img src="img/people1.PNG" width="75%" style="display: block; margin: auto;" />

Many thanks for support from NIGMS!

---

## Challenge: variable selection with missing data

Goals of variable selection:
* identify a .green[parsimonious] set of features
* achieve a .blue1[desired level of performance]

--

How do we measure performance?
* .red[error rate control]: not too many unimportant variables were selected
* .blue2[predictiveness]: how well our parsimonious set predicts the outcome

--

Identifying and assessing sets of variables is .red[complex] with missing data

---

## Challenge: variable selection with missing data

Common approaches to handling missing data:
* model the likelihood .small[(Little & Schluchter, 1985; Garcia et al., 2010)]
* use inverse probability weighting .small[(Tsiatis, 2007; Bang & Robins, 2005; Johnson et al., 2008)]
* use multiple imputation .small[(Long & Johnson, 2015; Zhao & Long, 2017; Liu et al., 2019)]

--

Variable selection with missing data: typically performed on the resulting "complete" data

--

Selection is typically based on a generalized linear model:
* lasso .small[(Tibshirani, 1996)] or SCAD .small[(Fan & Li, 2001)]
* knockoffs .small[(Barber & Candès, 2015)]
* stability selection .small[(Meinshausen & Bühlmann, 2010)]

--

Potential issue: .red[model misspecification]

---

## Flexible variable selection

**What is flexible variable selection?**
* Allow specification of flexible (e.g., nonlinear) algorithms
* e.g., trees, forests, ensembles

--

.green[Robustness to model misspecification] depends on the algorithms considered

--

Our work seeks to:
* combine multiple imputation with flexible variable selection
* provide error rate control for intrinsic selection procedures

---

## Variable importance in selection

A natural procedure for flexible selection uses .green[variable importance]:

--

Choose and estimate a measure of variable importance
* .blue1[extrinsic importance] .small[(e.g., Breiman, 2001)]
* .blue1[intrinsic importance] .small[(e.g., Williamson & Feng, 2020)]

--

Select variables based on .blue2[decreasing importance]

--

Extrinsic importance `\(\rightarrow\)` extrinsic selection

Intrinsic importance `\(\rightarrow\)` intrinsic selection

---

## Intrinsic variable importance

Intrinsic variable importance:
* Choose a relevant measure of .blue1[predictiveness] for the task at hand

--

* `\(V(f, P) =\)` .blue1[predictiveness] of function `\(f\)` under sampling from `\(P\)`
* `\(\mathcal{F} =\)` rich class of candidate prediction functions
* `\(\mathcal{F}_{s} =\)` {all functions in `\(\mathcal{F}\)` that use only components with index in `\(s\)`} `\(\subset \mathcal{F}\)`

--

* Define the oracle prediction functions `\(f_{0,s} :=\)` maximizer of `\(V(f, P_0)\)` over `\(\mathcal{F}_{s}\)`

---

## Intrinsic variable importance

Some examples of predictiveness measures:

(arbitrary outcomes) `\(R^2\)`: `\(V(f, P) = 1 - E_P\{Y - f(X)\}^2 / \text{var}_P(Y)\)`

--

(binary outcomes) Classification accuracy: `\(V(f, P) = P\{Y = f(X)\}\)`

AUC: `\(V(f, P) = P\{f(X_1) < f(X_2) \mid Y_1 = 0, Y_2 = 1\}\)` for `\((X_1, Y_1) \perp (X_2, Y_2)\)`

--

<blockquote>
Inference can be carried out using influence functions .small[(Williamson et al., 2021)]
</blockquote>
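
---

## Intrinsic variable importance

For intuition, here is a minimal R sketch of plug-in (empirical) versions of the measures above. It assumes a binary outcome vector `y` and fitted values `preds` from some prediction function, thresholds `preds` at 0.5 for classification accuracy, and is illustrative only (it is not the estimation routine used in our work):

```r
# Plug-in estimates of the predictiveness measures, given an outcome vector y
# and fitted values preds from any prediction function f
r_squared <- function(y, preds) {
  1 - mean((y - preds)^2) / var(y)
}

accuracy <- function(y, preds) {
  # accuracy of the classifier obtained by thresholding preds at 0.5
  mean(y == as.numeric(preds > 0.5))
}

auc <- function(y, preds) {
  # empirical P{f(X_1) < f(X_2) | Y_1 = 0, Y_2 = 1} over all case-control pairs
  cases <- preds[y == 1]
  controls <- preds[y == 0]
  mean(outer(controls, cases, "<"))
}
```

In practice these would be evaluated on held-out data, with inference via the influence-function approach cited above.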
---

## Intrinsic variable selection

One possible measure: SPVIM .small[(Williamson & Feng, 2020)]

`$$\psi_{0,j} := \sum_{s \subseteq \{1, \ldots, p\} \setminus \{j\}} \frac{1}{p}\binom{p - 1}{\lvert s \rvert}^{-1}\{V(f_{0,s\cup\{j\}}, P_0) - V(f_{0,s}, P_0)\},$$`

where:
* `\(V\)` = predictiveness (e.g., `\(R^2\)`)
* `\(f_{0,s} \in \text{arg max}_{f \in \mathcal{F}_s} V(f, P_0)\)` for the class of candidate functions `\(\mathcal{F}_s\)`

--

<blockquote>
If \(\psi_{0,j} = 0\), then \(X_j\) .red[does not improve] population prediction potential when added to .green[any subset] of the remaining features
</blockquote>

--

Idea: test variable importance for selection, i.e., test
`$$H_0: \psi_{0,j} = 0 \text{ vs } H_1: \psi_{0,j} > 0 \text{ for each } j \in \{1, \ldots, p\}$$`

---

## Intrinsic variable selection

How can we select variables based on `\(\psi_{0,j}\)`?

--

1. Estimate `\(\psi_0 = \{\psi_{0,j}\}_{j=1}^p\)`, obtain p-values `\(p_{n,j}\)` for each test of zero importance

--

2. Compute .blue1[adjusted] p-values `\(\tilde{p}_{n,j}\)` to control the .blue2[family-wise error rate] (FWER)

--

3. Set `\(S_n(\alpha) = \{j \in \{1, \ldots, p\}: \tilde{p}_{n,j} < \alpha\}\)`

--

4. For `\(k \in \{0, \ldots, p - \lvert S_n(\alpha)\rvert\}\)`, determine the .blue1[augmentation set]
`$$A_n(k, \alpha) = \{\ell \in S_n^c(\alpha): \tilde{p}_{n,\ell} \leq \tilde{p}_{n,(k)}\}$$`
(here `\(\tilde{p}_{n,(k)}\)` is the `\(k\)`th smallest adjusted p-value in `\(S_n^c(\alpha)\)`; if `\(k = 0\)`, `\(A_n(k, \alpha) = \emptyset\)`)

--

5. Final set of selected variables: `\(S_n^+(k, \alpha) = S_n(\alpha) \cup A_n(k, \alpha)\)` .small[(a toy sketch of steps 2-5 appears in the appendix)]

---

## Intrinsic variable selection

The .blue1[augmentation set] allows for error control using a tuning parameter `\(k \in \{0, \ldots, p - \lvert S_n(\alpha)\rvert\}\)`

--

Examples of more general error rates:
* .blue2[generalized FWER]: probability of making at least `\(k + 1\)` type I errors
* probability that the .blue2[proportion of false positives] among the rejected variables exceeds a threshold `\(q \in (0, 1)\)`
* .blue2[false discovery rate]

--

`\(k\)` can be determined to .blue1[control one of these error rates]!

And the resulting procedure is .blue1[persistent] .small[ [Details in Williamson and Huang (2022)] ]

---

## Intrinsic selection with missing data

How can we select variables based on `\(\psi_{0,j}\)` with missing data?

--

1. Multiply impute the data

--

2. Estimate `\(\psi_0\)` on each imputed dataset

--

3. Use Rubin's rules: combine estimates and standard errors; obtain p-values

--

4.
Proceed as with complete data --- ## Numerical results Selection procedures: * lasso * lasso + stability selection .small[ [Meinshausen and Buhlmann (2010)] ] * lasso + knockoffs .small[ [Barber and Candes (2015)] ] -- * intrinsic selection (SPVIM with augmentation) -- Two settings: (binary outcome, continuous features, varying missing data) * linear outcome-feature relationship (setting 1) * 6 important features (some very important, some weakly important) * `\(p \in \{30, 500\}\)` -- * nonlinear outcome-feature relationship, correlated features (setting 2) * 3 important features (all equally weakly important) * `\(p = 6\)` --- ## Numerical results: setting 1 <img src="img/binomial-probit-linear-normal-nested_talks-auc.png" width="960" style="display: block; margin: auto;" /> --- ## Numerical results: setting 1 <img src="img/binomial-probit-linear-normal-nested_talks-sens-spec.png" width="960" style="display: block; margin: auto;" /> --- ## Numerical results: setting 2 <img src="img/nonlinear-normal-correlated_talks-auc.png" width="960" style="display: block; margin: auto;" /> --- ## Numerical results: setting 2 <img src="img/nonlinear-normal-correlated_talks-sens-spec.png" width="960" style="display: block; margin: auto;" /> --- ## Classifying pancreatic cysts Cross-validated AUC of each procedure for selecting sets: <img src="img/data-analysis_v2.png" width="960" style="display: block; margin: auto;" /> --- ## Perspective and closing thoughts * Error rate control possible with flexible intrinsic selection * Computation time can be .red[quite large] depending on flexibility * Choice of selection procedure likely depends on trade-off between: * computation time * desired sensitivity/specificity * Important to use flexible final prediction algorithm regardless * R package [`flevr`](https://github.com/bdwilliamson/flevr) <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> https://github.com/bdwilliamson | <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 
22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"></path></svg> https://bdwilliamson.github.io --- ## References <style type="text/css">div.smalllist ul li { margin-bottom: 4px; line-height: 11px; } </style> .small[Many thanks for tuning in!] .pull-left[ .smalllist[ * .tiny[Barber, R. and E.J. Candès (2015). Controlling the false discovery rate via knockoffs. *Annals of Statistics*.] * .tiny[Bang, H. and J. Robins (2005). Doubly robust estimation in missing data and causal inference models. *Biometrics*.] * .tiny[Breiman, L. (2001). Random forests. *Machine Learning*.] * .tiny[Dudoit, S., J. Shaffer, and J. Boldrick (2003). Multiple hypothesis testing in microarray experiments. *Statistical Science*.] * .tiny[Dudoit, S. and M.J. van der Laan (2008). *Multiple testing procedures with applications to genomics*. Springer Science & Business Media.] * .tiny[Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. *Journal of the American Statistical Association*.] * .tiny[ Garcia, R., J. Ibrahim, and H. Zhu (2010). Variable selection for regression models with missing data. *Statistica Sinica*. ] * .tiny[Heymans, M., S. van Buuren, D. Knol, et al. (2007). Variable selection under multiple imputation using the bootstrap in a prognostic study. *BMC Medical Research Methodology*.] * .tiny[Johnson, B., D. Lin, and D. Zeng (2008). Penalized estimating functions and variable selection in semiparametric regression models. *Journal of the American Statistical Association*.] * .tiny[ Little, R. and M. Schluchter (1985). Maximum likelihood estimation for mixed continuous and categorical data with missing values. *Biometrika*.] * .tiny[Liu, L., Y. Qiu, L. Natarajan, and K. Messer (2019). Imputation and post-selection inference in models with missing data: an application to colorectal cancer surveillance guidelines. *Annals of Applied Statistics*.] ] ] .pull-right[ .smalllist[ * .tiny[Long, Q. and B. Johnson (2015). Variable selection in the presence of missing data: resampling and imputation. *Biostatistics*.] * .tiny[Meinshausen, N. and P. Bühlmann (2010). Stability selection. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*.] * .tiny[Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*.] * .tiny[Tsiatis, A. (2007). *Semiparametric theory and missing data*. Springer Science & Business Media.] * .tiny[van der Laan, M.J., E. Polley, and A. Hubbard (2007). Super learner. *Statistical Applications in Genetics and Molecular Biology*.] * .tiny[Williamson, B.D. and J. Feng (2020). Efficient nonparametric statistical inference on population feature importance using Shapley values. In *Proceedings of the 37th International Conference on Machine Learning*.] * .tiny[ Williamson BD, Gilbert P, Simon N, and Carone M (2021). 
A general framework for inference on algorithm-agnostic variable importance. _Journal of the American Statistical Association_. ] * .tiny[Williamson, B.D. and Y. Huang (2022). Flexible variable selection in the presence of missing data. *arXiv*.] * .tiny[Zhao, Y. and Q. Long (2017). Variable selection in the presence of missing data: imputation-based methods. *Wiley Interdisciplinary Reviews: Computational Statistics*.] ] ]
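
---

## Appendix: toy sketch of augmented selection

A minimal base-R sketch of steps 2-5 of the complete-data selection procedure (referenced earlier). It is illustrative only and is not the `flevr` implementation: it assumes a vector `p_vals` of unadjusted p-values from the importance tests (pooled across imputations via Rubin's rules in the missing-data case), uses Holm adjustment for concreteness, and takes `k` as a fixed input, whereas in practice `k` would be chosen to control a target error rate (gFWER, proportion of false positives, or FDR; Williamson and Huang, 2022).

```r
# Toy sketch of steps 2-5 (complete-data case); not the flevr implementation
# p_vals: unadjusted p-values for H_0: psi_{0,j} = 0, j = 1, ..., p
select_augmented <- function(p_vals, alpha = 0.05, k = 1) {
  p_tilde <- p.adjust(p_vals, method = "holm")   # step 2: FWER-adjusted p-values
  base_set <- which(p_tilde < alpha)             # step 3: S_n(alpha)
  complement <- setdiff(seq_along(p_vals), base_set)
  k <- min(k, length(complement))
  # step 4: augmentation set = k smallest adjusted p-values among the complement
  aug_set <- if (k > 0) complement[order(p_tilde[complement])[seq_len(k)]] else integer(0)
  sort(c(base_set, aug_set))                     # step 5: S_n^+(k, alpha)
}

# example with hypothetical p-values for p = 6 features
p_vals <- c(0.001, 0.02, 0.3, 0.004, 0.6, 0.08)
select_augmented(p_vals, alpha = 0.05, k = 1)
```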