library("vimp")
#> vimp version 2.3.4: Perform Inference on Algorithm-Agnostic Variable Importance
library("SuperLearner")
#> Loading required package: nnls
#> Loading required package: gam
#> Loading required package: splines
#> Loading required package: foreach
#> Loaded gam 1.22-5
#> Super Learner
#> Version: 2.0-29
#> Package created on 2024-02-06
In the main vignette, I discussed variable importance defined using R-squared. I also mentioned that all of the analyses were carried out using a conditional variable importance measure. In this document, I will discuss all three types of variable importance that may be computed using vimp.
In general, I define variable importance as a function of the true population distribution (denoted by $P_0$) and a predictiveness measure $V$; large values of $V$ are assumed to be better. Currently, the measures implemented in vimp are $R^2$, classification accuracy, area under the receiver operating characteristic curve (AUC), and deviance. For a fixed function $f$, the predictiveness is given by $V(f, P_0)$, where large values imply that $f$ is a good predictor of the outcome. The best possible prediction function, $f_0$, is the oracle model, i.e., the prediction function that I would use if I had access to the distribution $P_0$. Often, $f_0$ is the true conditional mean (e.g., for $R^2$). Then the total oracle predictiveness can be defined as $V(f_0, P_0)$. This is the best possible value of predictiveness.
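To make the notion of predictiveness concrete, here is a small base-R sketch that does not use vimp. The simulated data, the helper `empirical_auc`, and the variable names are assumptions made purely for illustration. It evaluates the empirical AUC of two fixed prediction functions: the true conditional probability (which plays the role of the oracle here, since it is one maximizer of AUC) and a constant prediction that ignores the covariate.

```r
# Empirical AUC of predictions f_x for a binary outcome y (coded 0/1),
# computed from the Mann-Whitney rank statistic; midranks handle ties.
empirical_auc <- function(f_x, y) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(rank(f_x)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
n <- 500
x_sim <- rnorm(n)
p_true <- plogis(1.5 * x_sim)        # true conditional probability of y = 1
y_sim <- rbinom(n, 1, p_true)

# predictiveness of the oracle-like function (uses the covariate)
auc_oracle <- empirical_auc(p_true, y_sim)
# predictiveness of a constant prediction (uses no covariates)
auc_null <- empirical_auc(rep(mean(y_sim), n), y_sim)
```

A constant prediction ties every observation, so its empirical AUC is exactly 0.5; the gap between `auc_oracle` and `auc_null` is the kind of contrast in predictiveness that the measures in this document are built from.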
I define variable importance measures (VIMs) as contrasts in oracle predictiveness. The oracle models that I plug in determine what type of variable importance is being considered, as I outline below. For the remainder of this document, suppose that I have $p$ variables, and an index set $s$ of interest (containing some subset of the $p$ variables).

Throughout this document, I will use the VRC01 data (Magaret, Benkeser, Williamson, et al. 2019), a subset of the data freely available from the Los Alamos National Laboratory’s Compile, Neutralize, and Tally Neutralizing Antibody Panels database. Information about these data is available here. Throughout, I will also use a simple library of learners for the Super Learner (this is for illustration only; in practice, I suggest using a large library of learners, as outlined in the main vignette). Finally, I will use the area under the receiver operating characteristic curve (AUC) to measure importance.
# read in the data
data("vrc01")
# load packages used to select columns
library("dplyr")
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library("tidyselect")
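# retain only the columns of interest for this analysis:
# the outcome (ic50.censored) and the geography, subtype, and length features
y <- vrc01$ic50.censored
X <- vrc01 %>%
  select(starts_with("geog"), starts_with("subtype"), starts_with("length"))
# a single simple learner, for illustration only
learners <- "SL.glm"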
The reduced oracle predictiveness is defined as $V(f_{0,-s}, P_0)$, where $f_{0,-s}$ is the best possible prediction function that does not use the covariates with index in $s$. Then the conditional VIM is defined as
$$\psi_{0,c,s} := V(f_0, P_0) - V(f_{0,-s}, P_0).$$
This is the measure of importance that I estimated in the main vignette. Here, I estimate the conditional VIM for whether or not subtype is 01_AE (the fifth column of X); the estimate is stored in subtype_01_AE_cond.
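A minimal sketch of such a call with `vimp_auc`, storing the result in `subtype_01_AE_cond` (the object printed later in this document). It assumes `y`, `X`, and `learners` from the data-preparation code above. The seed and the fold settings in `V` and `sl_cvcontrol` are assumptions chosen to keep the example fast, so the output need not match the estimates printed later. The call uses `indx = 5` because the comparison later in this document reads off row s = 5 for this feature.

```r
# assumed settings: a small number of folds, for illustration only
set.seed(1234)                # arbitrary seed (assumption)
V <- 2                        # cross-fitting folds (assumed value)
sl_cvcontrol <- list(V = 2)   # Super Learner internal CV folds (assumed value)

# conditional VIM: compare all features against all features except column 5
subtype_01_AE_cond <- vimp_auc(
  Y = y, X = X, indx = 5, SL.library = learners,
  na.rm = TRUE, V = V, cvControl = sl_cvcontrol
)
```

The `cvControl` argument is not a named argument of `vimp_auc`; it is passed through to the Super Learner, in the same way as in the `sp_vim` call later in this document.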
The marginal oracle predictiveness is defined as $V(f_{0,s}, P_0)$, where $f_{0,s}$ is the best possible prediction function that only uses the covariates with index in $s$. The null oracle predictiveness is defined as $V(f_{0,\emptyset}, P_0)$, where $f_{0,\emptyset}$ is the best possible prediction function that uses no covariates (i.e., is fitting the mean). Then the marginal VIM is defined as
$$\psi_{0,m,s} := V(f_{0,s}, P_0) - V(f_{0,\emptyset}, P_0).$$
Here, I estimate the marginal VIM for whether or not subtype is 01_AE; the estimate is stored in subtype_01_AE_marg.
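A minimal sketch of one way to obtain `subtype_01_AE_marg` (the object printed later in this document), again assuming `y`, `X`, and `learners` from the data-preparation code above. The idea is to restrict `X` to the single feature of interest: the full model then uses only that feature and the reduced model uses no covariates, which matches the marginal contrast defined above. The seed and the fold counts are assumptions chosen for speed, so the output need not match the printed estimates.

```r
set.seed(5678)  # arbitrary seed (assumption)

# marginal VIM: within the one-column matrix, the feature has index 1
subtype_01_AE_marg <- vimp_auc(
  Y = y, X = X[, 5, drop = FALSE], indx = 1, SL.library = learners,
  na.rm = TRUE, V = 2, cvControl = list(V = 2)  # assumed small fold counts
)
```

Because the feature is the only column passed in, the estimate is labeled s = 1 rather than s = 5 when printed.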
The Shapley population VIM (SPVIM) generalizes the marginal and conditional VIMs by averaging over all possible subsets. More specifically, the SPVIM for feature $j$ is given by
$$\psi_{0,j} := \sum_{s \subseteq \{1,\ldots,p\} \setminus \{j\}} \frac{1}{p}\binom{p-1}{|s|}^{-1}\{V(f_{0,s \cup \{j\}}, P_0) - V(f_{0,s}, P_0)\};$$
this is the average gain in predictiveness from adding feature $j$ to each possible grouping of the other features. To estimate the SPVIM for whether or not subtype is 01_AE, I can use the following code (note that sp_vim returns VIM estimates for all features):
set.seed(91011)
all_vim_spvim <- sp_vim(Y = y, X = X, type = "auc", SL.library = learners,
                        na.rm = TRUE, V = V, cvControl = sl_cvcontrol,
                        env = environment())
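To illustrate the Shapley weighting itself, here is a small base-R sketch that does not use vimp and is not how `sp_vim` computes its estimates. It enumerates every subset of three simulated features, uses the in-sample AUC of a logistic regression as a stand-in for the oracle predictiveness of each subset, and averages the gains from adding each feature. The simulated data and the helpers `empirical_auc` and `subset_auc` are assumptions made for illustration.

```r
# Empirical AUC from the Mann-Whitney rank statistic (y coded 0/1).
empirical_auc <- function(f_x, y) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(rank(f_x)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# In-sample AUC of a logistic regression using only the columns in s.
# With no features, predictions are constant, so the AUC is 0.5.
subset_auc <- function(s, x, y) {
  if (length(s) == 0) return(0.5)
  fit <- glm(y ~ x[, s, drop = FALSE], family = binomial())
  empirical_auc(fitted(fit), y)
}

set.seed(2)
n <- 1000
p <- 3
x_sim <- matrix(rnorm(n * p), ncol = p)
y_sim <- rbinom(n, 1, plogis(x_sim[, 1] + 0.5 * x_sim[, 2]))  # feature 3 is noise

# All 2^p subsets as rows of a logical membership matrix; the first column
# varies fastest, so adding feature j to the subset in row i gives row i + 2^(j - 1).
subsets <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), p)))
v <- unname(apply(subsets, 1, function(z) subset_auc(which(z), x_sim, y_sim)))

# Shapley value of feature j: weighted average gain from adding j to each
# subset s that excludes j, with weight 1 / (p * choose(p - 1, |s|)).
shapley <- sapply(seq_len(p), function(j) {
  rows <- which(!subsets[, j])
  sizes <- rowSums(subsets[rows, , drop = FALSE])
  gains <- v[rows + 2^(j - 1)] - v[rows]
  sum(gains / (p * choose(p - 1, sizes)))
})
```

By construction, the values in `shapley` sum to the difference between the predictiveness of the full model and that of the empty model. Enumerating all subsets is only feasible for small $p$, which is one reason an estimator for real data cannot simply follow this recipe.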
In some settings, there may be confounding factors that you want to adjust for in every analysis. For example, in HIV vaccine studies, we often adjust for baseline demographic variables, including age and behavioral factors. If this is the case, then the null predictiveness above can be modified to be $V(f_{0,c}, P_0)$, where $c$ is the index set of all confounders and $f_{0,c}$ is the best possible prediction function that uses only the confounders.
The three VIMs defined here may be different for a given feature of interest. Indeed, we can see this for whether or not subtype is 01_AE in the VRC01 data:
subtype_01_AE_cond
#> Variable importance estimates:
#> Estimate SE 95% CI VIMP > 0 p-value
#> s = 5 0.001778934 0.05614002 [2.19097e-30, 1] FALSE 0.4873607
subtype_01_AE_marg
#> Variable importance estimates:
#> Estimate SE 95% CI VIMP > 0 p-value
#> s = 1 0.01302083 0.1073992 [1.015723e-09, 0.9999942] FALSE 0.4517514
# note: need to look at row for s = 5
all_vim_spvim
#> Variable importance estimates:
#> Estimate SE 95% CI VIMP > 0 p-value
#> s = 1 0.000000000 0.021791723 [0, 0.021839226] FALSE 0.5000000
#> s = 2 0.011043008 0.017183290 [0, 0.044721637] FALSE 0.4557810
#> s = 3 0.010119884 0.028051877 [0, 0.065100552] FALSE 0.4600018
#> s = 4 0.012518761 0.034394897 [0, 0.079931520] FALSE 0.4509761
#> s = 5 0.000000000 0.013993958 [0, 0.025677066] FALSE 0.5000000
#> s = 6 0.000000000 0.006363417 [0, 0.004475701] FALSE 0.5000000
#> s = 7 0.000000000 0.009907345 [0, 0.017583780] FALSE 0.5000000
#> s = 8 0.015550728 0.019018377 [0, 0.052826062] FALSE 0.4379738
#> s = 9 0.000000000 0.023765091 [0, 0.035428336] FALSE 0.5000000
#> s = 10 0.000000000 0.018762896 [0, 0.033983213] FALSE 0.5000000
#> s = 11 0.014845056 0.018106896 [0, 0.050333920] FALSE 0.4407254
#> s = 12 0.034118865 0.033878081 [0, 0.100518684] FALSE 0.3683432
#> s = 13 0.000000000 0.013849446 [0, 0.021282723] FALSE 0.5000000
#> s = 14 0.007543575 0.023941056 [0, 0.054467183] FALSE 0.4699677
#> s = 15 0.000000000 0.015921007 [0, 0.009156896] FALSE 0.5000000
#> s = 16 0.043251122 0.065473115 [0, 0.171576069] FALSE 0.3454781
#> s = 17 0.035069436 0.064557522 [0, 0.161599853] FALSE 0.3735006
#> s = 18 0.044388824 0.051888790 [0, 0.146088984] FALSE 0.3366255
#> s = 19 0.001066722 0.020758848 [0, 0.041753317] FALSE 0.4957381
#> s = 20 0.028659642 0.054194553 [0, 0.134879013] FALSE 0.3933259
#> s = 21 0.017929705 0.040749285 [0, 0.097796836] FALSE 0.4308049
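This is simply because the VIMs are different population parameters. All three likely provide useful information in practice: the conditional VIM describes the importance of a feature relative to all of the other features; the marginal VIM describes its importance relative to the null model that uses no features; and the SPVIM describes its importance averaged over all possible subsets of the other features.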
To choose a VIM, identify which of these three (there may be more than one) best addresses your scientific question.