Shapley Population Variable Importance Measure (SPVIM) Estimates and Inference

Compute estimates and confidence intervals for the SPVIMs, using cross-fitting.

sp_vim(
  Y = NULL,
  X = NULL,
  V = 5,
  type = "r_squared",
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  univariate_SL.library = NULL,
  gamma = 1,
  alpha = 0.05,
  delta = 0,
  na.rm = FALSE,
  stratified = FALSE,
  verbose = FALSE,
  sample_splitting = TRUE,
  final_point_estimate = "split",
  C = rep(1, length(Y)),
  Z = NULL,
  ipc_scale = "identity",
  ipc_weights = rep(1, length(Y)),
  ipc_est_type = "aipw",
  scale = "identity",
  scale_est = TRUE,
  cross_fitted_se = TRUE,
  ...
)

Arguments

Y: the outcome.
X: the covariates. If type = "average_value", then the exposure variable should be part of X, with its name provided in exposure_name.
V: the number of folds for cross-fitting, defaults to 5. If sample_splitting = TRUE, then a special type of V-fold cross-fitting is done. See Details for a more detailed explanation.
type: the type of importance to compute; defaults to r_squared, but other supported options are auc, accuracy, deviance, and anova.
SL.library: a character vector of learners to pass to SuperLearner, if f1 and f2 are Y and X, respectively. Defaults to SL.glmnet, SL.xgboost, and SL.mean.
univariate_SL.library: (optional) a character vector of learners to pass to SuperLearner for estimating univariate regression functions. Defaults to SL.polymars.
gamma: the fraction of the sample size to use when sampling subsets (e.g., gamma = 1 samples the same number of subsets as the sample size)
alpha: the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval.
delta: the value of the \(\delta\)-null (i.e., testing if importance < \(\delta\)); defaults to 0.
na.rm: should we remove NAs in the outcome and fitted values in computation? (defaults to FALSE)
stratified: if run_regression = TRUE, then should the generated folds be stratified based on the outcome? Stratifying helps to ensure class balance across cross-validation folds. (defaults to FALSE)
verbose: should sp_vim and SuperLearner print out progress? (defaults to FALSE)
sample_splitting: should we use sample-splitting to estimate the full and reduced predictiveness? Defaults to TRUE, since inferences made using sample_splitting = FALSE will be invalid for variables with truly zero importance.
final_point_estimate: if sample splitting is used, should the final point estimates be based on only the sample-split folds used for inference ("split", the default), or should they instead be based on the full dataset ("full") or the average across the point estimates from each sample split ("average")? All three options result in valid point estimates – sample-splitting is only required for valid inference.
C: the indicator of coarsening (1 denotes observed, 0 denotes unobserved).
Z: either (i) NULL (the default, in which case the argument C above must be all ones), or (ii) a character vector specifying the variable(s) among Y and X that are thought to play a role in the coarsening mechanism. To specify the outcome, use "Y"; to specify a covariate, use a character string giving its position in X (e.g., "1" for the first column).
ipc_scale: what scale should the inverse probability weight correction be applied on (if any)? Defaults to "identity". (other options are "log" and "logit")
ipc_weights: weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]).
ipc_est_type: the type of procedure used for coarsened-at-random settings; options are "ipw" (for inverse probability weighting) or "aipw" (for augmented inverse probability weighting). Only used if C is not all equal to 1.
scale: should CIs be computed on original ("identity") or another scale? (options are "log" and "logit")
scale_est: should the point estimate be scaled to be greater than or equal to 0? Defaults to TRUE.
cross_fitted_se: should we use cross-fitting to estimate the standard errors (TRUE, the default) or not (FALSE)?
...: other arguments to the estimation tool, see "See also".
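
As an illustration of the "already inverted" convention for ipc_weights, the following is a minimal sketch that builds weights for a coarsened-at-random setting. It assumes a single hypothetical variable z drives the coarsening and uses a logistic regression to estimate the observation probability; both choices are assumptions made for this sketch, and the names z, c_obs, and pi_hat do not come from the package.

```r
# Sketch only: simulated data with a hypothetical coarsening variable z
set.seed(1)
n <- 200
z <- stats::runif(n)

# c_obs plays the role of C: 1 denotes observed, 0 denotes unobserved
c_obs <- stats::rbinom(n, 1, stats::plogis(1 + z))

# Estimate P(C = 1 | z); logistic regression is an assumption of this sketch
fit <- stats::glm(c_obs ~ z, family = stats::binomial())
pi_hat <- stats::predict(fit, type = "response")

# Invert the estimated probabilities, as the ipc_weights argument expects
ipc_weights <- 1 / pi_hat
```

Weights built this way could then be supplied alongside C and Z; any estimator of the observation probability can stand in for the logistic regression.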

Value

An object of class vim. See Details for more information.

Details

We define the SPVIM as the weighted average of the population difference in predictiveness over all subsets of features not containing feature \(j\).

This is equivalent to finding the solution to a population weighted least squares problem. This key fact allows us to estimate the SPVIM using weighted least squares, where we first sample subsets from the power set of all possible features using the Shapley sampling distribution; then use cross-fitting to obtain estimators of the predictiveness of each sampled subset; and finally, solve the least squares problem given in Williamson and Feng (2020).

See the paper by Williamson and Feng (2020) for more details on the mathematics behind this function, and the validity of the confidence intervals.
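
To make the weighted-average definition concrete, here is a small sketch that computes Shapley values by exact enumeration for p = 3 features, using the standard Shapley weights. The predictiveness function toy_v is a made-up example, and exact enumeration is used here only because p is tiny; it is not the sampling-based weighted least squares procedure that sp_vim carries out.

```r
# Sketch only: exact Shapley values for a hypothetical predictiveness function
p <- 3

# Every subset of the p features, one row per subset, coded as 0/1 indicators
subsets <- as.matrix(expand.grid(rep(list(0:1), p)))

# Hypothetical predictiveness: additive terms plus an interaction
# between features 1 and 2
toy_v <- function(s) {
  0.2 * s[[1]] + 0.1 * s[[2]] + 0.05 * s[[3]] + 0.1 * s[[1]] * s[[2]]
}

# Look up predictiveness by a string key such as "110"
key <- function(s) paste(s, collapse = "")
v <- apply(subsets, 1, toy_v)
names(v) <- apply(subsets, 1, key)

shapley <- sapply(seq_len(p), function(j) {
  # Subsets that do not contain feature j
  without_j <- subsets[subsets[, j] == 0, , drop = FALSE]
  sum(apply(without_j, 1, function(s) {
    s_with <- s
    s_with[j] <- 1
    # Shapley weight for a subset of size |s|: 1 / (p * choose(p - 1, |s|))
    w <- 1 / (p * choose(p - 1, sum(s)))
    # Weighted gain in predictiveness from adding feature j to the subset
    w * (v[[key(s_with)]] - v[[key(s)]])
  }))
})

shapley
```

In this toy setup the interaction term is split evenly between features 1 and 2, and the values sum to the difference in predictiveness between the full and empty feature sets.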

In the interest of transparency, we return most of the calculations within the vim object. This results in a list containing:

SL.library: the library of learners passed to SuperLearner
v: the estimated predictiveness measure for each sampled subset
fit_lst: the fitted values on the entire dataset from the chosen method for each sampled subset
preds_lst: the cross-fitted predicted values from the chosen method for each sampled subset
est: the estimated SPVIM value for each feature
ics: the influence functions for each sampled subset
var_v_contribs: the contributions to the variance from estimating predictiveness
var_s_contribs: the contributions to the variance from sampling subsets
ic_lst: a list of the SPVIM influence function contributions
se: the standard errors for the estimated variable importance
ci: the \((1-\alpha) \times 100\)% confidence intervals based on the variable importance estimates
p_value: p-values for the null hypothesis test of zero importance for each variable
test_statistic: the test statistic for each null hypothesis test of zero importance
test: a hypothesis testing decision for each null hypothesis test (for each variable having zero importance)
gamma: the fraction of the sample size used when sampling subsets
alpha: the level, for confidence interval calculation
delta: the delta value used for hypothesis testing
y: the outcome
ipc_weights: the weights
scale: the scale on which CIs were computed
mat: a tibble with the estimates, SEs, CIs, hypothesis testing decisions, and p-values
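
The relationship between est, se, ci, p_value, and test can be sketched with a Wald-type calculation on the identity scale. This is an illustrative assumption about how such quantities typically fit together, not the package's internal code; the numbers and the names est_toy and se_toy are hypothetical, and the one-sided null (importance at most delta) is one reading of the \(\delta\)-null described under Arguments.

```r
# Sketch only: hypothetical estimates and standard errors for two features
est_toy <- c(0.12, 0.03)
se_toy <- c(0.04, 0.05)
alpha <- 0.05
delta <- 0

# Wald-type (1 - alpha) x 100% confidence intervals on the identity scale
z <- stats::qnorm(1 - alpha / 2)
ci <- cbind(lower = est_toy - z * se_toy, upper = est_toy + z * se_toy)

# One-sided test of the null that importance is at most delta
test_statistic <- (est_toy - delta) / se_toy
p_value <- stats::pnorm(test_statistic, lower.tail = FALSE)

# Reject the null when the p-value falls below alpha
test <- p_value < alpha
```

With these inputs the first feature's interval excludes zero and its null is rejected, while the second feature's is not.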

Examples

n <- 100
p <- 2
# generate the data
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))

# apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2

# generate Y ~ Normal (smooth, 1)
y <- as.matrix(smooth + stats::rnorm(n, 0, 1))

# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
learners <- c("SL.glm")

# -----------------------------------------
# using Super Learner (with a small number of CV folds,
# for illustration only)
# -----------------------------------------
set.seed(4747)
est <- sp_vim(Y = y, X = x, V = 2, type = "r_squared",
              SL.library = learners, alpha = 0.05)
#> Warning: One or more original estimates < 0; returning zero for these indices.
