Nonparametric Intrinsic Variable Importance Estimates: R-squared

Compute estimates of and confidence intervals for nonparametric $R^2$-based intrinsic variable importance. This is a wrapper function for cv_vim, with type = "r_squared".

vimp_rsquared(
  Y = NULL,
  X = NULL,
  cross_fitted_f1 = NULL,
  cross_fitted_f2 = NULL,
  f1 = NULL,
  f2 = NULL,
  indx = 1,
  V = 10,
  run_regression = TRUE,
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  alpha = 0.05,
  delta = 0,
  na.rm = FALSE,
  final_point_estimate = "split",
  cross_fitting_folds = NULL,
  sample_splitting_folds = NULL,
  stratified = FALSE,
  C = rep(1, length(Y)),
  Z = NULL,
  ipc_weights = rep(1, length(Y)),
  scale = "logit",
  ipc_est_type = "aipw",
  scale_est = TRUE,
  cross_fitted_se = TRUE,
  ...
)

Arguments

Y: the outcome.
X: the covariates. If type = "average_value", then the exposure variable should be part of X, with its name provided in exposure_name.
cross_fitted_f1: the predicted values on validation data from a flexible estimation technique regressing Y on X in the training data. Provided as either (a) a vector, where each element is the predicted value when that observation is part of the validation fold; or (b) a list of length V, where each element in the list is a set of predictions on the corresponding validation data fold. If sample-splitting is requested, then these must be estimated specially; see Details. However, the resulting vector should be the same length as Y; if using a list, then the summed length of each element across the list should be the same length as Y (i.e., each observation is included in the predictions).
cross_fitted_f2: the predicted values on validation data from a flexible estimation technique regressing either (a) the fitted values in cross_fitted_f1, or (b) Y, on X withholding the columns in indx. Provided as either (a) a vector, where each element is the predicted value when that observation is part of the validation fold; or (b) a list of length V, where each element in the list is a set of predictions on the corresponding validation data fold. If sample-splitting is requested, then these must be estimated specially; see Details. However, the resulting vector should be the same length as Y; if using a list, then the summed length of each element across the list should be the same length as Y (i.e., each observation is included in the predictions).
f1: the fitted values from a flexible estimation technique regressing Y on X. If sample-splitting is requested, then these must be estimated specially; see Details. If cross_fitted_se = TRUE, then this argument is not used.
f2: the fitted values from a flexible estimation technique regressing either (a) f1 or (b) Y on X withholding the columns in indx. If sample-splitting is requested, then these must be estimated specially; see Details. If cross_fitted_se = TRUE, then this argument is not used.
indx: the indices of the covariate(s) to calculate variable importance for; defaults to 1.
V: the number of folds for cross-fitting, defaults to 10. If sample_splitting = TRUE, then a special type of V-fold cross-fitting is done. See Details for a more detailed explanation.
run_regression: if outcome Y and covariates X are passed to vimp_rsquared, and run_regression is TRUE, then Super Learner will be used; otherwise, variable importance will be computed using the inputted fitted values.
SL.library: a character vector of learners to pass to SuperLearner, if f1 and f2 are Y and X, respectively. Defaults to SL.glmnet, SL.xgboost, and SL.mean.
alpha: the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval.
delta: the value of the $\delta$-null (i.e., testing if importance < $\delta$); defaults to 0.
na.rm: should we remove NAs in the outcome and fitted values in computation? (defaults to FALSE)
final_point_estimate: if sample splitting is used, should the final point estimates be based on only the sample-split folds used for inference ("split", the default), or should they instead be based on the full dataset ("full") or the average across the point estimates from each sample split ("average")? All three options result in valid point estimates – sample-splitting is only required for valid inference.
cross_fitting_folds: the folds for cross-fitting. Only used if run_regression = FALSE.
sample_splitting_folds: the folds used for sample-splitting; these identify the observations that should be used to evaluate predictiveness based on the full and reduced sets of covariates, respectively. Only used if run_regression = FALSE.
stratified: if run_regression = TRUE, should the generated folds be stratified based on the outcome? Stratification helps to ensure class balance across cross-validation folds. Defaults to FALSE.
C: the indicator of coarsening (1 denotes observed, 0 denotes unobserved).
Z: either (i) NULL (the default, in which case the argument C above must be all ones), or (ii) a character vector specifying the variable(s) among Y and X that are thought to play a role in the coarsening mechanism. To specify the outcome, use "Y"; to specify covariates, use a character number corresponding to the desired position in X (e.g., "1").
ipc_weights: weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]).
scale: should CIs be computed on original ("identity") or another scale? (options are "log" and "logit")
ipc_est_type: the type of procedure used for coarsened-at-random settings; options are "ipw" (for inverse probability weighting) or "aipw" (for augmented inverse probability weighting). Only used if C is not all equal to 1.
scale_est: should the point estimate be scaled to be greater than or equal to 0? Defaults to TRUE.
cross_fitted_se: should we use cross-fitting to estimate the standard errors (TRUE, the default) or not (FALSE)?
...: other arguments to the estimation tool, see "See also".
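
The scale argument controls the scale on which confidence intervals are built. The following is a minimal sketch of a delta-method interval on the logit scale, not necessarily the exact computation used inside the package; the values of est and se are made up for illustration.

```r
# Hypothetical point estimate and standard error (illustration only)
est <- 0.30
se <- 0.05
alpha <- 0.05

# Delta method: d/dx logit(x) = 1 / (x * (1 - x))
se_logit <- se / (est * (1 - est))

# Build the interval on the logit scale, then map back to (0, 1)
z <- stats::qnorm(1 - alpha / 2)
ci_logit <- stats::plogis(stats::qlogis(est) + c(-1, 1) * z * se_logit)

# Identity-scale interval for comparison; it is not constrained to (0, 1)
ci_identity <- est + c(-1, 1) * z * se
```

An interval built this way always lies inside (0, 1), which is one reason to prefer a transformed scale for a bounded quantity such as a difference in $R^2$.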

Value

An object of classes vim and vim_rsquared. See Details for more information.

Details

We define the population variable importance measure (VIM) for the group of features (or single feature) $s$ with respect to the predictiveness measure $V$ by $$\psi_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),$$ where $f_0$ is the population predictiveness maximizing function, $f_{0,s}$ is the population predictiveness maximizing function that is only allowed to access the features with index not in $s$, and $P_0$ is the true data-generating distribution.
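
For the $R^2$ measure computed by this function, the general definition above can be written out explicitly. As a sketch under the usual definition of $R^2$ predictiveness, and writing $X_{-s}$ (notation introduced here for illustration) for the features with index not in $s$, we have $$V(f, P) = 1 - \frac{E_P[\{Y - f(X)\}^2]}{\mathrm{Var}_P(Y)},$$ with $f_0(x) = E_0(Y \mid X = x)$ and $f_{0,s}(x) = E_0(Y \mid X_{-s} = x_{-s})$, so that $$\psi_{0,s} = \frac{E_0[\{Y - f_{0,s}(X)\}^2] - E_0[\{Y - f_0(X)\}^2]}{\mathrm{Var}_0(Y)}.$$ In words, the importance of $s$ is the proportion of the outcome variance that is explained by the full set of features but not by the reduced set.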

Cross-fitted VIM estimates are computed differently if sample-splitting is requested versus if it is not. We recommend using sample-splitting in most cases, since only in this case will inferences be valid if the variable(s) of interest have truly zero population importance. The purpose of cross-fitting is to estimate $f_0$ and $f_{0,s}$ on independent data from estimating $P_0$; this can result in improved performance, especially when using flexible learning algorithms. The purpose of sample-splitting is to estimate $f_0$ and $f_{0,s}$ on independent data; this allows valid inference under the null hypothesis of zero importance.

Without sample-splitting, cross-fitted VIM estimates are obtained by first splitting the data into $K$ folds; then, using each fold in turn as a hold-out set, constructing estimators $f_{n,k}$ and $f_{n,k,s}$ of $f_0$ and $f_{0,s}$, respectively, on the training data and an estimator $P_{n,k}$ of $P_0$ using the test data; and finally, computing $$\psi_{n,s} := K^{-1}\sum_{k=1}^K \{V(f_{n,k},P_{n,k}) - V(f_{n,k,s}, P_{n,k})\}.$$
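
The procedure without sample-splitting can be sketched in base R as follows. This is an illustration only, not the package's implementation: lm() stands in for a flexible learner, the helper r_squared() is hypothetical, and the simulated data mirror the Examples section.

```r
set.seed(4747)
n <- 200
x <- data.frame(x1 = stats::runif(n, -5, 5), x2 = stats::runif(n, -5, 5))
y <- (x$x1 / 5)^2 * (x$x1 + 7) / 5 + (x$x2 / 3)^2 + stats::rnorm(n)
dat <- cbind(y = y, x)

s <- 2  # index (within x) of the feature of interest
K <- 5  # number of cross-fitting folds
folds <- sample(rep(seq_len(K), length.out = n))

# Hypothetical helper: empirical R^2 predictiveness on held-out data
r_squared <- function(obs, pred) {
  1 - mean((obs - pred)^2) / mean((obs - mean(obs))^2)
}

# For each fold: fit on the training data, evaluate on the held-out fold
fold_diffs <- vapply(seq_len(K), function(k) {
  train <- dat[folds != k, , drop = FALSE]
  test <- dat[folds == k, , drop = FALSE]
  full_fit <- stats::lm(y ~ ., data = train)
  # column s of x is column s + 1 of dat (column 1 is y)
  red_fit <- stats::lm(y ~ ., data = train[, -(s + 1), drop = FALSE])
  r_squared(test$y, stats::predict(full_fit, newdata = test)) -
    r_squared(test$y, stats::predict(red_fit, newdata = test))
}, numeric(1))

# Average the fold-specific differences in predictiveness
psi_n <- mean(fold_diffs)
```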

With sample-splitting, cross-fitted VIM estimates are obtained by first splitting the data into $2K$ folds. These folds are further divided into 2 groups of folds. Then, for each fold $k$ in the first group, an estimator $f_{n,k}$ of $f_0$ is constructed using all data besides the $k$th fold in the group (i.e., $(2K - 1)/(2K)$ of the data) and an estimator $P_{n,k}$ of $P_0$ is constructed using the held-out data (i.e., $1/(2K)$ of the data); we then compute $$v_{n,k} = V(f_{n,k},P_{n,k}).$$ Similarly, for each fold $k$ in the second group, an estimator $f_{n,k,s}$ of $f_{0,s}$ is constructed using all data besides the $k$th fold in the group (i.e., $(2K - 1)/(2K)$ of the data) and an estimator $P_{n,k}$ of $P_0$ is constructed using the held-out data (i.e., $1/(2K)$ of the data); we then compute $$v_{n,k,s} = V(f_{n,k,s},P_{n,k}).$$ Finally, $$\psi_{n,s} := K^{-1}\sum_{k=1}^K \{v_{n,k} - v_{n,k,s}\}.$$
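
A base-R sketch of the sample-splitting variant follows. It is an illustration under stated assumptions rather than the package's implementation: lm() stands in for a flexible learner, r_squared() is a hypothetical helper, folds 1 to $K$ are assigned to the first group and folds $K + 1$ to $2K$ to the second, and each fit uses all data outside the held-out fold, following the fractions given above.

```r
set.seed(1234)
n <- 240
x <- data.frame(x1 = stats::runif(n, -5, 5), x2 = stats::runif(n, -5, 5))
y <- (x$x1 / 5)^2 * (x$x1 + 7) / 5 + (x$x2 / 3)^2 + stats::rnorm(n)
dat <- cbind(y = y, x)

s <- 2  # index (within x) of the feature of interest
K <- 3  # folds per group, so 2K folds overall
folds <- sample(rep(seq_len(2 * K), length.out = n))
group_full <- seq_len(K)      # folds used to evaluate the full regression
group_red <- K + seq_len(K)   # folds used to evaluate the reduced regression

# Hypothetical helper: empirical R^2 predictiveness on held-out data
r_squared <- function(obs, pred) {
  1 - mean((obs - pred)^2) / mean((obs - mean(obs))^2)
}

# v_{n,k}: full regression, evaluated on each held-out fold of group 1
v_full <- vapply(group_full, function(k) {
  fit <- stats::lm(y ~ ., data = dat[folds != k, , drop = FALSE])
  held <- dat[folds == k, , drop = FALSE]
  r_squared(held$y, stats::predict(fit, newdata = held))
}, numeric(1))

# v_{n,k,s}: reduced regression (feature s withheld), evaluated on group 2
v_red <- vapply(group_red, function(k) {
  fit <- stats::lm(y ~ ., data = dat[folds != k, -(s + 1), drop = FALSE])
  held <- dat[folds == k, , drop = FALSE]
  r_squared(held$y, stats::predict(fit, newdata = held))
}, numeric(1))

# Average the differences across the K pairs of folds
psi_n <- mean(v_full - v_red)
```

Because the full and reduced predictiveness values are evaluated on disjoint sets of observations, the two terms of the difference are computed from independent data, which is the property the text relies on for inference under the null.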

See the paper by Williamson, Gilbert, Simon, and Carone for more details on the mathematics behind the cv_vim function, and the validity of the confidence intervals.

In the interest of transparency, we return most of the calculations within the vim object. This results in a list including:

s: the column(s) to calculate variable importance for
SL.library: the library of learners passed to SuperLearner
full_fit: the fitted values of the chosen method fit to the full data (a list, for train and test data)
red_fit: the fitted values of the chosen method fit to the reduced data (a list, for train and test data)
est: the estimated variable importance
naive: the naive estimator of variable importance
eif: the estimated efficient influence function
eif_full: the estimated efficient influence function for the full regression
eif_reduced: the estimated efficient influence function for the reduced regression
se: the standard error for the estimated variable importance
ci: the $(1-\alpha) \times 100$% confidence interval for the variable importance estimate
test: a decision to either reject (TRUE) or not reject (FALSE) the null hypothesis, based on a conservative test
p_value: a p-value based on the same test as test
full_mod: the object returned by the estimation procedure for the full data regression (if applicable)
red_mod: the object returned by the estimation procedure for the reduced data regression (if applicable)
alpha: the level, for confidence interval calculation
sample_splitting_folds: the folds used for hypothesis testing
cross_fitting_folds: the folds used for cross-fitting
y: the outcome
ipc_weights: the weights
cluster_id: the cluster IDs
mat: a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value

Examples

# generate the data
# generate X
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))

# apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2

# generate Y ~ Normal (smooth, 1)
y <- smooth + stats::rnorm(n, 0, 1)

# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
learners <- c("SL.glm", "SL.mean")

# estimate (with a small number of folds, for illustration only)
est <- vimp_rsquared(y, x, indx = 2,
           alpha = 0.05, run_regression = TRUE,
           SL.library = learners, V = 2, cvControl = list(V = 2))
