`R/vimp_rsquared.R`

`vimp_rsquared.Rd`

Compute estimates of and confidence intervals for nonparametric $R^2$-based
intrinsic variable importance. This is a wrapper function for `cv_vim`, with `type = "r_squared"`.

```
vimp_rsquared(
  Y = NULL,
  X = NULL,
  cross_fitted_f1 = NULL,
  cross_fitted_f2 = NULL,
  f1 = NULL,
  f2 = NULL,
  indx = 1,
  V = 10,
  run_regression = TRUE,
  SL.library = c("SL.glmnet", "SL.xgboost", "SL.mean"),
  alpha = 0.05,
  delta = 0,
  na.rm = FALSE,
  final_point_estimate = "split",
  cross_fitting_folds = NULL,
  sample_splitting_folds = NULL,
  stratified = FALSE,
  C = rep(1, length(Y)),
  Z = NULL,
  ipc_weights = rep(1, length(Y)),
  scale = "logit",
  ipc_est_type = "aipw",
  scale_est = TRUE,
  cross_fitted_se = TRUE,
  ...
)
```

- Y
the outcome.

- X
the covariates. If `type = "average_value"`, then the exposure variable should be part of `X`, with its name provided in `exposure_name`.

- cross_fitted_f1
the predicted values on validation data from a flexible estimation technique regressing Y on X in the training data. Provided as either (a) a vector, where each element is the predicted value when that observation is part of the validation fold; or (b) a list of length V, where each element in the list is a set of predictions on the corresponding validation data fold. If sample-splitting is requested, then these must be estimated specially; see Details. However, the resulting vector should be the same length as `Y`; if using a list, then the summed length of each element across the list should be the same length as `Y` (i.e., each observation is included in the predictions).

- cross_fitted_f2
the predicted values on validation data from a flexible estimation technique regressing either (a) the fitted values in `cross_fitted_f1`, or (b) Y, on X withholding the columns in `indx`. Provided as either (a) a vector, where each element is the predicted value when that observation is part of the validation fold; or (b) a list of length V, where each element in the list is a set of predictions on the corresponding validation data fold. If sample-splitting is requested, then these must be estimated specially; see Details. However, the resulting vector should be the same length as `Y`; if using a list, then the summed length of each element across the list should be the same length as `Y` (i.e., each observation is included in the predictions).

- f1
the fitted values from a flexible estimation technique regressing Y on X. If sample-splitting is requested, then these must be estimated specially; see Details. If `cross_fitted_se = TRUE`, then this argument is not used.

- f2
the fitted values from a flexible estimation technique regressing either (a) `f1` or (b) Y on X withholding the columns in `indx`. If sample-splitting is requested, then these must be estimated specially; see Details. If `cross_fitted_se = TRUE`, then this argument is not used.

- indx
the indices of the covariate(s) to calculate variable importance for; defaults to 1.

- V
the number of folds for cross-fitting; defaults to 10. If `sample_splitting = TRUE`, then a special type of `V`-fold cross-fitting is done. See Details for a more detailed explanation.

- run_regression
if outcome Y and covariates X are passed to `vimp_rsquared`, and `run_regression` is `TRUE`, then Super Learner will be used; otherwise, variable importance will be computed using the inputted fitted values.

- SL.library
a character vector of learners to pass to `SuperLearner`, if `f1` and `f2` are Y and X, respectively. Defaults to `SL.glmnet`, `SL.xgboost`, and `SL.mean`.

- alpha
the level to compute the confidence interval at. Defaults to 0.05, corresponding to a 95% confidence interval.

- delta
the value of the \(\delta\)-null (i.e., testing if importance < \(\delta\)); defaults to 0.

- na.rm
should we remove NAs in the outcome and fitted values in computation? (defaults to `FALSE`)

- final_point_estimate
if sample splitting is used, should the final point estimates be based on only the sample-split folds used for inference (`"split"`, the default), or should they instead be based on the full dataset (`"full"`) or the average across the point estimates from each sample split (`"average"`)? All three options result in valid point estimates; sample-splitting is only required for valid inference.

- cross_fitting_folds
the folds for cross-fitting. Only used if `run_regression = FALSE`.

- sample_splitting_folds
the folds used for sample-splitting; these identify the observations that should be used to evaluate predictiveness based on the full and reduced sets of covariates, respectively. Only used if `run_regression = FALSE`.

- stratified
if `run_regression = TRUE`, then should the generated folds be stratified based on the outcome (helps to ensure class balance across cross-validation folds)?

- C
the indicator of coarsening (1 denotes observed, 0 denotes unobserved).

- Z
either (i) NULL (the default, in which case the argument `C` above must be all ones), or (ii) a character vector specifying the variable(s) among Y and X that are thought to play a role in the coarsening mechanism. To specify the outcome, use `"Y"`; to specify covariates, use a character number corresponding to the desired position in X (e.g., `"1"`).

- ipc_weights
weights for the computed influence curve (i.e., inverse probability weights for coarsened-at-random settings). Assumed to be already inverted (i.e., ipc_weights = 1 / [estimated probability weights]).

- scale
should CIs be computed on original ("identity") or another scale? (options are "log" and "logit")

- ipc_est_type
the type of procedure used for coarsened-at-random settings; options are "ipw" (for inverse probability weighting) or "aipw" (for augmented inverse probability weighting). Only used if `C` is not all equal to 1.

- scale_est
should the point estimate be scaled to be greater than or equal to 0? Defaults to `TRUE`.

- cross_fitted_se
should we use cross-fitting to estimate the standard errors (`TRUE`, the default) or not (`FALSE`)?

- ...
other arguments to the estimation tool, see "See also".

An object of classes `vim` and `vim_rsquared`. See Details for more information.

We define the population variable importance measure (VIM) for the group of features (or single feature) \(s\) with respect to the predictiveness measure \(V\) by $$\psi_{0,s} := V(f_0, P_0) - V(f_{0,s}, P_0),$$ where \(f_0\) is the population predictiveness maximizing function, \(f_{0,s}\) is the population predictiveness maximizing function that is only allowed to access the features with index not in \(s\), and \(P_0\) is the true data-generating distribution.
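For the $R^2$ case documented here, the definition above can be made concrete. The display below is a sketch under the assumption (not stated on this page) that the predictiveness measure is the usual population $R^2$ and that the maximizers are conditional means, with \(X_{-s}\) denoting the features whose index is not in \(s\): $$V(f, P) = 1 - \frac{E_P[\{Y - f(X)\}^2]}{\mathrm{Var}_P(Y)}, \qquad f_0(x) = E_{P_0}(Y \mid X = x), \qquad f_{0,s}(x) = E_{P_0}(Y \mid X_{-s} = x_{-s}).$$ Under that assumption, the importance is the drop in explained variance from withholding the features in \(s\): $$\psi_{0,s} = \frac{E_{P_0}[\{Y - f_{0,s}(X)\}^2] - E_{P_0}[\{Y - f_0(X)\}^2]}{\mathrm{Var}_{P_0}(Y)}.$$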

Cross-fitted VIM estimates are computed differently if sample-splitting is requested versus if it is not. We recommend using sample-splitting in most cases, since only in this case will inferences be valid if the variable(s) of interest have truly zero population importance. The purpose of cross-fitting is to estimate \(f_0\) and \(f_{0,s}\) on independent data from estimating \(P_0\); this can result in improved performance, especially when using flexible learning algorithms. The purpose of sample-splitting is to estimate \(f_0\) and \(f_{0,s}\) on independent data; this allows valid inference under the null hypothesis of zero importance.

Without sample-splitting, cross-fitted VIM estimates are obtained by first splitting the data into \(K\) folds; then, using each fold in turn as a hold-out set, constructing estimators \(f_{n,k}\) and \(f_{n,k,s}\) of \(f_0\) and \(f_{0,s}\), respectively, on the training data and an estimator \(P_{n,k}\) of \(P_0\) on the test data; and finally, computing $$\psi_{n,s} := K^{-1}\sum_{k=1}^K \{V(f_{n,k},P_{n,k}) - V(f_{n,k,s}, P_{n,k})\}.$$
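The procedure without sample-splitting can be sketched in base R. This is an illustration only, not the package's implementation: `lm()` stands in for a flexible learner, the simulated data and the helper `r_squared()` are hypothetical, and the empirical $R^2$ on each hold-out fold plays the role of \(V(\cdot, P_{n,k})\).

```r
# Illustrative sketch: cross-fitted R^2 importance without sample-splitting.
# Assumptions: lm() replaces a flexible learner; data are simulated.
set.seed(1)
n <- 200
x <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
y <- x$x1 + 2 * x$x2 + rnorm(n)
s <- 2   # index of the feature whose importance we want
K <- 5   # number of cross-fitting folds
folds <- sample(rep(seq_len(K), length.out = n))

# hypothetical helper: empirical R^2 of predictions on a hold-out fold
r_squared <- function(y, preds) {
  1 - mean((y - preds)^2) / mean((y - mean(y))^2)
}

fold_diffs <- vapply(seq_len(K), function(k) {
  train <- folds != k
  test <- folds == k
  dat_train <- cbind(y = y[train], x[train, , drop = FALSE])
  # full regression uses all features; reduced regression drops column s
  full_fit <- lm(y ~ ., data = dat_train)
  red_fit <- lm(y ~ ., data = dat_train[, -(s + 1), drop = FALSE])
  x_test <- x[test, , drop = FALSE]
  v_full <- r_squared(y[test], predict(full_fit, newdata = x_test))
  v_red <- r_squared(y[test], predict(red_fit, newdata = x_test))
  v_full - v_red
}, numeric(1))

# average the fold-specific differences, as in the formula above
psi_hat <- mean(fold_diffs)
```

Each regression is fit on the training folds only, so predictiveness is always evaluated on data the fit has not seen.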

With sample-splitting, cross-fitted VIM estimates are obtained by first splitting the data into \(2K\) folds. These folds are further divided into 2 groups of folds. Then, for each fold \(k\) in the first group, estimator \(f_{n,k}\) of \(f_0\) is constructed using all data besides the \(k\)th fold in the group (i.e., \((2K - 1)/(2K)\) of the data) and estimator \(P_{n,k}\) of \(P_0\) is constructed using the held-out data (i.e., \(1/(2K)\) of the data); then, computing $$v_{n,k} = V(f_{n,k},P_{n,k}).$$ Similarly, for each fold \(k\) in the second group, estimator \(f_{n,k,s}\) of \(f_{0,s}\) is constructed using all data besides the \(k\)th fold in the group (i.e., \((2K - 1)/(2K)\) of the data) and estimator \(P_{n,k}\) of \(P_0\) is constructed using the held-out data (i.e., \(1/(2K)\) of the data); then, computing $$v_{n,k,s} = V(f_{n,k,s},P_{n,k}).$$ Finally, $$\psi_{n,s} := K^{-1}\sum_{k=1}^K \{v_{n,k} - v_{n,k,s}\}.$$
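The sample-split version can be sketched the same way. Again this is a hedged illustration rather than the package's code: `lm()` stands in for a flexible learner, the data are simulated, and the assignment of folds `1..K` to the full regression and folds `K+1..2K` to the reduced regression is an assumption made for clarity.

```r
# Illustrative sketch: cross-fitted R^2 importance with sample-splitting.
# Assumptions: lm() replaces a flexible learner; the grouping of folds
# (first K for the full fit, last K for the reduced fit) is hypothetical.
set.seed(2)
n <- 400
x <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
y <- x$x1 + 2 * x$x2 + rnorm(n)
dat <- cbind(y = y, x)
s <- 2   # index of the feature whose importance we want
K <- 2   # folds per group, so 2K folds in total
folds <- sample(rep(seq_len(2 * K), length.out = n))
group_full <- seq_len(K)
group_red <- K + seq_len(K)

# hypothetical helper: empirical R^2 of predictions on a hold-out fold
r_squared <- function(y, preds) {
  1 - mean((y - preds)^2) / mean((y - mean(y))^2)
}

# v_{n,k}: full regression, evaluated on held-out folds in the first group
v_full <- vapply(group_full, function(k) {
  fit <- lm(y ~ ., data = dat[folds != k, ])
  r_squared(y[folds == k], predict(fit, newdata = dat[folds == k, ]))
}, numeric(1))

# v_{n,k,s}: reduced regression (column s withheld), second group
v_red <- vapply(group_red, function(k) {
  fit <- lm(y ~ ., data = dat[folds != k, -(s + 1)])
  r_squared(y[folds == k], predict(fit, newdata = dat[folds == k, ]))
}, numeric(1))

# average of pairwise differences, as in the final formula above
psi_hat_split <- mean(v_full - v_red)
```

Because the full and reduced predictiveness values are evaluated on disjoint sets of observations, the two averages are computed on independent data, which is the property the text credits for valid inference under zero importance.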

See the paper by Williamson, Gilbert, Simon, and Carone for more
details on the mathematics behind the `cv_vim` function, and the
validity of the confidence intervals.

In the interest of transparency, we return most of the calculations
within the `vim` object. This results in a list including:

- s
the column(s) to calculate variable importance for

- SL.library
the library of learners passed to `SuperLearner`

- full_fit
the fitted values of the chosen method fit to the full data (a list, for train and test data)

- red_fit
the fitted values of the chosen method fit to the reduced data (a list, for train and test data)

- est
the estimated variable importance

- naive
the naive estimator of variable importance

- eif
the estimated efficient influence function

- eif_full
the estimated efficient influence function for the full regression

- eif_reduced
the estimated efficient influence function for the reduced regression

- se
the standard error for the estimated variable importance

- ci
the \((1-\alpha) \times 100\)% confidence interval for the variable importance estimate

- test
a decision to either reject (TRUE) or not reject (FALSE) the null hypothesis, based on a conservative test

- p_value
a p-value based on the same test as `test`

- full_mod
the object returned by the estimation procedure for the full data regression (if applicable)

- red_mod
the object returned by the estimation procedure for the reduced data regression (if applicable)

- alpha
the level, for confidence interval calculation

- sample_splitting_folds
the folds used for hypothesis testing

- cross_fitting_folds
the folds used for cross-fitting

- y
the outcome

- ipc_weights
the weights

- cluster_id
the cluster IDs

- mat
a tibble with the estimate, SE, CI, hypothesis testing decision, and p-value

`SuperLearner` for specific usage of the `SuperLearner` function and package.

```
# generate the data
# generate X
p <- 2
n <- 100
x <- data.frame(replicate(p, stats::runif(n, -5, 5)))
# apply the function to the x's
smooth <- (x[,1]/5)^2*(x[,1]+7)/5 + (x[,2]/3)^2
# generate Y ~ Normal (smooth, 1)
y <- smooth + stats::rnorm(n, 0, 1)
# set up a library for SuperLearner; note simple library for speed
library("SuperLearner")
learners <- c("SL.glm", "SL.mean")
# estimate (with a small number of folds, for illustration only)
est <- vimp_rsquared(y, x, indx = 2,
                     alpha = 0.05, run_regression = TRUE,
                     SL.library = learners, V = 2, cvControl = list(V = 2))
#> Warning: Original estimate < 0; returning zero.
```