Perform extrinsic, ensemble-based variable selection

Based on a fitted Super Learner ensemble, extract extrinsic variable importance estimates, rank them, and select variables using the specified rank threshold.

extrinsic_selection(
  fit = NULL,
  feature_names = "",
  threshold = 20,
  import_type = "all",
  ...
)

Arguments

fit: the fitted Super Learner ensemble.
feature_names: the names of the features, given as a character vector of length p (the total number of features); only used if the Super Learner ensemble was fit on a matrix rather than on a data.frame, tibble, etc.
threshold: the threshold for selection based on ranked variable importance; rank 1 is the most important. Defaults to 20, which is arbitrary; specify a value suited to the task at hand.
import_type: the type of extrinsic importance: either "all" (the default), for a weighted combination of the individual-algorithm importance; or "best", for the importance from the algorithm with the highest weight in the Super Learner.
...: other arguments to pass to algorithm-specific importance extractors.
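
To make the two import_type options and the rank threshold concrete, here is a minimal base R sketch. It is not the package's internal code: the weights, the per-algorithm ranks, the choice to combine ranks (rather than raw importance), and the selection rule rank <= threshold are all assumptions made for illustration.

```r
# Hypothetical ensemble weights for two algorithms (assumed values)
weights <- c(glm = 0.75, mean = 0.25)

# Hypothetical per-algorithm importance ranks; rows are features x1..x3
ranks <- cbind(
  glm  = c(x1 = 1, x2 = 2, x3 = 3),
  mean = c(x1 = 2, x2 = 2, x3 = 2)
)

# "all": weighted combination across algorithms (assumed to act on ranks)
rank_all <- drop(ranks %*% weights)

# "best": use only the algorithm with the highest ensemble weight
rank_best <- ranks[, names(which.max(weights))]

# Assumed selection rule: keep features ranked at or below the threshold
threshold <- 2
selected_all <- rank_all <= threshold
```

Under these assumptions, x1 and x2 would be selected with import_type = "all", while x3 would not.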

Value

a tibble with the estimated extrinsic variable importance, the corresponding variable importance ranks, and the selected variables.
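
As a small sketch of how the returned object might be used, the selected feature names can be pulled out with ordinary subsetting. The data frame below is a hypothetical stand-in with the same column names as the printed output in the Examples (feature, rank, selected); its values are invented.

```r
# Hypothetical stand-in for the returned tibble (values are made up)
res <- data.frame(
  feature = c("feature_a", "feature_b", "feature_c"),
  rank = c(1, 2, 3),
  selected = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Names of the features flagged as selected
selected_features <- res$feature[res$selected]
```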

Examples

data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# get the fit (using a simple library and 2 folds for illustration only)
library("SuperLearner")
set.seed(20231129)
fit <- SuperLearner::SuperLearner(Y = y, X = x, SL.library = c("SL.glm", "SL.mean"), 
                                  cvControl = list(V = 2))
#> Warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
#> Warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
# extract importance
importance <- extrinsic_selection(fit = fit, feature_names = feature_nms, threshold = 1.5, 
                                  import_type = "all")
importance
#> # A tibble: 22 × 3
#>    feature                          rank selected
#>    <chr>                           <dbl> <lgl>   
#>  1 cea                              11.5 FALSE   
#>  2 cea_call                         11.5 FALSE   
#>  3 institution                      11.5 FALSE   
#>  4 lab1_actb                        11.5 FALSE   
#>  5 lab1_molecules_neoplasia_call    11.5 FALSE   
#>  6 lab1_molecules_score             11.5 FALSE   
#>  7 lab1_telomerase_neoplasia_call   11.5 FALSE   
#>  8 lab1_telomerase_score            11.5 FALSE   
#>  9 lab2_fluorescence_mucinous_call  11.5 FALSE   
#> 10 lab2_fluorescence_score          11.5 FALSE   
#> # ℹ 12 more rows
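
The constant rank of 11.5 above is what average-tie ranking produces when all 22 features are tied, since (1 + 22) / 2 = 11.5. The base R sketch below reproduces that arithmetic. It assumes ties are averaged and that selection compares the rank to the threshold with <=; neither is confirmed here as the package's internal behavior.

```r
# 22 features with identical (tied) importance, as an illustration
tied_importance <- rep(0, 22)

# Average-tie ranking gives every tied feature the mean of ranks 1..22
tied_ranks <- rank(-tied_importance, ties.method = "average")
unique(tied_ranks)
#> [1] 11.5

# With a threshold of 1.5, no tied feature is at or below the threshold
any(tied_ranks <= 1.5)
#> [1] FALSE
```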
