R/extrinsic_selection.R
extrinsic_selection.Rd
Based on a fitted Super Learner ensemble, extract extrinsic variable importance estimates, rank them, and perform variable selection using the specified rank threshold.
extrinsic_selection(
  fit = NULL,
  feature_names = "",
  threshold = 20,
  import_type = "all",
  ...
)
fit: the fitted Super Learner ensemble.
feature_names: the names of the features (a character vector of length p, the total number of features); only used if the fitted Super Learner ensemble was fit on a matrix rather than on a data.frame, tibble, etc.
threshold: the threshold for selection based on ranked variable importance; rank 1 is the most important. Defaults to 20, but this value is arbitrary and should be chosen for the task at hand.
import_type: the type of extrinsic importance: either "all" (the default), for a weighted combination of the individual-algorithm importance, or "best", for the importance from the algorithm with the highest weight in the Super Learner.
...: other arguments to pass to algorithm-specific importance extractors.
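The difference between the two import_type options can be illustrated with a small sketch. It is not the package's implementation: the learner names, per-learner ranks, and weights below are hypothetical, and combining ranks (rather than raw importance scores) is an assumption made only for illustration. In a real SuperLearner fit the ensemble weights are available as fit$coef.

```r
# Hypothetical per-learner importance ranks (rows: features, columns: learners).
# These values are made up for illustration; they are not package output.
ranks_by_learner <- matrix(
  c(1, 2, 3,   # column 1: ranks from SL.glm
    2, 1, 3),  # column 2: ranks from SL.ranger
  ncol = 2,
  dimnames = list(c("x1", "x2", "x3"), c("SL.glm", "SL.ranger"))
)

# Hypothetical ensemble weights, one per learner.
sl_weights <- c(SL.glm = 0.25, SL.ranger = 0.75)

# "all": weight each learner's ranks by its ensemble weight and sum.
combined_all <- drop(ranks_by_learner %*% sl_weights)

# "best": keep only the learner with the largest ensemble weight.
best_learner <- names(which.max(sl_weights))
combined_best <- ranks_by_learner[, best_learner]

combined_all
combined_best
```

Under these made-up inputs, "all" blends both learners, while "best" simply reproduces the ranks of the highest-weight learner.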
a tibble with the estimated extrinsic variable importance, the corresponding variable importance ranks, and the selected variables.
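As a rough sketch of how a rank threshold turns importance into a selection, the following uses base R only. The importance values and feature names are hypothetical, the comparison rank <= threshold is an assumption about how the cutoff is applied, and a data.frame stands in for the tibble that the function returns.

```r
# Hypothetical importance scores (larger = more important); not package output.
importance <- c(age = 0.40, bmi = 0.10, sex = 0.25, dose = 0.05)

# Rank so that rank 1 is the most important feature.
ranks <- rank(-importance)

# Assumption: a feature is selected when its rank is at or below the threshold.
threshold <- 2
result <- data.frame(
  feature  = names(importance),
  rank     = unname(ranks),
  selected = unname(ranks <= threshold)
)

# Show the features from most to least important.
result[order(result$rank), ]
```

With a threshold of 2, only the two top-ranked features (age and sex) are flagged as selected in this toy example.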
See SuperLearner for specific usage of the SuperLearner function and package.
data("biomarkers")
# subset to complete cases for illustration
cc <- complete.cases(biomarkers)
dat_cc <- biomarkers[cc, ]
# use only the mucinous outcome, not the high-malignancy outcome
y <- dat_cc$mucinous
x <- dat_cc[, !(names(dat_cc) %in% c("mucinous", "high_malignancy"))]
feature_nms <- names(x)
# get the fit (using a simple library and 2 folds for illustration only)
library("SuperLearner")
set.seed(20231129)
fit <- SuperLearner::SuperLearner(Y = y, X = x, SL.library = c("SL.glm", "SL.mean"),
                                  cvControl = list(V = 2))
#> Warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
#> Warning: prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
# extract importance
importance <- extrinsic_selection(fit = fit, feature_names = feature_nms, threshold = 1.5,
                                  import_type = "all")
importance
#> # A tibble: 22 × 3
#>    feature                          rank selected
#>    <chr>                           <dbl> <lgl>
#>  1 cea                              11.5 FALSE
#>  2 cea_call                         11.5 FALSE
#>  3 institution                      11.5 FALSE
#>  4 lab1_actb                        11.5 FALSE
#>  5 lab1_molecules_neoplasia_call    11.5 FALSE
#>  6 lab1_molecules_score             11.5 FALSE
#>  7 lab1_telomerase_neoplasia_call   11.5 FALSE
#>  8 lab1_telomerase_score            11.5 FALSE
#>  9 lab2_fluorescence_mucinous_call  11.5 FALSE
#> 10 lab2_fluorescence_score          11.5 FALSE
#> # ℹ 12 more rows
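One way to read the output above: a rank of 11.5 is the average of the ranks 1 through 22, which is what base R's rank() assigns when all 22 values are tied. The sketch below reproduces that arithmetic. That the package ranks with average tie handling, and that it selects with rank <= threshold, are assumptions; the sketch does not explain why the importance estimates are tied in this fit.

```r
# Assumption: all 22 features receive the same importance (here, zero).
tied_importance <- rep(0, 22)

# Base R's rank() averages tied ranks by default (ties.method = "average").
tied_ranks <- rank(-tied_importance)
unique(tied_ranks)
#> [1] 11.5

# With a threshold of 1.5, no tied feature passes (assuming rank <= threshold).
any(tied_ranks <= 1.5)
#> [1] FALSE
```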