flevr
flevr is an R package for doing variable selection based on flexible
ensembles. The package provides functions for extrinsic variable
selection using the Super Learner and for intrinsic variable selection
using the Shapley Population Variable Importance Measure (SPVIM).
The author and maintainer of the flevr package is Brian Williamson. For
details on the method, check out our preprint.
You can install a development release of flevr from GitHub via devtools
by running the following code:
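A minimal sketch of that install step, assuming the repository path is `bdwilliamson/flevr` (the path is not stated above, so substitute the actual one if it differs):

```r
# install devtools first if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# ASSUMPTION: the GitHub repository path below is illustrative
devtools::install_github(repo = "bdwilliamson/flevr")
```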
This section should serve as a quick guide to using the flevr package;
we will cover the main functions for doing extrinsic and intrinsic
variable selection using a simulated data example. More details are given
in the specific vignettes for extrinsic selection and intrinsic selection.
First, we create some data:
# generate the data -- note that this is a simple setting, for speed
set.seed(4747)
p <- 2
n <- 500
# generate features
x <- replicate(p, stats::rnorm(n, 0, 1))
x_df <- as.data.frame(x)
x_names <- names(x_df)
# generate outcomes
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)
This creates a matrix of covariates x with 2 columns and a vector y of
normally-distributed outcome values for a sample of n = 500 study
participants.
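As a quick sanity check on the simulated data, the sketch below regenerates it and fits an ordinary linear regression. The object name `check_fit` is introduced here for illustration only. The fitted coefficients should land near the generating values (intercept 1, slopes 0.5 and 0.75), up to sampling noise.

```r
# regenerate the simulated data so this check is self-contained
set.seed(4747)
p <- 2
n <- 500
x <- replicate(p, stats::rnorm(n, 0, 1))
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)

# dimensions of the covariate matrix: n rows, p columns
dim(x)

# ordinary least squares fit; `check_fit` is a name used only in this sketch
check_fit <- stats::lm(y ~ x)
round(stats::coef(check_fit), 2)
```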
There are two main types of variable selection available in flevr:
extrinsic and intrinsic. Extrinsic selection is the
most common type of variable selection: in this approach, a given
algorithm (and perhaps its associated algorithm-specific variable
importance) is used for variable selection. The lasso is a widely-used
example of extrinsic selection. Intrinsic selection, on the other hand,
uses estimated intrinsic variable importance (a population quantity) to
perform variable selection. This intrinsic importance is both defined
and estimated in a model-agnostic manner.
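To make the lasso example concrete, here is a minimal sketch of extrinsic selection with a single algorithm. It assumes the glmnet package is installed; glmnet is not part of this guide, and the names `cv_fit`, `lasso_coefs`, and `lasso_selected` are introduced for illustration. A feature counts as selected when its lasso coefficient is non-zero at the cross-validated penalty.

```r
# simulated design matching the one used in this guide
set.seed(4747)
p <- 2
n <- 500
x <- replicate(p, stats::rnorm(n, 0, 1))
colnames(x) <- paste0("V", seq_len(p))
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)

# choose the lasso penalty by cross-validation (alpha = 1 is the lasso)
cv_fit <- glmnet::cv.glmnet(x = x, y = y, alpha = 1)

# coefficients at the CV-optimal penalty, dropping the intercept row
lasso_coefs <- as.matrix(stats::coef(cv_fit, s = "lambda.min"))[-1, 1]

# features with a non-zero coefficient are the selected set
lasso_selected <- names(lasso_coefs)[lasso_coefs != 0]
lasso_selected
```

Here the selection rule is tied to one algorithm and its own notion of importance (non-zero coefficients), which is what makes it extrinsic.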
We recommend using the Super Learner (vanderlaan2007) to do extrinsic
variable selection to protect against model misspecification; more
details on this procedure are available in the vignette on extrinsic
selection. This requires specifying a library of candidate learners
(e.g., lasso, random forests). We can do this in flevr using the
following code:
set.seed(1234)
# fit a Super Learner ensemble; note its simplicity, for speed
library("SuperLearner")
library("flevr")
learners <- c("SL.glm", "SL.mean")
V <- 2
fit <- SuperLearner::SuperLearner(Y = y, X = x_df,
                                  SL.library = learners,
                                  cvControl = list(V = V))
# extract importance based on the whole Super Learner
sl_importance_all <- extract_importance_SL(
  fit = fit, feature_names = x_names, import_type = "all"
)
sl_importance_all
#> # A tibble: 2 × 2
#>   feature  rank
#>   <chr>   <dbl>
#> 1 V2       1.01
#> 2 V1       1.99
These results suggest that feature 2 is more important than feature 1 within the Super Learner ensemble (since a lower rank is better). If we want to scrutinize the importance of features within the best-fitting algorithm in the Super Learner ensemble, we can do the following:
sl_importance_best <- extract_importance_SL(
  fit = fit, feature_names = x_names, import_type = "best"
)
sl_importance_best
#> # A tibble: 2 × 2
#>   feature  rank
#>   <chr>   <int>
#> 1 V2          1
#> 2 V1          2
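The fractional ranks reported for `import_type = "all"` (1.01 and 1.99), compared with the integer ranks from the best-fitting learner, are consistent with averaging learner-specific ranks using the ensemble weights. The sketch below illustrates that arithmetic under stated assumptions: the weights and the tied rank of 1.5 for `SL.mean` are hypothetical values chosen for illustration, not quantities read from the fitted object, and the actual computation inside `extract_importance_SL` may differ.

```r
# HYPOTHETICAL ensemble weights, chosen for illustration only
weights <- c(SL.glm = 0.98, SL.mean = 0.02)

# HYPOTHETICAL learner-specific ranks: the glm ranks V2 first, while the
# intercept-only learner cannot distinguish features (tied at 1.5)
ranks <- rbind(
  SL.glm  = c(V1 = 2,   V2 = 1),
  SL.mean = c(V1 = 1.5, V2 = 1.5)
)

# weighted average rank per feature: row i of `ranks` is scaled by weights[i]
ensemble_rank <- colSums(ranks * weights)
sort(ensemble_rank)
```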
Finally, to do variable selection, we need to select a threshold (ideally before looking at the data). In this case, since there are only two variables, we choose a threshold of 1.5, which means we will select only one variable:
extrinsic_selected <- extrinsic_selection(
  fit = fit, feature_names = x_names, threshold = 1.5, import_type = "all"
)
extrinsic_selected
#> # A tibble: 2 × 3
#>   feature  rank selected
#>   <chr>   <dbl> <lgl>
#> 1 V2       1.01 TRUE
#> 2 V1       1.99 FALSE
In this case, we select only variable 2.
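The thresholding rule itself can be sketched in a few lines of base R using the ranks printed above. This assumes a feature is selected when its rank is strictly below the threshold; whether flevr uses a strict or non-strict inequality is not stated here, and the distinction does not matter for these values.

```r
# ensemble-level ranks copied from the printed output above
ranks <- c(V2 = 1.01, V1 = 1.99)
threshold <- 1.5

# ASSUMPTION: strict inequality; lower rank means more important
selected <- ranks < threshold
names(ranks)[selected]
```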
Intrinsic variable selection is based on population variable importance (williamson2020c); more details on this procedure are available in the vignette on intrinsic selection. Intrinsic selection also uses the Super Learner under the hood, and requires specifying a useful measure of predictiveness (e.g., R-squared or classification accuracy). The first step in doing intrinsic selection is estimating the variable importance:
set.seed(1234)
# set up a library for SuperLearner
learners <- "SL.glm"
univariate_learners <- "SL.glm"
V <- 2
# estimate the SPVIMs
library("vimp")
est <- suppressWarnings(
  sp_vim(Y = y, X = x, V = V, type = "r_squared",
         SL.library = learners, gamma = .1, alpha = 0.05, delta = 0,
         cvControl = list(V = V), env = environment())
)
est
#> Variable importance estimates:
#>       Estimate  SE         95% CI                  VIMP > 0 p-value
#> s = 1 0.1515809 0.06090463 [0.03221005, 0.2709518] TRUE     1.330062e-03
#> s = 2 0.2990449 0.06565597 [0.17036157, 0.4277282] TRUE     6.863052e-09
This procedure again shows (correctly) that variable 2 is more important than variable 1 in this population.
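To build intuition for what a Shapley-based importance measures, the sketch below computes exact Shapley values for the two features, using in-sample R-squared from linear models as the predictiveness of each feature subset. This is a simplified illustration, not the estimator used by `sp_vim`, which relies on the Super Learner, subset sampling, and sample splitting; the helper `r2` and the `v_*` and `shapley_*` names are introduced here for illustration.

```r
# simulated data matching the example in this guide
set.seed(4747)
n <- 500
x <- replicate(2, stats::rnorm(n, 0, 1))
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)

# in-sample R-squared of a linear model using the feature subset `s`;
# the empty subset is assigned predictiveness 0
r2 <- function(s) {
  if (length(s) == 0) return(0)
  summary(stats::lm(y ~ x[, s, drop = FALSE]))$r.squared
}

v_empty <- r2(integer(0))
v_1 <- r2(1)
v_2 <- r2(2)
v_12 <- r2(c(1, 2))

# with two features, each Shapley value averages two marginal gains:
# adding the feature to the empty set and adding it to the other feature
shapley_1 <- 0.5 * (v_1 - v_empty) + 0.5 * (v_12 - v_2)
shapley_2 <- 0.5 * (v_2 - v_empty) + 0.5 * (v_12 - v_1)
c(V1 = shapley_1, V2 = shapley_2)
```

By construction the two values sum to the R-squared of the full model, so the total predictiveness is split between the features.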
The next step is to choose an error rate to control and a base method for controlling the family-wise error rate. Here, we control the generalized family-wise error rate (gFWER) overall, and use Holm-adjusted p-values as the base method for controlling the family-wise error rate:
intrinsic_set <- intrinsic_selection(
  spvim_ests = est, sample_size = n, alpha = 0.2, feature_names = x_names,
  control = list(quantity = "gFWER", base_method = "Holm", k = 1)
)
intrinsic_set
#> # A tibble: 2 × 6
#>   feature   est       p_value adjusted_p_value  rank selected
#>   <chr>   <dbl>         <dbl>            <dbl> <dbl> <lgl>
#> 1 V1      0.152       0.00133          0.00133     2 TRUE
#> 2 V2      0.299 0.00000000686     0.0000000137     1 TRUE
In this case, we select both variables.
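The Holm step can be reproduced with base R, using the p-values printed by `sp_vim` above. This sketch covers only the Holm adjustment and the comparison against `alpha = 0.2`; it does not reproduce the additional gFWER step controlled by `k`, and the names `p_values` and `adjusted` are introduced for illustration.

```r
# p-values copied from the SPVIM output above
p_values <- c(V1 = 1.330062e-03, V2 = 6.863052e-09)

# Holm adjustment: the smallest of m p-values is multiplied by m,
# the next by m - 1, and so on, then monotonicity is enforced
adjusted <- stats::p.adjust(p_values, method = "holm")
adjusted

# features whose adjusted p-value falls below alpha
alpha <- 0.2
names(adjusted)[adjusted < alpha]
```

The adjusted values agree with the `adjusted_p_value` column above after rounding, and both fall below 0.2.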