--- title: "Introduction to `flevr`" author: "Brian D. Williamson" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to `flevr`} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} references: - id: vanderlaan2007 title: Super learner author: - family: van der Laan given: MJ - family: Polley given: EC - family: Hubbard given: AE volume: 6 publisher: Statistical Applications in Genetics and Molecular Biology type: article-journal issued: year: 2007 - id: williamson2020c title: Efficient nonparametric statistical inference on population feature importance using Shapley values author: - family: Williamson given: Brian D - family: Feng given: Jean publisher: arXiv type: article-journal issued: year: 2020 URL: https://arxiv.org/abs/2006.09481 --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(flevr) ``` ## Introduction `flevr` is a `R` package for doing variable selection based on flexible ensembles. The package provides functions for extrinsic variable selection using the [Super Learner](https://github.com/ecpolley/SuperLearner) and for intrinsic variable selection using the Shapley Population Variable Importance Measure ([SPVIM](https://github.com/bdwilliamson/vimp)). The author and maintainer of the `flevr` package is [Brian Williamson](https://bdwilliamson.github.io). For details on the method, check out our preprint. ## Installation You can install a development release of `flevr` from GitHub via `devtools` by running the following code: ```{r install, eval = FALSE} # install devtools if you haven't already # install.packages("devtools", repos = "https://cloud.r-project.org") devtools::install_github(repo = "bdwilliamson/flevr") ``` ## Quick start This section should serve as a quick guide to using the `flevr` package --- we will cover the main functions for doing extrinsic and intrinsic variable selection using a simulated data example. More details are given in the specific vignettes for [extrinsic selection](extrinsic_selection.html) and [intrinsic selection](intrinsic_selection.html). First, we create some data: ```{r gen-data} # generate the data -- note that this is a simple setting, for speed set.seed(4747) p <- 2 n <- 500 # generate features x <- replicate(p, stats::rnorm(n, 0, 1)) x_df <- as.data.frame(x) x_names <- names(x_df) # generate outcomes y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1) ``` This creates a matrix of covariates `x` with 2 columns and a vector `y` of normally-distributed outcome values for a sample of `n = 500` study participants. There are two main types of variable selection available in `flevr`: extrinsic and intrinsic. Extrinsic selection is the most common type of variable selection: in this approach, a given algorithm (and perhaps its associated algorithm-specific variable importance) is used for variable selection. The lasso is a widely-used example of extrinsic selection. Intrinsic selection, on the other hand, uses estimated intrinsic variable importance (a population quantity) to perform variable selection. This intrinsic importance is both defined and estimated in a model-agnostic manner. ### Extrinsic variable selection We recommend using the Super Learner @ref(vanderlaan2007) to do extrinsic variable selection to protect against model misspecification; more details on this procedure are available in the vignette on [extrinsic selection](extrinsic_selection.html). This requires specifying a _library_ of _candidate learners_ (e.g., lasso, random forests). We can do this in `flevr` using the following code: ```{r sl-fit-and-imp} set.seed(1234) # fit a Super Learner ensemble; note its simplicity, for speed library("SuperLearner") learners <- c("SL.glm", "SL.mean") V <- 2 fit <- SuperLearner::SuperLearner(Y = y, X = x_df, SL.library = learners, cvControl = list(V = V)) # extract importance based on the whole Super Learner sl_importance_all <- extract_importance_SL( fit = fit, feature_names = x_names, import_type = "all" ) sl_importance_all ``` These results suggest that feature 2 is more important than feature 1 within the Super Learner ensemble (since a lower rank is better). If we want to scrutinize the importance of features within the best-fitting algorithm in the Super Learner ensemble, we can do the following: ```{r sl-best-alg} sl_importance_best <- extract_importance_SL( fit = fit, feature_names = x_names, import_type = "best" ) sl_importance_best ``` Finally, to do variable selection, we need to select a threshold (ideally before looking at the data). In this case, since there are only two variables, we choose a threshold of 1.5, which means we will select only one variable: ```{r extrinsic-selection} extrinsic_selected <- extrinsic_selection( fit = fit, feature_names = x_names, threshold = 1.5, import_type = "all" ) extrinsic_selected ``` In this case, we select only variable 2. ### Intrinsic variable selection Intrinsic variable selection is based on population variable importance @ref(williamson2020c); more details on this procedure are available in the vignette on [intrinsic selection](intrinsic_selection.html). Intrinsic selection also uses the Super Learner under the hood, and requires specifying a useful _measure of predictiveness_ (e.g., R-squared or classification accuracy). The first step in doing intrinsic selection is estimating the variable importance: ```{r fit-spvim} set.seed(1234) # set up a library for SuperLearner learners <- "SL.glm" univariate_learners <- "SL.glm" V <- 2 # estimate the SPVIMs library("vimp") est <- suppressWarnings( sp_vim(Y = y, X = x, V = V, type = "r_squared", SL.library = learners, gamma = .1, alpha = 0.05, delta = 0, cvControl = list(V = V), env = environment()) ) est ``` This procedure again shows (correctly) that variable 2 is more important than variable 1 in this population. The next step is to choose an error rate to control and a method for controlling the family-wise error rate. Here, we choose the generalized family-wise error rate to control overall and choose Holm-adjusted p-values to control the individual family-wise error rate: ```{r intrinsic-selection} intrinsic_set <- intrinsic_selection( spvim_ests = est, sample_size = n, alpha = 0.2, feature_names = x_names, control = list( quantity = "gFWER", base_method = "Holm", k = 1) ) intrinsic_set ``` In this case, we select both variables.