---
title: "Introduction to `flevr`"
author: "Brian D. Williamson"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to `flevr`}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
references:
- id: vanderlaan2007
  title: Super learner
  author:
  - family: van der Laan
    given: MJ
  - family: Polley
    given: EC
  - family: Hubbard
    given: AE
  volume: 6
  container-title: Statistical Applications in Genetics and Molecular Biology
  type: article-journal
  issued:
   year: 2007
- id: williamson2020c
  title: Efficient nonparametric statistical inference on population feature importance using Shapley values
  author:
  - family: Williamson
    given: Brian D
  - family: Feng
    given: Jean
  publisher: arXiv
  type: article-journal
  issued:
   year: 2020
  URL: https://arxiv.org/abs/2006.09481
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(flevr)
```

## Introduction

`flevr` is an `R` package for variable selection based on flexible ensembles. The package provides functions for extrinsic variable selection using the [Super Learner](https://github.com/ecpolley/SuperLearner) and for intrinsic variable selection using the Shapley Population Variable Importance Measure ([SPVIM](https://github.com/bdwilliamson/vimp)).

The author and maintainer of the `flevr` package is [Brian Williamson](https://bdwilliamson.github.io). For details on the method, check out our preprint.

## Installation

You can install a development release of `flevr` from GitHub via `devtools` by running the following code:
```{r install, eval = FALSE}
# install devtools if you haven't already
# install.packages("devtools", repos = "https://cloud.r-project.org")
devtools::install_github(repo = "bdwilliamson/flevr")
```

## Quick start

This section serves as a quick guide to using the `flevr` package: we cover the main functions for extrinsic and intrinsic variable selection using a simulated data example. More details are given in the specific vignettes for [extrinsic selection](extrinsic_selection.html) and [intrinsic selection](intrinsic_selection.html).

First, we create some data:
```{r gen-data}
# generate the data -- note that this is a simple setting, for speed
set.seed(4747)
p <- 2
n <- 500
# generate features
x <- replicate(p, stats::rnorm(n, 0, 1))
x_df <- as.data.frame(x)
x_names <- names(x_df)
# generate outcomes
y <- 1 + 0.5 * x[, 1] + 0.75 * x[, 2] + stats::rnorm(n, 0, 1)
```

This creates a matrix of covariates `x` with 2 columns and a vector `y` of normally-distributed outcome values for a sample of `n = 500` study participants.

There are two main types of variable selection available in `flevr`: extrinsic and intrinsic. Extrinsic selection is the most common type of variable selection: in this approach, a given algorithm (and perhaps its associated algorithm-specific variable importance) is used for variable selection. The lasso is a widely-used example of extrinsic selection. Intrinsic selection, on the other hand, uses estimated intrinsic variable importance (a population quantity) to perform variable selection. This intrinsic importance is both defined and estimated in a model-agnostic manner.
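
To make the extrinsic idea concrete, here is a minimal base-`R` sketch that does not use `flevr`. It fits a single algorithm (ordinary least squares, swapped in for the lasso to avoid extra dependencies) and ranks features by an algorithm-specific importance, the absolute t-statistic. The toy data and all `toy_*` names are assumptions made for illustration only.

```{r sketch-extrinsic-rank}
# illustrative only: toy data and toy_* names are not part of flevr
set.seed(2024)
toy_n <- 200
toy_x <- data.frame(V1 = stats::rnorm(toy_n), V2 = stats::rnorm(toy_n))
toy_y <- 1 + 0.5 * toy_x$V1 + 0.75 * toy_x$V2 + stats::rnorm(toy_n)
# fit a single algorithm (ordinary least squares)
toy_fit <- stats::lm(toy_y ~ V1 + V2, data = toy_x)
# algorithm-specific importance: absolute t-statistic of each slope
toy_t <- abs(summary(toy_fit)$coefficients[-1, "t value"])
# convert importance to ranks, where rank 1 is the most important feature
toy_rank <- rank(-toy_t)
toy_rank
```

The ranking here depends entirely on the chosen algorithm, which is what distinguishes extrinsic from intrinsic importance.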

### Extrinsic variable selection

We recommend using the Super Learner [@vanderlaan2007] to do extrinsic variable selection to protect against model misspecification; more details on this procedure are available in the vignette on [extrinsic selection](extrinsic_selection.html). This requires specifying a _library_ of _candidate learners_ (e.g., lasso, random forests). We can do this in `flevr` using the following code:
```{r sl-fit-and-imp}
set.seed(1234)
# fit a Super Learner ensemble; note its simplicity, for speed
library("SuperLearner")
learners <- c("SL.glm", "SL.mean")
V <- 2
fit <- SuperLearner::SuperLearner(Y = y, X = x_df,
                                  SL.library = learners,
                                  cvControl = list(V = V))
# extract importance based on the whole Super Learner
sl_importance_all <- extract_importance_SL(
  fit = fit, feature_names = x_names, import_type = "all"
)
sl_importance_all
```
These results suggest that feature 2 is more important than feature 1 within the Super Learner ensemble (since a lower rank is better). If we want to scrutinize the importance of features within the best-fitting algorithm in the Super Learner ensemble, we can do the following:
```{r sl-best-alg}
sl_importance_best <- extract_importance_SL(
  fit = fit, feature_names = x_names, import_type = "best"
)
sl_importance_best
```
Finally, to do variable selection, we need to select a threshold (ideally before looking at the data). In this case, since there are only two variables, we choose a threshold of 1.5, which means we will select only one variable:
```{r extrinsic-selection}
extrinsic_selected <- extrinsic_selection(
  fit = fit, feature_names = x_names, threshold = 1.5, import_type = "all"
)
extrinsic_selected
```
In this case, we select only variable 2.
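
The effect of a rank threshold can be sketched in a few lines of base `R`. The ranks below are hypothetical values chosen to mirror the result above, and the rule "keep features whose rank is below the threshold" is an assumption made for illustration, not a description of the internals of `extrinsic_selection`.

```{r sketch-rank-threshold}
# hypothetical ranks for two features (rank 1 = most important)
example_ranks <- c(V1 = 2, V2 = 1)
example_threshold <- 1.5
# keep features whose rank falls below the threshold
example_selected <- names(example_ranks)[example_ranks < example_threshold]
example_selected
```

With two features, any threshold strictly between 1 and 2 keeps exactly one of them.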

### Intrinsic variable selection

Intrinsic variable selection is based on population variable importance [@williamson2020c]; more details on this procedure are available in the vignette on [intrinsic selection](intrinsic_selection.html). Intrinsic selection also uses the Super Learner under the hood, and requires specifying a useful _measure of predictiveness_ (e.g., R-squared or classification accuracy). The first step in doing intrinsic selection is estimating the variable importance:
```{r fit-spvim}
set.seed(1234)

# set up a library for SuperLearner
learners <- "SL.glm"
univariate_learners <- "SL.glm"
V <- 2

# estimate the SPVIMs
library("vimp")
est <- suppressWarnings(
  sp_vim(Y = y, X = x, V = V, type = "r_squared",
         SL.library = learners, gamma = .1, alpha = 0.05, delta = 0,
         cvControl = list(V = V), env = environment())
)
est
```
This procedure again shows (correctly) that variable 2 is more important than variable 1 in this population. 
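
For intuition about what a Shapley-based importance measures, the following sketch computes Shapley values of R-squared by hand for two features. It is a simplified plug-in illustration using in-sample R-squared from linear models, not the estimator implemented in `sp_vim` (which uses the Super Learner and cross-fitting). The toy data and all names (`shap_df`, `r2`, `shap_vals`) are assumptions for illustration.

```{r sketch-shapley-r2}
# illustrative only: in-sample R-squared from linear models, not the sp_vim estimator
set.seed(2024)
shap_n <- 500
shap_df <- data.frame(V1 = stats::rnorm(shap_n), V2 = stats::rnorm(shap_n))
shap_df$y <- 1 + 0.5 * shap_df$V1 + 0.75 * shap_df$V2 + stats::rnorm(shap_n)
# predictiveness (R-squared) of a model using a given subset of features
r2 <- function(formula) summary(stats::lm(formula, data = shap_df))$r.squared
v_none <- 0 # the intercept-only model has R-squared 0
v_1 <- r2(y ~ V1)
v_2 <- r2(y ~ V2)
v_both <- r2(y ~ V1 + V2)
# Shapley value: average gain from adding a feature, over both orders of entry
shap_vals <- c(
  V1 = 0.5 * (v_1 - v_none) + 0.5 * (v_both - v_2),
  V2 = 0.5 * (v_2 - v_none) + 0.5 * (v_both - v_1)
)
shap_vals
# the Shapley values sum to the predictiveness of the full model
all.equal(sum(shap_vals), v_both)
```

Because the values add up to the full model's R-squared, each one can be read as that feature's share of the total predictiveness.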

The next step is to choose an error rate to control and a base method for controlling the family-wise error rate. Here, we control the generalized family-wise error rate (gFWER) overall, using Holm-adjusted p-values as the base method for family-wise error rate control:
```{r intrinsic-selection}
intrinsic_set <- intrinsic_selection(
  spvim_ests = est, sample_size = n, alpha = 0.2, feature_names = x_names,
  control = list(quantity = "gFWER", base_method = "Holm", k = 1)
)
intrinsic_set
```
In this case, we select both variables.
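
The two-step logic behind this kind of selection can be sketched in base `R`. The p-values below are hypothetical (they are not output from `flevr`), and the augmentation rule shown, adding the `k` most significant features not already selected, is a common way to move from family-wise to generalized family-wise error control; we assume it here for illustration rather than claim it matches the internals of `intrinsic_selection`.

```{r sketch-holm-gfwer}
# hypothetical p-values for four features (not output from flevr)
toy_p <- c(V1 = 0.30, V2 = 0.001, V3 = 0.04, V4 = 0.60)
toy_alpha <- 0.2
toy_k <- 1
# step 1: Holm adjustment controls the family-wise error rate
toy_p_holm <- stats::p.adjust(toy_p, method = "holm")
base_set <- names(toy_p_holm)[toy_p_holm <= toy_alpha]
# step 2: augment with the k most significant features not yet selected
remaining <- setdiff(names(sort(toy_p)), base_set)
augmented_set <- c(base_set, utils::head(remaining, toy_k))
base_set
augmented_set
```

Larger values of `toy_k` tolerate more false selections in exchange for a larger selected set.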