class: center, middle, title-slide # Statistical inference and containerization in computational pipelines ### Brian D. Williamson, PhD
Fred Hutchinson Cancer Research Center
### 06 January, 2021
https://bdwilliamson.github.io/#talks
--- <style type="text/css"> .remark-slide-content { font-size: 22px } .remark-code, .remark-inline-code { font-size: 18px; } </style> ## Acknowledgments This work was done in collaboration with: <img src="img/slides_images_part1.png" width="75%" style="display: block; margin: auto;" /> with many thanks to: <img src="img/slides_images_part2.png" width="75%" style="display: block; margin: auto;" /> --- ## Motivation .pull-left[ <img src="img/amp.png" width="75%" style="display: block; margin: auto;" /> .small[ NIH NIAID-sponsored HVTN + HPTN phase 2b HIV prevention efficacy trial; statistical design described in Gilbert et al. (2017) _Stat Commun Infect Dis_ ] <img src="img/neutralized_hiv.png" width="75%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="img/pred_prob_resistance_slapnap.png" width="90%" style="display: block; margin: auto;" /> .small[ Magaret, Benkeser, Williamson et al. (2019) _PLoS Comput Biol_ ] ] --- ## Snapshot: HVTN + HPTN planned bnAb clinical testing\* <table class="table" style="font-size: 18px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> bnAb regimen </th> <th style="text-align:left;"> Clinical Trial(s) </th> <th style="text-align:right;"> Publicly-available observations </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> VRC01 </td> <td style="text-align:left;"> HVTN 703,704/HPTN 081,085 </td> <td style="text-align:right;"> 828 </td> </tr> <tr> <td style="text-align:left;"> VRC07-523-LS </td> <td style="text-align:left;"> HVTN 121, 127/HPTN 087 </td> <td style="text-align:right;"> 402 </td> </tr> <tr> <td style="text-align:left;"> PGT121 </td> <td style="text-align:left;"> HVTN 130/HPTN 089 </td> <td style="text-align:right;"> 656 </td> </tr> <tr> <td style="text-align:left;"> VRC07-523-LS + PGDM1400 </td> <td style="text-align:left;"> HVTN 130/HPTN 089 </td> <td style="text-align:right;"> 402 </td> </tr> <tr> <td style="text-align:left;"> VRC07-523-LS + PGT121 </td> <td style="text-align:left;"> HVTN 130/HPTN 089 and HVTN 136/HPTN 092 </td> <td style="text-align:right;"> 402 </td> </tr> <tr> <td style="text-align:left;"> VRC07-523-LS + 10-1074 </td> <td style="text-align:left;"> HVTN 130/HPTN 089 </td> <td style="text-align:right;"> 402 </td> </tr> <tr> <td style="text-align:left;"> VRC07-523-LS + PGT121 + PGDM1400 </td> <td style="text-align:left;"> HVTN 130/HPTN 089 </td> <td style="text-align:right;"> 402 </td> </tr> <tr> <td style="text-align:left;"> VRC01/PGDM1400-10E8v4 (trispecific) </td> <td style="text-align:left;"> HVTN 129/HPTN 088 </td> <td style="text-align:right;"> 205 </td> </tr> </tbody> </table> \* As of January 2020 --- ## Critical issues in bnAb HIV prevention research For a given bnAb regimen (e.g., 1-, 2-, 3-bnAb, multispecific): -- * what is the regimen's neutralization breadth*? -- * what is the regimen's neutralization potency*? -- .small[ *against HIV-1 viruses circulating in the relevant population] -- We can address these questions with available data and machine learning. --- ## Data structure and notation In general: data `\(Z_1, \ldots, Z_n \stackrel{iid}{\sim} P_0\)` -- * `\(Z_i := (X_i, Y_i);\)` -- * `\(X_i \in \mathbb{R}^p\)` is a vector of covariates; -- * `\(Y_i \in \mathbb{R}\)` is the outcome of interest; -- ‍**Goals**: 1. Do a .green[good job] predicting `\(Y\)` using `\(X\)` -- 2. Understand how the features `\(\{X_j\}_{j = 1}^p\)` .green[affect] predictions -- 3. Ensure approaches to (1) and (2) are .green[reproducible] --- ## Data structure and notation ‍**Goals**: 1. Do a .green[good job] predicting `\(Y\)` using `\(X\)` 2. Understand how the features `\(\{X_j\}_{j = 1}^p\)` .green[affect] predictions 3. Ensure approaches to (1) and (2) are .green[reproducible] ‍**Questions**: -- 1. How to .blue1[assess] prediction performance? -- 2. What do we mean by .blue1[affecting predictions]? -- 3. How to .blue1[encode many decisions] in a pipeline? --- ## Containerization: the problem Trying to run someone else's code: .code-box[ \> library(ggplot2) <br> .red[Error in library(ggplot2) : there is no package called ‘ggplot2’] <br> \> install.packages("ggplot2") <br> .red[configure: error: no acceptable C compiler found in $PATH] <br> ] -- You Google the error, install `gcc`, re-install the package. On the next line: .code-box[ \> library(xgboost) <br> .red[Error in library(xgboost) : there is no package called ‘xgboost’] <br> \> install.packages("xgboost") <br> .red[package ‘xgboost’ is not available (for R version 3.2.0)] ] -- ... so you update `R` -- ... but this breaks the fix for `ggplot2`. --- ## Containerization: the problem Making code reproducible for *you* `\(\neq\)` reproducible for __someone else__: -- * Are the data publicly-available? Or behind a firewall? -- * Can your code run on any operating system? -- * What dependencies (e.g., packages) does your code rely on? -- * Does your code rely on specific versions of software? -- * ... -- We want to package up all __analysis code and software__, allowing us to: * reproduce the analysis in its entirety, anywhere * control the environment exactly --- ## Containerization: the solution Think of a _container_ as a __virtual machine__: -- * has its own operating system -- * has its own file system -- * has its own software applications -- **You** specify all details for the container; users then .blue1[run] the container. -- [Docker](https://www.docker.com/) is a popular containerization program. --- ## Docker process To create and run a container using Docker: -- 1. Write a `Dockerfile`: plain text telling Docker what to put in your _image_. -- 2. Build the image (using `docker build`). -- 3. Run the image (using `docker run`; creates the container) to execute code. --- ## Example: downloading and processing data ‍CATNAP: publicly-available database .small[ [Yoon et al. (2015) _Nuclic Acids Res_] ]: * `\(\text{IC}_{50}\)` and `\(\text{IC}_{80}\)` neutralization values - `\(\text{IC}_{x}\)`: concentration that neutralizes `\(x\)` percent of pseudoviruses -- * several derived outcomes: - for `\(J\)` nAbs, `\(\text{predicted IC}_{k} = \left(\sum_{j=1}^J \text{IC}_{k,j}^{-1}\right)^{-1}\)` - `\(\text{Estimated sensitivity} = \text{predicted IC}_{50} < 1\)` - `\(\text{Multiple sensitivity} = \sum_{j=1}^J(\text{IC}_{50, j} < 1) \geq 2\)` --- ## Example: downloading and processing data ‍CATNAP: publicly-available database .small[ [Yoon et al. (2015) _Nuclic Acids Res_] ]: * Input feature groups: * Amino acid sequence sites: gp120 CD4 binding sites, gp120 V2, gp120 V3, gp41 MPER * Other groups: Region-specific counts of PNG sites, viral geometry, cysteine counts --- ## Example: downloading and processing data **S**uper **L**e**A**rner **P**rediction of **NA**b **P**anels (SLAPNAP): -- * packaged as a Docker container -- * options allow for simple CATNAP data pulling and cleaning -- Example run (on Ubuntu): neutralization data for bnAb VRC01 (used in AMP) ```bash sudo docker pull slapnap/slapnap sudo docker run \ -v "<path to desired output directory>:/home/output/" \ -e nab="VRC01" \ -e return="data" \ slapnap ``` <img src="img/slapnap_datapull.png" width="90%" style="display: block; margin: auto;" /> --- ## Example: under the hood To make the container that pulls CATNAP data, we need to: -- .pull-left[ * specify the operating system * install necessary software * pull data from web (`wget`) * `R` (for data cleaning; specify a version) * `pandoc` (for Rmarkdown metadata) * install specific R packages ] -- .pull-right[ * make directories * store data * store code * store output * copy code into the container * specify user options (e.g., `nab`) * download CATNAP data ] --- ## Example: under the hood <img src="img/dockerfile_init_blank.png" width="90%" style="display: block; margin: auto;" /> --- ## Example: under the hood <img src="img/dockerfile_init_1.png" width="90%" style="display: block; margin: auto;" /> --- ## Example: under the hood <img src="img/dockerfile_init_2.png" width="90%" style="display: block; margin: auto;" /> --- ## Example: under the hood <img src="img/dockerfile_init_3.png" width="90%" style="display: block; margin: auto;" /> --- ## Example: under the hood <img src="img/dockerfile_data-r-libs.png" width="90%" style="display: block; margin: auto;" /> --- ## Example: under the hood <img src="img/dockerfile_opts-nab.png" width="60%" style="display: block; margin: auto;" /> --- ## Example: under the hood <img src="img/dockerfile_data.png" width="100%" style="display: block; margin: auto;" /> --- ## Example: under the hood Example `run_analysis.sh`: <img src="img/run_analysis_data.png" width="100%" style="display: block; margin: auto;" /> --- ## Predicting neutralization sensitivity Predicting neutralization sensitivity (e.g., `\(\text{IC}_{50}\)`) using CATNAP data: -- * want .green[good] predictions -- * how to .blue1[assess] this? -- Goodness of fit! Common measures: -- * Regression: `\(R^2\)` (MSE) -- * Classification: accuracy, AUC -- With simple algorithms (e.g., linear regression): -- * point estimates .blue1[unbiased] -- * .blue2[inference] on population quantity --- ## More complex algorithms Your collaborator says "let's use machine learning!": -- What algorithm to use? -- * lasso, for sparsity? -- * random forests, because trees? -- * neural networks, for complex relationships? -- Why not .blue1[combine] these estimators! --- ## Stacking and super learning ‍.blue1[Stacking] .small[ [e.g., Wolpert (1992), _Neural Networks_] ]: -- * specify a _library_ of learners (e.g., lasso, random forests) * can have varying tuning parameters -- * use cross-validation to build an _ensemble_ of these learners -- ‍.blue2[Super learning (SL)] .small[ [van der Laan et al. (2007), _Stat Appl Genet Mol Biol_] ]: * specific implementation of stacking * nice asymptotic guarantees -- **Questions** * .red[are point estimates of predictivness unbiased]? * .red[can we do inference]? --- ## Cross-validated predictiveness Using .blue1[cross-validated] measures of predictiveness can help. .small[ [e.g., Hubbard et al. (2016), _Int J Biostat_] ] For each of `\(K\)` folds: * learn `\(f_{n,k}\)` (.blue1[fixed]) from training data * `\(P_{n,k}\)`: empirical distribution based on validation data -- ‍Examples: * .small[ .blue1[Population] ] `\({\small \text{CV-}R^2(P_0) = 1 - \frac{K^{-1}\sum_{k=1}^K E_{P_0}\{Y - f_{n,k}(X)\}^2}{var_{P_0}(Y)}}\)` -- * `\({\small \text{CV-}R^2(\{P_{n,k}\}_{k=1}^K) = 1 - \frac{K^{-1}\sum_{k=1}^K E_{P_{n,k}}\{Y - f_{n,k}(X)\}^2}{var_{P_n}(Y)}}\)` -- * .small[ .blue1[Population] ] `\({\small \text{CV-AUC}(P_0) = K^{-1}\sum_{k=1}^K P_0\{f_{n,k}(X_1) < f_{n,k}(X_2) \mid Y_1 = 0, Y_2 = 1\}}\)` ] -- * `\({\small \text{CV-AUC}(\{P_{n,k}\}_{k=1}^K) = K^{-1}\sum_{k=1}^K P_{n,k}\{f_{n,k}(X_1) < f_{n,k}(X_2) \mid Y_1 = 0, Y_2 = 1\}}\)` --- ## Cross-validated predictiveness ‍Pros: * .blue1[simple] estimator * .blue1[inference] based on the efficient influence function (EIF) * good .blue2[small-sample] properties .small[ [Benkeser et al. (2019)] ] -- ‍Cons: * .red[correct target parameter]? --- ## Population predictiveness We can instead view predictiveness as a .blue1[population] quantity. .small[ [e.g., Williamson et al. (2020), _arXiv_] ] -- ‍Define: * `\(V(f, P)\)`: .blue1[predictiveness] of `\(f \in \mathcal{F}\)` under distribution `\(P\)` -- * `\(f_0 := \underset{f \in \mathcal{F}}{\text{arg}\,\text{max}}\, V(f, P_0)\)`: _oracle prediction function_ in `\(\mathcal{F}\)` -- `\(V(f_0, P_0)\)`: _oracle predictiveness_ -- Example measures: * `\(R^2(P_0) = 1 - E_{P_0}\{Y - f_0(X)\}^2 / var_{P_0}(Y)\)` * `\(\text{AUC}(P_0) = P_0\{f_0(X_1) < f_0(X_2) \mid Y_1 = 0, Y_2 = 1\}\)` --- ## Population predictiveness ‍Pros: * .blue1[simple] estimator * .blue1[inference] based on the efficient influence function (EIF) ‍Cons: * .red[need larger sample size] --- ## Example: adding Super Learning to the container Adding the regression task to the container: 1. Create new `R` script that runs SL -- 2. Add to `Dockerfile`: * install `R` packages (e.g., `SuperLearner`, `xgboost`, `glmnet`) * copy new `R` script into container * make executable -- 3. Add to `run_analysis.sh`: * call new `R` script --- ## Example: Super Learning ```bash sudo docker run \ -v "<path to desired output directory>:/home/output/" \ -e nab="VRC01" \ -e learners="lasso;rf;xgboost" \ -e nfolds="5" \ -e return="data;learner" \ slapnap ``` <img src="img/slapnap_learner.png" width="100%" style="display: block; margin: auto;" /> --- ## Understanding predictions Distinguishing between contributions of predictions... -- * (.blue1[extrinsic importance]) ... .blue1[by a given (possibly black-box) algorithm] .small[ [e.g., Breiman, (2001)] ] -- * (.blue1[intrinsic importance]) ... .blue1[by best possible (i.e., oracle) algorithm] .small[ [e.g., van der Laan (2006)] ] -- Example use of .blue1[extrinsic] variable importance: * how does my algorithm use a given covariate in predictions? -- Example use of .blue2[intrinsic] variable importance: * is it worth collecting a given covariate for the sake of predicting neutralization sensitivity? --- ## Inference on population variable importance .blue1[Intrinsic] importance can be defined using _oracle predictiveness_. -- ‍Define: * `\(f_{0,s} := \underset{f \in \mathcal{F}_{-s}}{\text{arg}\,\text{max}}\, V(f, P_0)\)`: _oracle prediction function_ in `\(\mathcal{F}_{-s}\)` * `\(V(f_{0,s}, P_0)\)`: _residual oracle predictiveness_ -- .green[Importance]: `\(V(f_0, P_0) - V(f_{0,s}, P_0)\)` -- Valid estimation and inference .blue1[using simple, plug-in estimators]! .small[ [Williamson et al. (2020), _Biometrics_; Williamson et al. (2020), _arXiv_; extended in Williamson and Feng (2020), _ICML_] ] --- ## Example: adding variable importance to the container Adding the regression task to the container: 1. Add additional regressions to SL `R` script -- 2. Create new `R` script that estimates importance -- 3. Add to `Dockerfile`: * install `R` package `vimp` * copy new `R` script into container * make executable -- 4. Add to `run_analysis.sh`: * call new `R` script --- ## Example: putting it all together ```bash sudo docker run \ -v "<path to desired output directory>:/home/output/" \ -e nab="VRC01" \ -e learners="lasso;rf;xgboost" \ -e nfolds="5" \ -e importance_ind="marg;pred" \ -e importance_grp="marg;cond" \ -e return="data;learner;report" \ slapnap ``` <img src="img/slapnap_importance-report.png" width="100%" style="display: block; margin: auto;" /> [example full report](https://benkeser.github.io/slapnap/reports/report_VRC01.html) --- ## Studying SLAPNAP performance We evaluated SLAPNAP in two analyses: -- * Studying .blue1[all single bnAbs] in CATNAP -- Compared against two methods: * Hake and Pfeifer (2017): support vector machines * Rawi et al. (2019): gradient boosted trees -- * Studying .green[combination regimens] with sufficient data -- Includes several regimens currently in clinical testing --- ## Single-bnAb performance <img src="img/single-nab_auc_fig.png" width="100%" style="display: block; margin: auto;" /> -- .pull-left[ Overall median CV-AUCs: 0.71 (HP), 0.84 (Rawi), 0.8 (SLAPNAP) ] .pull-right[ No clear winner over all epitopes ] --- ## Combination-regimen performance <img src="img/multi-nab_auc_fig.png" width="70%" style="display: block; margin: auto;" /> --- ## Future work Aspects of this approach are being adopted in several current studies: * HVTN immune correlates analyses * CoVPN immune correlates analyses * WHI metabolomics studies -- Avenues for future work: * web-based tools * adding learners to SLAPNAP * variable selection (my postdoctoral focus) * other variable importance measures (e.g., time-varying AUC) --- ## Closing thoughts * .blue1[Containerization] can help ensure reproducibility -- * .blue1[Inference] in the context of machine learning: * prediction performance: .green[rigorous assessment] * variable importance: .green[understanding contributions] in prediction -- * HIV bnAb pipeline implemented in open-source software `SLAPNAP` -- [source code](https://github.com/benkeser/slapnap) | [Docker image](https://hub.docker.com/r/slapnap/slapnap) | [preprint (bioRxiv)](https://www.biorxiv.org/content/10.1101/2020.06.23.167718v1)
https://github.com/bdwilliamson
https://bdwilliamson.github.io --- ## References * .small[ Benkeser D, Peterson M, and van der Laan MJ. 2019. Improved Small-Sample Estimation of Nonlinear Cross-Validated Prediction Metrics. _Journal of the American Statistical Association_. ] * .small[ Breiman, L. 2001. Random forests. _Machine Learning_.] * .small[ Gilbert PB, Juraska M, deCamp AC, Karuna S, et al. 2017. Basis and Statistical Design of the Passive HIV-1 Antibody Mediated Prevention (AMP) Test-of-Concept Efficacy Trials. _Statistical Communications in Infectious Diseases_. ] * .small[ Hubbard AE, Kherad-Pajouh S, and van der Laan MJ. 2016. Statistical Inference for Data Adaptive Target Parameters. _International Journal of Biostatistics_. ] * .small[ Magaret CA, Benkeser DC, Williamson BD, et al. 2019. Prediction of VRC01 neutralization sensitivity by HIV-1 gp160 sequence features. _PLoS Computational Biology_. ] * .small[ van der Laan, MJ. 2006. Statistical inference for variable importance. _The International Journal of Biostatistics_.] * .small[ van der Laan MJ, Polley EC, and Hubbard AE. 2007. Super Learner. _Statistical Applications in Genetics and Molecular Biology_. ] --- ## References * .small[ Williamson B, Gilbert P, Carone M, and Simon N. 2020. Nonparametric variable importance assessment using machine learning techniques (+ rejoinder to discussion). _Biometrics_. ] * .small[ Williamson B, Gilbert P, Simon N, and Carone M. 2020. A unified approach for inference on algorithm-agnostic variable importance. _ArXiv technical report_. ] * .small[ Williamson B and Feng J. 2020. Efficient nonparametric statistical inference on population feature importance using Shapley values. _ICML_. ] * .small[ Wolpert DH. 1992. Stacked generalization. _Neural Networks_. ] * .small[ Yoon H, Macke J, West AP, et al. 2015. CATNAP: a tool to compile, analyze and tally neutralizing antibody panels. _Nucleic Acids Research_. ]