Emphasising the Practical Usefulness of R Packages in Research


Gianluca Baio

Department of Statistical Science   |   University College London

g.baio@ucl.ac.uk


https://gianluca.statistica.it

https://egon.stats.ucl.ac.uk/research/statistics-health-economics

https://github.com/giabaio   https://github.com/StatisticsHealthEconomics  

@gianlubaio@mas.to     @gianlubaio    


EsCHER’s R-evolution: Mastering R Package Development for Health Economics

Erasmus University Rotterdam

23 April 2024

Check out our departmental podcast “Random Talks” on Soundcloud!

Follow our departmental social media accounts + magazine “Sample Space”

Disclaimer…

(The real) Erasmus?…

Outline

  1. The evolution of R packages
    • Historical perspective, development and impact of R packages in statistics and beyond
  1. Package management tools
    • Decoding :: and ::: operators in R: Usage, importance and differences
    • packrat and/or renv: functionality, strengths, and best usage scenarios
  1. Installation and sharing techniques across various platforms
    • Detailed guidance on installing R packages from sources like CRAN, GitHub, and local/zip files, ensuring a wide accessibility
  1. Version control in package development
    • Effective version control, feature requests, pull requests
  1. Vignettes vs Tutorials
    • Roles and differences, and how to effectively create each for user guidance and education

R and packages

https://www.r-project.org/about.html

A little bit of history…

(Kind of stolen from here)

  • 1976: John Chambers at Bell Laboratories develops S
    • Mainly a bunch of Fortran libraries to do internal “statistical analysis”
  • 1988: System re-written in C (v. 3)
  • 1993: Bell Labs sells exclusive license to develop S to StatSci (now Insightful Corp)
    • Insightful markets the commercial version S-Plus, building in a bunch of fancy features
  • 1998: v.4 released — pretty much current version

In the meantime…

  • 1991: Ross Ihaka + Robert Gentleman develop R using syntax that is very similar to S
    • Effectively, R is a spinoff from S — but there are some fundamental distinctions among them…
  • 1995: Adopt the GNU General Public License to make R free software
  • 1996/7: Public mailing list created (R-help and R-devel) + R-Core Group formed
    • The Core Group controls (and is allowed to modify) the source code for R
  • 2000: v1.0.0 release

Main features of R

  • Runs on pretty much any computing platform/OS
  • Active development with huge community
    • Great help/documentation facility…
  • Structured in modular packages
  • Excellent graphical capabilities
    • Still much, much better than most other stats software!
  • Free

What’s not to like?…

  • Technology is fundamentally a bit old
    • Basically the core of the software has not changed much in the last 40 years…
  • Nobody to complain with if things go wrong
    • One of the reasons bodies such as the FDA often still insist on corporation software (eg SAS)…
  • Technical issue: objects must be loaded in the computer physical memory
    • What about “big data”?…

Packages

  • The common R installation comes with a basic core of functions and functionalities (base)
  • Then there are a gazillion packages you can install if/when you need them

  • An R package is a collection of functions that are bundled together in a way that lets them be easily shared
  • Usually these functions are designed to work together to complete a specific task such as analysing a particular kind of data

Packages

https://r-pkgs.org/introduction.html

Why and how to write a package?

  • Often, you don’t set out to write a package, as if it were the calling of your life…

Californian band that was big for like 20 minutes in the very early 2000s

2014 Canadian crime thriller film adapted from the 2008 novel of the same name by Michael Redhill

Why and how to write a package?

  • Often, you don’t set out to write a package, as if it were the calling of your life…
  • In fact, you often start writing your (ugly…) functions because you have a specific problem and need to solve it1
## Runs the C/E analysis, given matrices of costs and effects from BUGS model

## Differential effects and costs
delta.e <- e[,2] - e[,1]
delta.c <- c[,2] - c[,1]

## Plots the C/E plane
m.e <- range(delta.e)[1]
M.e <- range(delta.e)[2]
m.c <- range(delta.c)[1]
M.c <- range(delta.c)[2]
step <- (M.e-m.e)/10
m.e <- ifelse(m.e<0,m.e,-m.e)
m.c <- ifelse(m.c<0,m.c,-m.c)
x.pt <- .95*m.e
y.pt <- ifelse(x.pt*wtp<m.c,m.c,x.pt*wtp)
xx <- seq(100*m.c/wtp,100*M.c/wtp,step)
yy <- xx*wtp
xx[1] <- ifelse(min(xx)<m.e,xx[1],2*m.e)
yy[1] <- ifelse(min(yy)<m.c,yy[1],2*m.c)
xx[length(xx)] <- ifelse(xx[length(xx)]<M.e,1.5*M.e,xx[length(xx)])
if(!is.null(xlim)) {m.e <- xlim[1]; M.e <- xlim[2]}
if(!is.null(ylim)) {m.c <- ylim[1]; M.c <- ylim[2]}
plot(xx,yy,col="white",xlim=c(m.e,M.e),ylim=c(m.c,M.c),xlab=xlab,ylab=ylab,main=title,axes=F)
...

Why and how to write a package?

  • Often, you don’t set out to write a package, as if it were the calling of your life…
  • In fact, you often start writing your (ugly…) functions because you have a specific problem and need to solve it…
  • And then you realise you are using the same functions over and over and over…
    • But you have to slightly adapt it every time, because the data are not quite the same, or you want a slightly different output, or you want a new output for your next paper…

That’s when you know you need to turn it all into a package!

https://r4ds.had.co.nz/functions.html#when-should-you-write-a-function

Why and how to write a package?

  • Back in the days, writing a package was a massive pain in the neck
    • Documentation was a bit complicated
    • You needed to write a lot of different things manually and using sometimes awkward formats
    • You would never remember which way things needed to be done and always had to go back to a previously done package to know what to do…
  • In the last few years, things have become much better!

Enter devtools & roxygen

Creating a package

# Uses `devtools` to create the package skeleton
devtools::create("testPackage",open = TRUE)
✔ Creating 'testPackage/'
✔ Setting active project to '/home/gianluca/erasmus/testPackage'
✔ Creating 'R/'
✔ Writing 'DESCRIPTION'
Package: testPackage
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R (parsed):
    * First Last <first.last@example.com> [aut, cre] (YOUR-ORCID-ID)
Description: What the package does (one paragraph).
License: `use_mit_license()`, `use_gpl3_license()` or friends to
    pick a license
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1
✔ Writing 'NAMESPACE'
✔ Writing 'testPackage.Rproj'
✔ Adding '^testPackage\\.Rproj$' to '.Rbuildignore'
✔ Adding '.Rproj.user' to '.gitignore'
✔ Adding '^\\.Rproj\\.user$' to '.Rbuildignore'
✔ Opening '/home/gianluca/erasmus/testPackage/' in new RStudio session
✔ Setting active project to '<no active project>'
  • This creates the skeleton for your new testPackage
    • DESCRIPTION
    • NAMESPACE
    • R folder

(Aside) :: vs ::: vs library()

NAMESPACE etc

Functions in a package can be accessed in different ways (and giving/using different levels of access)

  • You (the developer) can choose what you make available (directly) for your users
    • Use the NAMESPACE
    • That’s a file in which you explicitly “export” functions from your package to the outside world
    • In reality, devtools can do a lot of the heavy-lifting for you…

Example: BCEA NAMESPACE file

# Generated by roxygen2: do not edit by hand

S3method("CEriskav<-",bcea)
S3method("CEriskav<-",default)
S3method("mixedAn<-",bcea)
S3method("mixedAn<-",default)
S3method("setComparisons<-",bcea)
S3method("setComparisons<-",default)
S3method("setKmax<-",bcea)
S3method("setKmax<-",default)
S3method("setReferenceGroup<-",bcea)
S3method("setReferenceGroup<-",default)
S3method(bcea,bugs)
S3method(bcea,default)
S3method(bcea,rjags)
S3method(bcea,rstan)
S3method(ceac.plot,bcea)
S3method(ceaf.plot,pairwise)
S3method(ceef.plot,bcea)
S3method(ceplane.plot,bcea)
S3method(contour,bcea)
S3method(contour2,bcea)
S3method(createInputs,bugs)
S3method(createInputs,data.frame)
S3method(createInputs,default)
S3method(createInputs,numeric)
S3method(createInputs,rjags)
S3method(createInputs,stanfit)
S3method(eib.plot,bcea)
S3method(evi.plot,bcea)
S3method(evi.plot,mixedAn)
S3method(evppi,bcea)
S3method(evppi,default)
S3method(ib.plot,bcea)
S3method(info.rank,bcea)
S3method(line_labels,default)
S3method(line_labels,pairwise)
S3method(multi.ce,bcea)
S3method(plot,CEriskav)
S3method(plot,bcea)
S3method(plot,evppi)
S3method(plot,mixedAn)
S3method(print,bcea)
S3method(sim_table,bcea)
S3method(sim_table,default)
S3method(summary,bcea)
S3method(summary,mixedAn)
S3method(summary,pairwise)
export("CEriskav<-")
export("mixedAn<-")
export("setComparisons<-")
export("setKmax<-")
export("setReferenceGroup<-")
export(bcea)
export(best_interv_given_k)
export(ce_table)
export(ceac.plot)
export(ceaf.plot)
export(ceef.plot)
export(ceplane.plot)
export(compute_CEAC)
export(compute_EIB)
export(compute_EVI)
export(compute_IB)
export(compute_ICER)
export(compute_U)
export(compute_Ustar)
export(compute_kstar)
export(compute_ol)
export(compute_vi)
export(contour)
export(contour2)
export(createInputs)
export(diag.evppi)
export(eib.plot)
export(evi.plot)
export(evppi)
export(ib.plot)
export(info.rank)
export(info_rank_plotly)
export(is.bcea)
export(line_labels)
export(make.report)
export(make_legend_plotly)
export(mce.plot)
export(multi.ce)
export(new_bcea)
export(prep_ceplane_params)
export(setComparisons)
export(sim_table)
export(struct.psa)
export(tabulate_means)
export(validate_bcea)
import(dplyr)
import(ggplot2)
import(grid)
import(reshape2)
importFrom(MASS,kde2d)
importFrom(MCMCvis,MCMCchains)
importFrom(Matrix,Matrix)
importFrom(Matrix,expm)
importFrom(Rdpack,reprompt)
importFrom(cli,cli_alert_warning)
importFrom(dplyr,arrange)
importFrom(dplyr,desc)
importFrom(dplyr,filter)
importFrom(dplyr,mutate)
importFrom(dplyr,slice)
importFrom(grDevices,colors)
importFrom(grDevices,colours)
importFrom(grDevices,dev.new)
importFrom(grDevices,devAskNewPage)
importFrom(grDevices,grey.colors)
importFrom(grDevices,pdf.options)
importFrom(grDevices,ps.options)
importFrom(graphics,abline)
importFrom(graphics,axis)
importFrom(graphics,barplot)
importFrom(graphics,box)
importFrom(graphics,contour)
importFrom(graphics,legend)
importFrom(graphics,lines)
importFrom(graphics,matlines)
importFrom(graphics,matplot)
importFrom(graphics,par)
importFrom(graphics,points)
importFrom(graphics,polygon)
importFrom(graphics,rect)
importFrom(graphics,text)
importFrom(grid,unit)
importFrom(gridExtra,grid.arrange)
importFrom(purrr,keep)
importFrom(purrr,list_cbind)
importFrom(purrr,map)
importFrom(purrr,map_int)
importFrom(purrr,map_lgl)
importFrom(purrr,pluck)
importFrom(reshape2,melt)
importFrom(rlang,.data)
importFrom(rlang,int)
importFrom(rstan,extract)
importFrom(scales,label_dollar)
importFrom(stats,as.formula)
importFrom(stats,cov)
importFrom(stats,density)
importFrom(stats,lm)
importFrom(stats,qnorm)
importFrom(stats,qqline)
importFrom(stats,qqnorm)
importFrom(stats,quantile)
importFrom(stats,reorder)
importFrom(stats,rnorm)
importFrom(stats,runif)
importFrom(stats,sd)
importFrom(stats,setNames)
importFrom(utils,methods)
importFrom(utils,modifyList)
importFrom(utils,str)
importFrom(utils,tail)

  1. Functions that are exported are directly available when you load a package onto your workspace, using the library(package) command

    • So you can do something like

      library(BCEA)
      m=bcea(eff,cost,ref=2,interventions=c("Status quo","New drug"),...)

      and bcea is available for you to use…

    • OR, even without loading the package (assuming it’s installed in your R installation), you can still access the exposed functions using the “::” notation

      BCEA::ceplane.plot(m)

  1. Functions that are not exposed (so the user may not even know they exist…), can still be accessed using the “:::” notation

    BCEA:::add_ceplane_icer()

Doing the package

  • Now start editing the DESCRIPTION file (to add your details and more info on the project/package)

Doing the package

  • Now start editing the DESCRIPTION file (to add your details and more info on the project/package)
  • And, in the R folder, start writing your functions is suitable .R files
#' Helper function to make the Kaplan-Meier analysis of the underlying data
#' for a given formula and dataset
#' 
#' @param formula a formula specifying the model to be used, in the form
#' \code{Surv(time,event)~treatment[+covariates]} in flexsurv terms, or
#' \code{inla.surv(time,event)~treatment[+covariates]} in INLA terms.
#' @param method A string specifying the inferential method (\code{'mle'},
#' \code{'inla'} or \code{'hmc'}). If \code{method} is set to \code{'hmc'},
#' then \code{survHE} will write suitable model code in the Stan language
#' (according to the specified distribution), prepare data and initial values
#' and then run the model.
#' @param data A data frame containing the data to be used for the analysis.
#' This must contain data for the 'event' variable. In case there is no
#' censoring, then \code{event} is a column of 1s.
#' @return \item{ObjSurvfit}{A 'rms::npsurv' estimate of the KM curves}.
#' @author Gianluca Baio
#' @seealso fit.models
#' @template refs
#' @keywords Kaplan-Meier estimate 
#' @noRd 
#' 
make_KM <- function(formula,data) {
  km.formula <- as.formula(gsub("inla.surv","Surv",deparse(formula)))
  # Computes the Kaplan Meier curve using the package "rms"
  ObjSurvfit <- rms::npsurv(      # Uses the function "npsurv" from the package "rms"
    formula = km.formula,         # to fit the model specified in the "formula" object
    data = data                   # to the dataset named "data"
  )
  return(ObjSurvfit)
}

Doing the package

In the file man-roxygen/refs.R

#' @references G Baio (2019). survHE: Survival analysis for health economic evaluation 
#' and cost-effectiveness modelling. Journal of Statistical Software (2020). vol 95,
#' 14, 1-47. <doi:10.18637/jss.v095.i14>

NB: You can also add a .bib file with all the relevant references you want to use (in bibtex format) in the folder /inst/REFERENCES.bib and then in your documentation

#' @references
#' \insertRef{label-for-the-reference}{name-of-the-package}

for instance…

#' @references
#' \insertRef{Baio2012}{BCEA}

Doing the package

  • When you’re done with all the functions, you can create the full documentation, using devtools and roxygen
#' Run `devtools::document` to run `roxygen` in the background and generate
#' the relevant documentation. All the files will be stored in `.Rd` format
#' under the `man` folder in your package directory
devtools::document()
  • Can be very helpful if you have lots of published work that you’ll want to refer to for many of the functions…

Doing the package

  • When you’re done with all the functions, you can create the full documentation, using devtools and roxygen
#' Run `devtools::document` to run `roxygen` in the background and generate
#' the relevant documentation. All the files will be stored in `.Rd` format
#' under the `man` folder in your package directory
devtools::document()
  • Can be very helpful if you have lots of published work that you’ll want to refer to for many of the functions…

  • Can also do other things to the package…
#' If there are examples embedded in the `roxygen` code for documentation, then
#' can automatically run them
devtools::run_examples()

#' "Check" the package as if it were to be submitted to CRAN
devtools::check()

#' Installs the package **locally*
devtools::install()

Doing the package

  • When you’re done with all the functions, you can create the full documentation, using devtools and roxygen
#' Run `devtools::document` to run `roxygen` in the background and generate
#' the relevant documentation. All the files will be stored in `.Rd` format
#' under the `man` folder in your package directory
devtools::document()
  • Can be very helpful if you have lots of published work that you’ll want to refer to for many of the functions…

  • Can also do other things to the package…
#' If there are examples embedded in the `roxygen` code for documentation, then
#' can automatically run them
devtools::run_examples()

#' "Check" the package as if it were to be submitted to CRAN
devtools::check()

#' Installs the package **locally*
devtools::install()

Doing the package

  • When you’re done with all the functions, you can create the full documentation, using devtools and roxygen
#' Run `devtools::document` to run `roxygen` in the background and generate
#' the relevant documentation. All the files will be stored in `.Rd` format
#' under the `man` folder in your package directory
devtools::document()
  • Can be very helpful if you have lots of published work that you’ll want to refer to for many of the functions…

  • Can also do other things to the package…
#' If there are examples embedded in the `roxygen` code for documentation, then
#' can automatically run them
devtools::run_examples()

#' "Check" the package as if it were to be submitted to CRAN
devtools::check()

#' Installs the package **locally*
devtools::install()

Package management

  • One of the most complex/confusing things working with R and packages is to deal with dependencies
    • Your package may be interconnected to others
psa_bootstrap <- function(x,num_reps) {
  # Here compute bootstraps estimates
  library(boot)
  sims=boot(data=x,statistic=mean,R=num_reps,...)
  ...
}

  • So, in this case, you need to make sure that boot is installed on your computer (and in general the user’s computer)!

Warning

  • It’s actually a very bad idea to use library(xxx) within a package — you would fail syntax controls when submitting to CRAN (more on this later!)
  • Use :: instead
    • Also better as you will be forced to remember what package’s functions you are using…
  • NB: Can use devtools::install_dependencies to try and automate the installation of all the required packages to run yours…

Controlling dependencies

Installing a package

  • The “standard” way is to go through the Comprehensive R Archive Network
    • This is the general repository to which you submit your package and from which packages can be downloaded
  • The process of submission is often a bit long and sometimes painful
    • Again, it used to be sooooo much worse — devtools et al have made the background preparation much easier
    • Checks are not about the actual quality of the package — more on the syntactic structure
    • It’s probable that there will be some back and forth — your package is not accepted on CRAN first time around and you need to fix several quirks…
  • When you’ve made it onto CRAN, from an R session, can install the package using the simple command
install.packages("name-of-the-package")

Installing a package

There are other ways of sharing your package and making it available with others

Local/zip file

  • Perhaps a “thing of the past”…
  • Compile your package into a .tar.gz (or, only for Windows, .zip) file that you can share with whoever
devtools::build(...)
  • Then, they can install “from source”
install.packages("name-of-the-package.tar.gz", repos=NULL, type='source')

Installing a package

There are other ways of sharing your package and making it available with others

Using GitHub

  • The “modern” way
  • Create a GitHub repository from the folder where you’re developing your package
  • Keep it updated (and ideally, well documented!)
    • Can let people “clone” your repo and create their own version of the package
    • Suggest/request changes/fixes (through “Pull Requests”, PRs)
    • Have a much quicker process of updating without the hold ups of CRAN
  • Users can install from “remote”
remotes::install.github(...)

Speaking of… version controlling

  • Over and above how effective this can be to develop a package, version control has become an increasingly important and helpful feature of computer programming
    • Very good for collaboration…
    • … Or even just for “solo” projects!
  • Avoid the infamous package_version_2_2024_04_22_final_new_GB.R file diaspora…
  1. Make your R package folder a .git repository
  2. Get your head around the git syntax and philosophy (some entry barrier…)
    • Basically just keep one version of the current files, but track the development history
    • Can always revert to previous versions if needs be
    • Collaborators can “fork” and have their own version, work on “branches” of the project, etc…
  3. And when you do, you really can’t do without!… 😉

Speaking of… version controlling

Can you please fix the package?…

  • Having the package on GitHub allows you to respond very quickly to requests for changes/bug alerts etc
    • See for instance https://github.com/giabaio/survHE/issues/45
    • Can discuss the issue and then (if you actually have time on the day… 😉) fix the problem, make changes and tell your users to install the new “development” version1
remotes::install_github("giabaio/survHE",ref="devel")

  • It is a good idea to use a “development” version of the repo — basically main is aligned with CRAN but devel is one (or several…) step(s) ahead
  • When you’re happy with a bunch of substantial changes, move devel to main and re-submit to CRAN

CRAN on GitHub

  • Perhaps a little known fact, but all packages that are officially stored on CRAN have their own official GitHub repo: https://github.com/cran/
    • You can fork a package and create your own version
    • Then you can use it for your own purposes…
    • … Or push the changes you make to the main package, so that they can be implemented in the “official” version!
  • May be good when a package is very helpful for your work, but it just doesn’t have that key feature you’d like…
  • Example: R2OpenBUGS vs R2jags
mbugs=R2OpenBUGS::bugs(...)
mjags=R2jags::jags(...)

names(mbugs)
 [1] "n.chains"        "n.iter"          "n.burnin"        "n.thin"         
 [5] "n.keep"          "n.sims"          "sims.array"      "sims.list"      
 [9] "sims.matrix"     "summary"         "mean"            "sd"             
[13] "median"          "root.short"      "long.short"      "dimension.short"
[17] "indexes.short"   "last.values"     "isDIC"           "DICbyR"         
[21] "pD"              "DIC"             "model.file"     
names(mjags)
[1] "model"              "BUGSoutput"         "parameters.to.save"
[4] "model.file"         "n.iter"             "DIC"               

Full documentation

Full documentation

  • You can write a .Rmd/.qmd file with the code/text of the vignette and store it in the ./vignettes folder in your package
  • Using pkgdown this is parsed and suitable html vignettes are created for your users!
---
title: "Risk Aversion Analysis"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Risk Aversion Analysis}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r}
#| include: false
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width = 6
)
```

```{r}
#| label: setup-example
#| results: 'hide'
#| message: false 
#| warning: false 
#| echo: false
library(BCEA)
library(dplyr)
library(reshape2)
library(ggplot2)
library(purrr)
```

Set-up analysis using smoking cessation data set.

```{r}
data(Smoking)

treats <- c("No intervention", "Self-help", "Individual counselling", "Group counselling")
bcea_smoke <- bcea(eff, cost, ref = 4, interventions = treats, Kmax = 500)
```

Run the risk aversion analysis straight away with both the base R and ggplot2 versions of plots.

```{r}
r <- c(0, 0.005, 0.020, 0.035)
CEriskav(bcea_smoke) <- r

plot(bcea_smoke)
plot(bcea_smoke, graph = "ggplot")
```

Notice that the first value is asymptotically zero but the function handles that for us.
Previously, you had to use something like 1e-10 which was a bit awkward.

Now we modify the comparison group so that it doesn't contain 2 ("self-help") anymore.
We still keep the same first comparison though "no intervention".

...

Full documentation

The R-HTA consortium