R trends in 2015 (based on cranlogs)

What are the current tRends? The image is CC from coco + kelly.

It is always fun to look back and reflect on the past year. Inspired by Christoph Safferling’s post on the top packages published in 2015, I decided to have my own go at the top R trends of 2015. Unlike Safferling’s post, I’ll also (1) look at packages from previous years that have hit the big league, (2) see which top R coders we have in the community, and (3) round up with my own 2015 R experience.

Everything in this post is based on the CRANberries reports. To harvest the information I’ve borrowed shamelessly from Safferling’s post, with some modifications. He used the number of downloads as a proxy for the package release date, while I decided to use the actual release date; when that wasn’t available, I scraped it off the CRAN servers. The script also retrieves the package author(s) and description (see the code below for details).

library(rvest)
library(dplyr)
# devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(magrittr)
library(lubridate)

# Extract a named field (e.g. Description) from the raw
# CRANberries text of a package entry
getCranberriesElmnt <- function(txt, elmnt_name){
  desc <- grep(sprintf("^%s:", elmnt_name), txt)
  if (length(desc) == 1){
    txt <- txt[desc:length(txt)]
    end <- grep("^[A-Za-z/@]{2,}:", txt[-1])
    if (length(end) == 0)
      end <- length(txt)
    else
      end <- end[1]
    
    desc <-
      txt[1:end] %>% 
      gsub(sprintf("^%s: (.+)", elmnt_name),
           "\\1", .) %>% 
      paste(collapse = " ") %>% 
      gsub("[ ]{2,}", " ", .) %>% 
      gsub(" , ", ", ", .)
  }else if (length(desc) == 0){
    desc <- paste("No", tolower(elmnt_name))
  }else{
    stop("Could not find ", elmnt_name, " in text: \n",
         paste(txt, collapse = "\n"))
  }
  return(desc)
}
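
A quick sanity check of the helper on a made-up CRANberries-style snippet (the text below is fabricated for illustration):

sample_txt <- c("Description: A dummy package",
                "that does nothing",
                "Author: Jane Doe")
getCranberriesElmnt(sample_txt, "Description")
# [1] "A dummy package that does nothing"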

# CRANberries text is UTF-8; convert to cp1252 when running on Windows
convertCharset <- function(txt){
  if (grepl("Windows", Sys.info()["sysname"]))
    txt <- iconv(txt, from = "UTF-8", to = "cp1252")
  return(txt)
}

# Retrieve the package author(s), falling back first to the
# maintainer field and then to the CRAN package page
getAuthor <- function(txt, package){
  author <- getCranberriesElmnt(txt, "Author")
  if (grepl("No author|See AUTHORS file", author)){
    author <- getCranberriesElmnt(txt, "Maintainer")
  }
  
  if (grepl("(No m|M)aintainer|(No a|A)uthor|^See AUTHORS file", author) || 
      is.null(author) ||
      nchar(author)  <= 2){
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    author <- cran_txt %>% 
      html_nodes("tr") %>% 
      html_text %>% 
      convertCharset %>% 
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
      .[grep("^Author", .)] %>% 
      gsub(".*\n", "", .)
    
    # If not found then the package has probably been
    # removed from the repository
    if (length(author) != 1)
      author <- "No author"
  }
  
  # Remove stuff such as:
  # [cre, auth]
  # (worked on the...)
  # 
  # "John Doe"
  author %<>% 
    gsub("^Author: (.+)", 
         "\\1", .) %>% 
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>% 
    gsub("\\([^)]+\\)", " ", .) %>% 
    gsub("([ ]*<[^>]+>)", " ", .) %>% 
    gsub("[ ]*\\[[^]]{3,}\\][ ]*", " ", .) %>% 
    gsub("[ ]{2,}", " ", .) %>% 
    gsub("(^[ '\"]+|[ '\"]+$)", "", .) %>% 
    gsub(" , ", ", ", .)
  return(author)
}

# Retrieve the publication date, falling back to the CRAN
# package page and, for re-submitted packages, the CRAN archive
getDate <- function(txt, package){
  date <- 
    grep("^Date/Publication", txt)
  if (length(date) == 1){
    date <- txt[date] %>% 
      gsub("Date/Publication: ([0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2}).*",
           "\\1", .)
  }else{
    cran_txt <- read_html(sprintf("http://cran.r-project.org/web/packages/%s/index.html",
                                  package))
    date <- 
      cran_txt %>% 
      html_nodes("tr") %>% 
      html_text %>% 
      convertCharset %>% 
      gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
      .[grep("^Published", .)] %>% 
      gsub(".*\n", "", .)
    
    
    # The main page doesn't contain the original date if
    # newer versions have been submitted; we therefore need
    # to check the first entry in the archive
    if(cran_txt %>% 
       html_nodes("tr") %>% 
       html_text %>% 
       gsub("(^[ \t\n]+|[ \t\n]+$)", "", .) %>% 
       grepl("^Old.{1,4}sources", .) %>% 
       any){
      archive_txt <- read_html(sprintf("http://cran.r-project.org/src/contrib/Archive/%s/",
                                       package))
      pkg_date <- 
        archive_txt %>% 
        html_nodes("tr") %>% 
        lapply(function(x) {
          nodes <- html_nodes(x, "td")
          if (length(nodes) == 5){
            return(nodes[3] %>% 
                     html_text %>% 
                     as.Date(format = "%d-%b-%Y"))
          }
        }) %>% 
        .[sapply(., length) > 0] %>% 
        .[!sapply(., is.na)] %>% 
        head(1)
      
      if (length(pkg_date) == 1)
        date <- pkg_date[[1]]
    }
  }
  date <- tryCatch({
    as.Date(date)
  }, error = function(e){
    "Date missing"
  })
  return(date)
}

# Harvest all new packages announced on CRANberries for a given
# year and retrieve their download statistics via cranlogs
getNewPkgStats <- function(published_in){
  # The cluster is only used for the cranlogs requests; we can
  # therefore have more workers than actual cores, as the task
  # isn't processor intensive while there is considerable wait
  # for each http-request
  cl <- create_cluster(parallel::detectCores() * 4)
  parallel::clusterEvalQ(cl, {
    library(cranlogs)
  })
  set_default_cluster(cl)
  on.exit(stop_cluster())
  
  berries <- read_html(paste0("http://dirk.eddelbuettel.com/cranberries/", published_in, "/"))
  pkgs <- 
    # Select the divs of the package class
    html_nodes(berries, ".package") %>% 
    # Extract the text
    html_text %>% 
    # Split the lines
    strsplit("[\n]+") %>% 
    # Now clean the lines
    lapply(.,
           function(pkg_txt) {
             pkg_txt[sapply(pkg_txt, function(x) { nchar(gsub("^[ \t]+", "", x)) > 0}, 
                            USE.NAMES = FALSE)] %>% 
               gsub("^[ \t]+", "", .) 
           })
  
  # Now we select the new packages
  new_packages <- 
    pkgs %>% 
    # The first line is key as it contains the text "New package"
    sapply(., function(x) x[1], USE.NAMES = FALSE) %>% 
    grep("^New package", .) %>% 
    pkgs[.] %>% 
    # Now we extract the package name and the date that it was published
    # and merge everything into one table
    lapply(function(txt){
      txt <- convertCharset(txt)
      ret <- data.frame(
        name = gsub("^New package ([^ ]+) with initial .*", 
                     "\\1", txt[1]),
        stringsAsFactors = FALSE
      )
      
      ret$desc <- getCranberriesElmnt(txt, "Description")
      ret$author <- getAuthor(txt, ret$name)
      ret$date <- getDate(txt, ret$name)
      
      return(ret)
    }) %>% 
    rbind_all %>% 
    # Get the download data in parallel
    partition(name) %>% 
    do({
      down <- cran_downloads(.$name[1], 
                             from = max(as.Date("2015-01-01"), .$date[1]), 
                             to = "2015-12-31")$count 
      cbind(.[1,],
            data.frame(sum = sum(down), 
                       avg = mean(down))
      )
    }) %>% 
    collect %>% 
    ungroup %>% 
    arrange(desc(avg))
  
  return(new_packages)
}

pkg_list <- 
  lapply(2010:2015,
         getNewPkgStats)

pkgs <- 
  rbind_all(pkg_list) %>% 
  mutate(time = as.numeric(as.Date("2016-01-01") - date),
         year = format(date, "%Y"))

Downloads and time on CRAN

The longer a package has been on CRAN, the more it gets downloaded. We can illustrate this using simple linear regression; slightly surprisingly, the relationship behaves mostly linearly:

pkgs %<>% 
  mutate(time_yrs = time/365.25)
fit <- lm(avg ~ time_yrs, data = pkgs)

# Test for non-linearity
library(splines)
anova(fit,
      update(fit, .~.-time_yrs+ns(time_yrs, 2)))
Analysis of Variance Table

Model 1: avg ~ time_yrs
Model 2: avg ~ ns(time_yrs, 2)
  Res.Df       RSS Df Sum of Sq      F Pr(>F)
1   7348 189661922                           
2   7347 189656567  1    5355.1 0.2075 0.6488

The average number of downloads thus increases by about 5 downloads per year on CRAN. It can easily be argued that the average number of downloads isn't that interesting, since the data is skewed; we can therefore also look at the upper quantiles using quantile regression:

library(quantreg)
library(htmlTable)
lapply(c(.5, .75, .95, .99),
       function(tau){
         rq_fit <- rq(avg ~ time_yrs, data = pkgs, tau = tau)
         rq_sum <- summary(rq_fit)
         c(Estimate = txtRound(rq_sum$coefficients[2, 1], 1), 
           `95 % CI` = txtRound(rq_sum$coefficients[2, 1] + 
                                        c(1,-1) * rq_sum$coefficients[2, 2], 1) %>% 
             paste(collapse = " to "))
       }) %>% 
  do.call(rbind, .) %>% 
  htmlTable(rnames = c("Median",
                       "Upper quartile",
                       "Top 5%",
                       "Top 1%"))
                 Estimate   95 % CI
Median              0.6     0.6 to 0.6
Upper quartile      1.2     1.2 to 1.1
Top 5%              9.7     11.9 to 7.6
Top 1%            182.5     228.2 to 136.9

The above table conveys a slightly more interesting picture. Most packages don't get that much attention, while the top 1% truly reach the masses.
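
To see the skew directly, one can look at the raw quantiles of the average downloads (a quick sketch using the pkgs data frame created above):

# Raw quantiles of average downloads per day, ignoring time on CRAN
quantile(pkgs$avg, probs = c(.5, .75, .95, .99))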

Top downloaded packages

In order to investigate what packages R users have been using during 2015, I've looked at all new packages since the turn of the decade. Since each year of CRAN presence increases the download rates, I've split the table by package release year. The results are available for browsing below (yes, it is the brand new interactive htmlTable that allows you to collapse cells; note that it may not work if you are reading this on R-bloggers, where the link can be lost under certain circumstances).

Name | Author | Total downloads | Downloads/day | Description

Top 10 packages published in 2015
xml2 | Hadley Wickham, Jeroen Ooms, RStudio, R Foundation | 348,222 | 1635 | Work with XML files ...
rversions | Gabor Csardi | 386,996 | 1524 | Query the main R SVN...
git2r | Stefan Widgren | 411,709 | 1303 | Interface to the lib...
praise | Gabor Csardi, Sindre Sorhus | 96,187 | 673 | Build friendly R pac...
readxl | David Hoerl | 99,386 | 379 | Import excel files i...
readr | Hadley Wickham, Romain Francois, R Core Team, RStudio | 90,022 | 337 | Read flat/tabular te...
DiagrammeR | Richard Iannone | 84,259 | 236 | Create diagrams and ...
visNetwork | Almende B.V. (vis.js library in htmlwidgets/lib, | 41,185 | 233 | Provides an R interf...
plotly | Carson Sievert, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, Pedro Despouy | 9,745 | 217 | Easily translate ggp...
DT | Yihui Xie, Joe Cheng, jQuery contributors, SpryMedia Limited, Brian Reavis, Leon Gersen, Bartek Szopka, RStudio Inc | 24,806 | 120 | Data objects in R ca...

Top 10 packages published in 2014
stringi | Marek Gagolewski and Bartek Tartanus; IBM and other contributors; Unicode, Inc. | 1,316,900 | 3608 | stringi allows for v...
magrittr | Stefan Milton Bache and Hadley Wickham | 1,245,662 | 3413 | Provides a mechanism...
mime | Yihui Xie | 1,038,591 | 2845 | This package guesses...
R6 | Winston Chang | 920,147 | 2521 | The R6 package allow...
dplyr | Hadley Wickham, Romain Francois | 778,311 | 2132 | A fast, consistent t...
manipulate | JJ Allaire, RStudio | 626,191 | 1716 | Interactive plotting...
htmltools | RStudio, Inc. | 619,171 | 1696 | Tools for HTML gener...
curl | Jeroen Ooms | 599,704 | 1643 | The curl() function ...
lazyeval | Hadley Wickham, RStudio | 572,546 | 1569 | A disciplined approa...
rstudioapi | RStudio | 515,665 | 1413 | This package provide...

Top 10 packages published in 2013
jsonlite | Jeroen Ooms, Duncan Temple Lang | 906,421 | 2483 | This package is a fo...
BH | John W. Emerson, Michael J. Kane, Dirk Eddelbuettel, JJ Allaire, and Romain Francois | 691,280 | 1894 | Boost provides free ...
highr | Yihui Xie and Yixuan Qiu | 641,052 | 1756 | This package provide...
assertthat | Hadley Wickham | 527,961 | 1446 | assertthat is an ext...
httpuv | RStudio, Inc. | 310,699 | 851 | httpuv provides low-...
NLP | Kurt Hornik | 270,682 | 742 | Basic classes and me...
TH.data | Torsten Hothorn | 242,060 | 663 | Contains data sets u...
NMF | Renaud Gaujoux, Cathal Seoighe | 228,807 | 627 | This package provide...
stringdist | Mark van der Loo | 123,138 | 337 | Implements the Hammi...
SnowballC | Milan Bouchet-Valat | 104,411 | 286 | An R interface to th...

Top 10 packages published in 2012
gtable | Hadley Wickham | 1,091,440 | 2990 | Tools to make it eas...
knitr | Yihui Xie | 792,876 | 2172 | This package provide...
httr | Hadley Wickham | 785,568 | 2152 | Provides useful tool...
markdown | JJ Allaire, Jeffrey Horner, Vicent Marti, and Natacha Porte | 636,888 | 1745 | Markdown is a plain-...
Matrix | Douglas Bates and Martin Maechler | 470,468 | 1289 | Classes and methods ...
shiny | RStudio, Inc. | 427,995 | 1173 | Shiny makes it incre...
lattice | Deepayan Sarkar | 414,716 | 1136 | Lattice is a powerfu...
pkgmaker | Renaud Gaujoux | 225,796 | 619 | This package provide...
rngtools | Renaud Gaujoux | 225,125 | 617 | This package contain...
base64enc | Simon Urbanek | 223,120 | 611 | This package provide...

Top 10 packages published in 2011
scales | Hadley Wickham | 1,305,000 | 3575 | Scales map data to a...
devtools | Hadley Wickham | 738,724 | 2024 | Collection of packag...
RcppEigen | Douglas Bates, Romain Francois and Dirk Eddelbuettel | 634,224 | 1738 | R and Eigen integrat...
fpp | Rob J Hyndman | 583,505 | 1599 | All data sets requir...
nloptr | Jelmer Ypma | 583,230 | 1598 | nloptr is an R inter...
pbkrtest | Ulrich Halekoh Søren Højsgaard | 536,409 | 1470 | Test in linear mixed...
roxygen2 | Hadley Wickham, Peter Danenberg, Manuel Eugster | 478,765 | 1312 | A Doxygen-like in-so...
whisker | Edwin de Jonge | 413,068 | 1132 | logicless templating...
doParallel | Revolution Analytics | 299,717 | 821 | Provides a parallel ...
abind | Tony Plate and Richard Heiberger | 255,151 | 699 | Combine multi-dimens...

Top 10 packages published in 2010
reshape2 | Hadley Wickham | 1,395,099 | 3822 | Reshape lets you fle...
labeling | Justin Talbot | 1,104,986 | 3027 | Provides a range of ...
evaluate | Hadley Wickham | 862,082 | 2362 | Parsing and evaluati...
formatR | Yihui Xie | 640,386 | 1754 | This package provide...
minqa | Katharine M. Mullen, John C. Nash, Ravi Varadhan | 600,527 | 1645 | Derivative-free opti...
gridExtra | Baptiste Auguie | 581,140 | 1592 | misc. functions
memoise | Hadley Wickham | 552,383 | 1513 | Cache the results of...
RJSONIO | Duncan Temple Lang | 414,373 | 1135 | This is a package th...
RcppArmadillo | Romain Francois and Dirk Eddelbuettel | 410,368 | 1124 | R and Armadillo inte...
xlsx | Adrian A. Dragulescu | 401,991 | 1101 | Provide R functions ...


Just as Safferling noted, there is a dominance of technical packages. This is hardly surprising, since the majority of any analysis is data munging. Among these technical packages quite a few are used for developing other packages, e.g. roxygen2, pkgmaker, devtools, and more.

R-star authors

Just for fun, I decided to look at who has the most downloads. By splitting multi-author packages between their authors, and splitting the downloads accordingly, we find that the top R coders of 2015 were:

top_coders <- list(
  "2015" = 
    pkgs %>% 
    filter(format(date, "%Y") == 2015) %>% 
    partition(author) %>% 
    do({
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) > 1){
        # If there are multiple authors the statistic is split among
        # them, but with an added 20% for the extra collaboration
        # effort that a multi-author environment calls for
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>% 
    collect() %>% 
    group_by(author) %>% 
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>% 
    select(author, download_ave, no_packages, packages) %>% 
    collect() %>% 
    arrange(desc(download_ave)) %>% 
    head(10),
  "all" =
    pkgs %>% 
    partition(author) %>% 
    do({
      if (grepl("Jeroen Ooms", .$author))
        browser()
      authors <- strsplit(.$author, "[ ]*([,;]| and )[ ]*")[[1]]
      authors <- authors[!grepl("^[ ]*(Inc|PhD|Dr|Lab).*[ ]*$", authors)]
      if (length(authors) >= 1){
        # If multiple authors the statistic is split among
        # them but with an added 20% for the extra collaboration
        # effort that a multi-author envorionment calls for
        .$sum <- round(.$sum/length(authors)*1.2)
        .$avg <- .$avg/length(authors)*1.2
        ret <- .
        ret$author <- authors[1]
        for (m in authors[-1]){
          tmp <- .
          tmp$author <- m
          ret <- rbind(ret, tmp)
        }
        return(ret)
      }else{
        return(.)
      }
    }) %>% 
    collect() %>% 
    group_by(author) %>% 
    summarise(download_ave = round(sum(avg)),
              no_packages = n(),
              packages = paste(name, collapse = ", ")) %>% 
    select(author, download_ave, no_packages, packages) %>% 
    collect() %>% 
    arrange(desc(download_ave)) %>% 
    head(30))

interactiveTable(
  do.call(rbind, top_coders) %>% 
    mutate(download_ave = txtInt(download_ave)),
  align = "lrr",
  header = c("Coder", "Total ave. downloads per day", "No. of packages", "Packages"),
  tspanner = c("Top coders 2015",
               "Top coders 2010-2015"),
  n.tspanner = sapply(top_coders, nrow),
  minimized.columns = 4, 
  rnames = FALSE, 
  col.rgroup = c("white", "#F0F0FF"))
Coder | Total ave. downloads | No. of packages | Packages

Top coders 2015
Gabor Csardi | 2,312 | 11 | sankey, franc, rvers...
Stefan Widgren | 1,563 | 1 | git2r
RStudio | 781 | 16 | shinydashboard, with...
Hadley Wickham | 695 | 12 | withr, cellranger, c...
Jeroen Ooms | 541 | 10 | rjade, js, sodium, w...
Richard Cotton | 501 | 22 | assertive.base, asse...
R Foundation | 490 | 1 | xml2
David Hoerl | 455 | 1 | readxl
Sindre Sorhus | 409 | 2 | praise, clisymbols
Richard Iannone | 294 | 2 | DiagrammeR, stationa...

Top coders 2010-2015
Hadley Wickham | 32,115 | 55 | swirl, lazyeval, ggp...
Yihui Xie | 9,739 | 18 | DT, Rd2roxygen, high...
RStudio | 9,123 | 25 | shinydashboard, lazy...
Jeroen Ooms | 4,221 | 25 | JJcorr, gdtools, bro...
Justin Talbot | 3,633 | 1 | labeling
Winston Chang | 3,531 | 17 | shinydashboard, font...
Gabor Csardi | 3,437 | 26 | praise, clisymbols, ...
Romain Francois | 2,934 | 20 | int64, LSD, RcppExam...
Duncan Temple Lang | 2,854 | 6 | RMendeley, jsonlite,...
Adrian A. Dragulescu | 2,456 | 2 | xlsx, xlsxjars
JJ Allaire | 2,453 | 7 | manipulate, htmlwidg...
Simon Urbanek | 2,369 | 15 | png, fastmatch, jpeg...
Dirk Eddelbuettel | 2,094 | 33 | Rblpapi, RcppSMC, RA...
Stefan Milton Bache | 2,069 | 3 | import, blatr, magri...
Douglas Bates | 1,966 | 5 | PKPDmodels, RcppEige...
Renaud Gaujoux | 1,962 | 6 | NMF, doRNG, pkgmaker...
Jelmer Ypma | 1,933 | 2 | nloptr, SparseGrid
Rob J Hyndman | 1,933 | 3 | hts, fpp, demography
Baptiste Auguie | 1,924 | 2 | gridExtra, dielectri...
Ulrich Halekoh Søren Højsgaard | 1,764 | 1 | pbkrtest
Martin Maechler | 1,682 | 11 | DescTools, stabledis...
Mirai Solutions GmbH | 1,603 | 3 | XLConnect, XLConnect...
Stefan Widgren | 1,563 | 1 | git2r
Edwin de Jonge | 1,513 | 10 | tabplot, tabplotGTK,...
Kurt Hornik | 1,476 | 12 | movMF, ROI, qrmtools...
Deepayan Sarkar | 1,369 | 4 | qtbase, qtpaint, lat...
Tyler Rinker | 1,203 | 9 | cowsay, wakefield, q...
Yixuan Qiu | 1,131 | 12 | gdtools, svglite, hi...
Revolution Analytics | 1,011 | 4 | doParallel, doSMP, r...
Torsten Hothorn | 948 | 7 | MVA, HSAUR3, TH.data...

It is worth mentioning that two of the top coders are companies: RStudio and Revolution Analytics. While I like the fact that R is free and open source, I doubt that the community would have grown as quickly as it has without these companies. It is also symptomatic of 2015 that companies are taking R into account, and it will be interesting to see what the R Consortium will bring to the community. I think R-hub is incredibly interesting and will hopefully make my life as an R package developer easier.

My own 2015-R-experience

My own personal R experience has been dominated by magrittr and dplyr, as seen in the code above. Like most, I find that magrittr makes things a little easier to read, and unless I have some really large dataset the overhead is small. It does have some downsides related to debugging, but these are negligible.
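
As a minimal toy illustration of the readability gain (my own sketch, not code from the analysis above) - the nested call reads inside-out while the piped version reads left to right:

library(magrittr)

# Nested: read from the innermost call outwards
sqrt(mean(abs(c(-4, 9, -16))))

# Piped: the steps appear in the order they are applied
c(-4, 9, -16) %>%
  abs %>%
  mean %>%
  sqrt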

When I originally tried dplyr I came from the plyr environment and was disappointed by the lack of parallelization, and I found the concepts a little odd while still thinking the plyr way. I had been using sqldf a lot for my data munging and merging, but when I found left_join, inner_join, and the brilliant anti_join I was completely sold. Combined with RStudio, I find the dplyr workflow both more intuitive and more productive than my previous one.
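
For anyone coming from sqldf, a minimal sketch of the three joins on made-up tables (the data frames here are hypothetical):

library(dplyr)

orders <- data.frame(id = 1:4,
                     customer = c("a", "b", "c", "a"),
                     stringsAsFactors = FALSE)
customers <- data.frame(customer = c("a", "b"),
                        country = c("SE", "DK"),
                        stringsAsFactors = FALSE)

left_join(orders, customers, by = "customer")   # all orders, NA where unmatched
inner_join(orders, customers, by = "customer")  # only orders with a known customer
anti_join(orders, customers, by = "customer")   # orders *without* a known customer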

When looking through those packages (including more than just the top 10 shown here) I did find some additional gems that I intend to look into when I have the time:

  • DiagrammeR An interesting new way of producing diagrams. I've used it for Gantt charts, but it allows for much more.
  • checkmate A neat package for checking function arguments.
  • covr An excellent package for testing how much of a package's code is tested.
  • rex A package for making regular expressions easier.
  • openxlsx I wish I didn't have to, but I still get a lot of things in Excel format - perhaps this package solves the Excel-import inferno...
  • R6 The successor to reference classes - after working with the Gmisc::Transition class I appreciate the need for a better system (see the sketch below).
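
As a taste of what R6 offers over reference classes, here is a minimal hypothetical counter class (my own sketch, not from the R6 documentation):

library(R6)

Counter <- R6Class("Counter",
  public = list(
    count = 0,
    add = function(x = 1){
      self$count <- self$count + x
      invisible(self)  # returning the object invisibly enables chaining
    }
  )
)

cnt <- Counter$new()
cnt$add()$add(5)
cnt$count  # 6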

4 thoughts on “R trends in 2015 (based on cranlogs)”

  1. Looking at CRAN download totals can be highly misleading due to dependencies, which are automatically downloaded when installing the “desired package”. As you have the precise date and time of the events, you can detect which packages were installed automatically as dependencies and which were the “desired packages”. While total downloads is still a measure, it is usually a less valuable one for detecting trends, which are driven by “desired package” downloads.

    • I agree completely, although detecting desired installs is challenging at best. E.g. how do you detect that I want both httr and devtools in
      install.packages(c("httr", "devtools")) ?

      • Odd, I was certain that I had already replied to your questions.

        I completely agree with the limitations of the current metric. The post took a little longer than anticipated to compile, and adding fancier adjustments was not within the time frame. If I revisit the subject I'll consider adding some fancier statistics.

        One thought that I've had is to add the dependencies to each package. One could then look at how popular the dependencies are and reduce the downloads/day by that regression estimate. This could be a partial reduction, as the packages can very well be useful on their own. A problem with this is that dependencies change over time, making this even trickier. I'm also not sure that CRAN would agree with me scraping their entire site…
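
        A rough sketch of what counting dependency popularity could look like (using tools::package_dependencies; the pkgs data frame is the one harvested in the post):

        library(tools)

        # Reverse dependencies: how many CRAN packages pull in each package
        db <- available.packages()
        rev_deps <- package_dependencies(pkgs$name, db = db,
                                         which = c("Depends", "Imports"),
                                         reverse = TRUE)
        dep_count <- sapply(rev_deps, length)
        head(sort(dep_count, decreasing = TRUE))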

        The large proportion of packages used for package development indicates that the dependency issue is huge. I would, though, argue that a package like checkmate makes other packages more useful and should therefore get some merit for that. This is probably also symptomatic of the package explosion, and arguably part of the R trend.

        Another thing to remember is that RStudio currently dominates the IDE market. While I am an RStudio fan, the metrics are somewhat impacted by this dominance: their packages most likely get a boost from being part of the RStudio concept, and perhaps less so because of their excellence (although they do produce very high quality packages in my mind). Still, their appearance in the lists indicates that they continue to dominate the IDE market and are arguably also part of the current trend.

        One thing that is completely lacking from the current analysis is GitHub – arguably one of the biggest open-source trends in recent years. I guess we won't reach the truth anytime soon, and this is partly why I added the “(based on cranlogs)” to the title.
