Fast-track publishing using knitr: stitching it together (part V)

Putting all the pieces together can be challenging both for surgeons and researchers. The image is CC by Zac Peckler
Putting all the pieces together can be challenging both for surgeons and researchers. The image is CC by Zac Peckler

Fast-track publishing using knitr is a short series on how I use knitr to speedup publishing in my research. There has been plenty of feedback and interest for the series, and in this post I would like to provide (1) a brief summary and (2) an example showing how to put all the pieces together.

The series consists out of five posts:

  • First post – an intro motivating knitr in writing your manuscript and a comparison of knitr to Word options.
  • Second post – setting up a .RProfile and using a custom.css file.
  • Third post – getting your plots the way you want.
  • Fourth post – generating tables.
  • Fifth post – summary and example (current post).

Summary

The main idea of fast-track publishing is taking the reproducible research approach one step further by looking how we can combine the ideas of reproducible research with good layout, handling images, table generation, and MS Word-integration. The aim of each is:

  • Layout: if you stick to good layout practices your co-authors and reviewers will most likely have a faster response time.
  • Images: submitting and sharing images should be a no-brainer.
  • Tables: tables contain a lot of information and a lot of layout, having a good-looking standard solution saves you time.
  • MS Word integration: tracking changes and adding comments directly is vital when working on your manuscript. I dream of being able to share my knitr Rmd-files with my co-authors, unfortunately sharing a raw document with code is not an option.

My current way of doing this is by using knitr markdown with a custom.css together with some functions from my Gmisc-package. As some have suggested, interesting alternatives are Pandoc and R2DOCX, although I’ve found tables to be less flexible with those.

Lastly, I currently do not recommend writing your full document in knitr; focus on the data-specifics such as parts of the methods sections and the results section. You will otherwise spend too much time manually changing references and there is currently no simple way to get the rich bibliography types that Zotero, Endnote, and Mendeley provide.

Fast-track example

A knitr document mixes four different elements: plain text, code, tables, and figures. This is why it is called weaving/knitting a document. Below you can see the general idea of the document structure:

Knitr-document structure

To separate code from text, knitr markdown uses chunks; ```{r} indices start of a chunk while ``` indicates the end. To work nicely with RStudio you also need to remember to save your file with a .Rmd file ending, otherwise RStudio doesn’t know that it is a knitr markdown document.

The actual example (sorry, couldn’t get the syntax highlighting to work):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
```{r Data_prep, echo=FALSE, message=FALSE, warning=FALSE}
# Moved this outside the document for easy of reading
# I often have those sections in here
source("Setup_and_munge.R")
```
 
```{r Versions}
info <- sessionInfo()
r_ver <- paste(info$R.version$major, info$R.version$minor, sep=".")
```
 
All analyses were performed using R (ver. `r r_ver`)[R Core Team, 2013] 
and packages rms (ver. `r info$otherPkgs$rms$Version`) [F. Harrell, 2014] 
for analysis, Gmisc for plot and table output (ver. `r info$otherPkgs$Gmisc$Version`), 
and knitr (ver `r info$otherPkgs$knitr$Version`) [Xie, 2013] for reproducible research.
 
Results
=======
 
We found `r nrow(melanoma)` patients with malignant melanoma between the years 
`r paste(range(melanoma$year), collapse=" and ")`. Patients were followed until 
the end of 1977, the median follow-up time was `r sprintf("%.1f", median(melanoma$time_years))` 
years (range `r paste(sprintf("%.1f", range(melanoma$time_years)), collapse=" to ")` years). 
Males were more common than females and had also a higher mortality rate.
 
```{r Table1, results='asis', cache=FALSE}
table_data <- list()
getT1Stat <- function(varname, digits=0){
  getDescriptionStatsBy(melanoma[, varname], melanoma$status, 
                        add_total_col=TRUE,
                        show_all_values=TRUE, 
                        hrzl_prop=TRUE,
                        statistics=FALSE, 
                        html=TRUE, 
                        digits=digits)
}
 
# Get the basic stats
table_data[["Sex"]] <- getT1Stat("sex")
table_data[["Age<sup>&dagger;</sup>"]] <- getT1Stat("age")
table_data[["Ulceration"]] <- getT1Stat("ulcer")
table_data[["Thickness<sup>&Dagger;</sup>"]] <- getT1Stat("thickness", digits=1)
 
# Now merge everything into a matrix
# and create the rgroup & n.rgroup variabels
rgroup <- c()
n.rgroup <- c()
output_data <- NULL
for (varlabel in names(table_data)){
  output_data <- rbind(output_data, table_data[[varlabel]])
  rgroup <- c(rgroup, varlabel)
  n.rgroup <- c(n.rgroup, nrow(table_data[[varlabel]]))
}
 
# Add a column spanner for the death columns
cgroup <- c("", "Death")
n.cgroup <- c(2, 2)
colnames(output_data) <- gsub("[ ]*death", "", colnames(output_data))
 
htmlTable(output_data, align="rrrr",
          rgroup=rgroup, n.rgroup=n.rgroup, 
          rgroupCSSseparator="", 
          cgroup = cgroup,
          n.cgroup = n.cgroup,
          rowlabel="", 
          caption="Baseline characteristics", 
          tfoot=paste0("<sup>&dagger;</sup> Age at the time of surgery.",
                       " <br/><sup>&Dagger;</sup> Tumour thicknes,",
                       " also known as Breslow thickness, measured in mm."), 
          ctable=TRUE)
```
 
Main results
------------
 
```{r C_and_A, results='asis'}
# Setup needed for the rms coxph wrapper
ddist <- datadist(melanoma)
options(datadist = "ddist")
 
# Do the cox regression model 
# for melanoma specific death
msurv <- Surv(melanoma$time_years, melanoma$status=="Melanoma death")
fit <- cph(msurv ~ sex + age + ulcer + thickness, data=melanoma)
 
# Print the model
printCrudeAndAdjustedModel(fit, desc_digits=0,
                           caption="Adjusted and unadjusted estimates for melanoma specific death.",
                           desc_column=TRUE,
                           add_references=TRUE, 
                           ctable=TRUE)
 
pvalues <- 
  1 - pchisq(coef(fit)^2/diag(vcov(fit)), df=1)
```
 
After adjusting for the three variables, age, sex, tumor thickness
 and ulceration, only the latter two remained significant (p-value 
`r pvalueFormatter(pvalues["ulcer=Present"], sig.limit=10^-3)` and
`r pvalueFormatter(pvalues["thickness"], sig.limit=10^-3)`), 
see table `r as.numeric(options("table_counter"))-1` and 
figure `r getNextFigureNo()`.
 
```{r Regression_forestplot, fig.height=3, fig.width=5, fig.cap="A foresplot comparing the regression coefficients."}
# I've adjusted the coefficient for age to be by 
forestplotRegrObj(update(fit, .~.-age+I(age/10)), 
                  order.regexps=c("Female", "age", "ulc", "thi"),
                  box.default.size=.25, xlog=TRUE,
                  new_page=TRUE, clip=c(.5, 6), rowname.fn=function(x){
  if (grepl("Female", x))
    return("Female")
 
  if (grepl("Present", x))
    return("Ulceration")
 
  if (grepl("age", x))
    return("Age/10 years")
 
  return(capitalize(x))
})
```
 
There was no strong indication for non-linearity for any of the 
continuous variables although the impact of thickness did 
seem to lessen above 4 mm, see figure `r getNextFigureNo()`.
 
```{r spline_plot, fig.cap=plotHR_cap}
plotHR_cap = paste0("The adjusted and unadjusted restricted cubic spline",
                    " for tumor thickness. Solid line and confidence interval",
                    " indicate the adjusted line while the dashed is",
                    " the unadjusted line. The grey area at ",
                    " the bottom indicates the density.")
# Generate adjusted and anuadjusted regression models
rcs_fit <- update(fit, .~.-thickness+rcs(thickness, 3))
rcs_fit_ua <- update(fit, .~+rcs(thickness, 3))
 
# Make sure the axes stay at the exact intended points
par(xaxs="i", yaxs="i")
plotHR(list(rcs_fit, rcs_fit_ua), col.dens="#00000033",
       lty.term=c(1, 2),
       col.term=c("blue", "#444444"), 
       col.se = c("#0000FF44", "grey"),
       polygon_ci=c(TRUE, FALSE),
       term="thickness", 
       xlab="Thickness (mm)", 
       ylim=c(.1, 4), xlim=c(min(melanoma$thickness), 4), 
       plot.bty="l", y.ticks=c(.1, .25, .5, 1, 2, 4))
legend(x=.1, y=1.1, legend=c("Adjusted", "Unadjusted"), fill=c("blue", "grey"), bty="n")
```

The external code for the first chunk:

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
##################
# Knitr settings #
##################
 
# Don't set knitr options outside knitr
if ("package:knitr" %in% search()){
  # Set some basic options. You usually do not
  # want your code, messages, warnings etc
  # to show in your actual manuscript
  opts_chunk$set(warning=FALSE, 
                 message=FALSE, 
                 echo=FALSE, 
                 dpi=96,
                 fig.width=4, fig.height=4, # Default figure widths
                 dev="png", dev.args=list(type="cairo"), # The png device
                 # Change to dev="postscript" if you want the EPS-files
                 # for submitting. Also remove the dev.args() as the postscript
                 # doesn't accept the type="cairo" argument.
                 error=FALSE)
 
  # Evaluate the figure caption after the plot
  opts_knit$set(eval.after='fig.cap')
 
  # Avoid including base64_images - this only 
  # works with the .RProfile setup
  options(base64_images = "none")
 
  # Add a figure counter function
  knit_hooks$set(plot = function(x, options) {
    fig_fn = paste0(opts_knit$get("base.url"), 
                    paste(x, collapse = "."))
 
    # Some stuff from the default definition
    fig.cap <- knitr:::.img.cap(options)
 
    # Style and additional options that should be included in the img tag
    style=c("display: block",
            sprintf("margin: %s;",
                    switch(options$fig.align, 
                           left = 'auto auto auto 0', 
                           center = 'auto',
                           right = 'auto 0 auto auto')))
    # Certain arguments may not belong in style, 
    # for instance the width and height are usually
    # outside if the do not have a unit specified
    addon_args = ""
 
    # This is perhaps a little overly complicated prepared 
    # with the loop but it allows for a more out.parameters if necessary
    if (any(grepl("^out.(height|width)", names(options)))){
      on <- names(options)[grep("^out.(height|width)", names(options))]
      for(out_name in on){
        dimName <- substr(out_name, 5, nchar(out_name))
        if (grepl("[0-9]+(em|px|%|pt|pc|in|cm|mm)", out_name))
          style=append(style, paste0(dimName, ": ", options[[out_name]]))
        else if (length(options$out.width) > 0)
          addon_args = paste0(addon_args, dimName, "='", options[[out_name]], "'")
      }
    }
 
    # Add counter if wanted
    fig_number_txt <- ""
    cntr <- getOption("figure_counter", FALSE)
    if (cntr != FALSE){
      if (is.logical(cntr))
        cntr <- 1
      # The figure_counter_str allows for custom 
      # figure text, you may for instance want it in
      # bold: <b>Figure %s:</b>
      # The %s is so that you have the option of setting the
      # counter manually to 1a, 1b, etc if needed
      fig_number_txt <- 
        sprintf(getOption("figure_counter_str", "Figure %s: "), 
                ifelse(getOption("figure_counter_roman", FALSE), 
                       as.character(as.roman(cntr)), as.character(cntr)))
 
      if (is.numeric(cntr))
        options(figure_counter = cntr + 1)
    }
 
    # Put it all together
    paste0("<figure><img src='", fig_fn, "'", 
           " ", addon_args,
           paste0(" style='", paste(style, collapse="; "), "'"),
           ">",
           "<figcaption>", fig_number_txt, fig.cap, "</figcaption></figure>")
  })
 
}
 
# Use the table counter that the htmlTable() provides
options(table_counter = TRUE)
 
# Use the figure counter that we declare below
options(figure_counter = TRUE)
# Use roman letters (I, II, III, etc) for figures
options(figure_counter_roman = TRUE)
 
# Adding the figure number is a little tricky when the format is roman
getNextFigureNo <- function() as.character(as.roman(as.numeric(options("figure_counter"))))
 
#################
# Load_packages #
#################
library(rms) # I use the cox regression from this package
library(boot) # The melanoma data set is used in this exampe
library(Gmisc) # Stuff I find convenient
library(Greg) # You need to get this from my GitHub see http://gforge.se/Gmisc
 
##################
# Munge the data #
##################
 
# Here we go through and setup the variables so that
# they are in the proper format for the actual output
 
# Load the dataset - usually you would use read.csv
# or something similar
data("melanoma")
 
# Set time to years instead of days
melanoma$time_years <- 
  melanoma$time / 365.25
 
# Factor the basic variables that 
# we're interested in
melanoma$status <- 
  factor(melanoma$status, 
         levels=c(2, 1, 3),
         labels=c("Alive", # Reference
                  "Melanoma death", 
                  "Non-melanoma death"))
melanoma$sex <- 
  factor(melanoma$sex,
         labels=c("Male", # Reference
                  "Female"))
 
melanoma$ulcer <- 
  factor(melanoma$ulcer,
         levels=0:1,
         labels=c("Absent", # Reference
                  "Present"))

Together with the previous custom.css and .Rprofile generate this:

Example_document

You can find all the files that you need at the FTP-github page.

I hope you enjoyed the series and that you’ve find it useful. I wish that we would one day have a Word-alternative with track-changes, comments, version-handling etc that would allow true FTP, but until then this is my best alternative. Perhaps the talented people at RStudio can come up with something that fills this void?

flattr this!

This entry was posted in R, Tutorial and tagged , . Bookmark the permalink.

21 Responses to Fast-track publishing using knitr: stitching it together (part V)

  1. Just a quick note: Pandoc allows easy inclusion of bibliographic data from numerous data base formats, including Endnote. You could use various citation styles (from Zotero). So this may not be a deal breaker.

    See http://johnmacfarlane.net/pandoc/README.html#citations.

  2. Oleg Sheremet says:

    A question related to Part II: how to get math formulas displayed normally if using ‘custom.css’ instead of the original Markdown style file? After customizing css according to your example, my formulas look very small and some symbols are missing. Thanks!

    • Max Gordon says:

      I’ve actually never used math formulas in markdown – do they import well into Word? My guess is that if you have lots of math then LaTeX would be the best choice. If you could add some math formulas to my example I can see how to tweak the markdown, it should be rather easy.

  3. Frank says:

    “I wish that we would one day have a Word-alternative with track-changes, comments, version-handling etc that would allow true FTP, but until then this is my best alternative.”

    You might want to try integrating GitHub with the workflow you’ve described above. There’s now a Windows and Mac GUI making it really easy to learn.

    • Max Gordon says:

      I use GitHub for my package and I’ve thought about it but I doubt that it will work in the current format for my co-authors. What I would like to see is layer on top of the Rmd-file that integrates with GitHub allowing for the usual bubble-comments, strike-through text, etc that is applied directly to the HTML-file.

      This should be rather straight-forward to make and I can’t understand why Google-docs don’t offer this alternative. My hope is that RStudio/GitHub will integrate even closer, allowing for a comment/change suggestion layer – I doubt that any of my co-authors will accept anything anything containing code.

  4. Rafik says:

    Hi Max
    Congrats for excellent series of rmarkdown tips!
    I have cloned the example from github, but can not reproduce it. It throws an error for not finding from forest plot chunk “object ‘rn’ not found” and indicating this call: “Calls: knit … withVisible -> eval -> eval -> forestplotRegrObj -> grep”.
    Can you have a look into code?
    Many thanks.

    • Max Gordon says:

      Have you downloaded the latest Gmisc-package version? I hadn’t used the forestplotRegrObj function for a while and needed to fix it prior to running the example. I should probably some bigger changes to it but its currently a low priority.

      • Rafik says:

        I have 0.5.3.5 version. Where can I find dev version? Install_github() returns error.
        Thanks

        • Max Gordon says:

          Strange, devtools::install_github("gforge/Gmisc") works on my computers. Try the repository, I just updated with the latest version – 0.5.8.1.

          The package is still under heavy development and I recommend that you update it frequently. The interface is mostly stable, I mostly add arguments and as long as your using named arguments you should hopefully not run into that many issues (i.e. use argument=value).

  5. Rafik says:

    Hi, Max
    I have another question to ask.
    You have a figure counting function which is fine. But how one can make reference to certain figure inside the text, lets say in discussion part you have to refer to your first figure.
    Thanks.

    • Max Gordon says:

      The function getNextFigureNo() is defined in the external script and returns the next figure number. This number is stored in the options:

      getNextFigureNo < - function() as.character(as.roman(as.numeric(options("figure_counter"))))

      If you want the previous number you can try this function:

      getPrevFigureNo < - function() as.character(as.roman(as.numeric(options("figure_counter"))-1))

      If you are not using the roman letters then just remove the as.roman() section.

  6. Adam Smith says:

    Max,

    This is all very helpful; thank you! Perhaps I’m missing something obvious, but how can I get the htmlTables embedded in the output html file from knitr? When I run it, it opens up a browser for each html table and omits it from the output file. Thanks again.

    • Max Gordon says:

      Hi Adam,

      I think I’ve fixed this in the latest version. This happens with the latest RStudio (>0.98.932), right? The problem is that Yihui changed the knitr so that it runs in a clean environment without the package loaded and the print.htmlTable() function relied on it existing in order to omit the viewer call. In the latest update this problem should be fixed, let me know if it persists.

      You can also use the useViewer option and set it to FALSE, either directly or when calling the print call. The htmlTable() returns a matrix of the class htmlTable that is then passed on to the print.htmlTable() function that uses the useViewer option to decide if it should return the raw output or not.

      Hope this helps.
      Max

      • Adam Smith says:

        Thanks Max. I’m using RStudio 0.98.770 and Gmisc 0.6.5.

        If I do something like:

        myTable <- htmlTable(…)
        print(myTable, useViewer = FALSE)

        the table opens in the browser and shows up in the html output file. I can live with that…

        Thanks again,
        Adam

  7. Melinda Higgins says:

    I would appreciate your help with the following KnitHTML error. For the spline_plot (last) Figure, I keep getting an error when running “Knit HTML” from within R Studio. If, however, I simply move the caption up into the chunk options as follows – then I do not get the error. I’m curious if anyone else has experienced this. I’m not sure why creating a reference variable for the caption is causing the error.

    However, even with my adjustments below to get the HTML to complete processing, both Figures are created WITHOUT captions. Something is not working with the fig.cap= chunk option. Suggestions?

    R Chunk Options with my error fix – the edit below does allow the HTML to be created.

    “`{r spline_plot, fig.cap=”The adjusted and unadjusted restricted cubic spline for tumor thickness. Solid line and confidence interval indicate the adjusted line while the dashed is the unadjusted line. The grey area at the bottom indicates the density.”}

    R Chunk Options with the following errors:

    “`{r spline_plot, fig.cap=plotHR_cap}

    In the “R Markdown – Output” window, I get the following error

    Quitting from lines 112-131 (FTP_example.Rmd)
    Error in eval(expr, envir, enclos) : object ‘plotHR_cap’ not found
    Calls: … process_group.block -> call_block -> eval_lang -> eval -> eval
    Execution halted

    In the “R Markdown – Issues” window the error is:

    Line 112 Error in eval(expr, envir, enclose): object’plotHR_cap’ not found Calls …process_group.block -> call_block -> eval_lang -> eval -> eval Execution halted

    Here is my sessionInfo()

    R version 3.1.0 (2014-04-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] splines grid stats graphics grDevices utils datasets methods
    [9] base

    other attached packages:
    [1] Greg_0.7.0 Gmisc_0.6.6 Hmisc_3.14-4 Formula_1.1-1 survival_2.37-7
    [6] lattice_0.20-29

    loaded via a namespace (and not attached):
    [1] cluster_1.15.2 digest_0.6.4 htmltools_0.2.4 latticeExtra_0.6-26
    [5] RColorBrewer_1.0-5 rmarkdown_0.2.49 rms_4.2-0 sandwich_2.3-0
    [9] sp_1.0-15 SparseM_1.03 stringr_0.6.2 tools_3.1.0
    [13] zoo_1.7-11
    >

    • Max Gordon says:

      This is an error due to the new RStudio handling of markdown and knitr. The without the opts_knit$set(eval.after=”fig.cap”) knitr couldn’t find the figure caption variable. I’ve updated the Github version so that it works with the new rmarkdown, unfortunately there is no good way yet to solve how to exclude the embedded images, I’ve asked Yihui for guidance. The problem can be partly solved through the .RProfile available at the Github repository but it’s not the most elegant solution as you loose some of the new functionality in markdown v2.

  8. Melinda Higgins says:

    If possible, could more details be provided on how to get from RStudio using knitr through the steps to get to a valid DOCX file? I can create an HTML file but not sure what to do next to get everything into a DOCX file: the text, the tables and the figures.

    And thank you very much for these really helpful examples!!

    • Max Gordon says:

      Depending if you have images or not you can try to simply copy-paste everything into Word. If you have images and you want them into Word you may want to right-click on the html-file, open it in LibreOffice and then copy-paste to Word.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>