Too crude to be true?

Harvesting the magical update() function. The image is CC by Waldemar Horwat.
Harvesting the magical update() function. The image is CC by Waldemar Horwat.

The key to programming is being lazy; it has actually been called a virtue by some. When I discovered the update() function it blew me away. Within short I had created a monster based upon this tiny function, allowing quick and easy output of regression tables that contain crude and adjusted estimates. In this post I’ll show you how to tame the printCrudeAndAdjusted() function in my Greg-package and show a little behind the scenes.

Although not always present I think a regression table that shows both the crude and adjusted estimates conveys important information. The crude is also referred to as un-adjusted and if this sharply drops in the adjusted model we can suspect that there is a strong confounder that steals some of the effect that this variable has.

A simple walk-through

We will start with basic usage of the function. For this I’m using the colon data set from the survival package. So first we prepare our data by generating a few factors:

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
library(survival)
data(colon)
# Choose th ones wih deah as outcome
colon <- subset(colon, etype = 2)
# Change into years
colon$time <- colon$time/365.25
# Factor these binary variables or the software will interpret them as
# numeric even if the regression is correct
colon$sex <- factor(colon$sex, labels = c("Female", "Male"))
colon$perfor <- factor(colon$perfor, labels = c("No", "Yes"))
colon$obstruct <- factor(colon$obstruct, labels = c("No", "Yes"))

If the names are a little to shorthand and we want prettier names it is easy to set the label of the name using the Hmisc label() function:

?View Code RSPLUS
1
2
3
4
5
library(Hmisc, verbose = FALSE)
label(colon$rx) <- "Treatment"
label(colon$perfor) <- "Colon perforation"
label(colon$age) <- "Age (years)"
label(colon$sex) <- "Sex"

The regression model is setup in the regular way. I use the rms-package for a lot of my work but regular lm(), glm() and coxph() should work just as well. If you’re not familiar with the rms-package it likes to have a summary on the data set that is stored in the datadist.

?View Code RSPLUS
1
2
3
4
5
6
library(rms, verbose = FALSE)
dd <- datadist(colon[, c("time", "status", "rx", "sex", "obstruct", "perfor", 
    "age")])
options(datadist = "dd")
sn <- Surv(colon$time, colon$status)
fit <- cph(sn ~ rx + sex + perfor + age, data = colon)

And now it is time for my monster function. I like having the reference category in the table so I always add the add_reference option:

?View Code RSPLUS
1
2
library(Greg)
printCrudeAndAdjustedModel(fit, add_references = TRUE, ctable = TRUE)
Variable Crude   Adjusted
  HR 2.5 % to 97.5 %   HR 2.5 % to 97.5 %
Treatment
  Obs 1 ref   1 ref
  Lev 0.98 0.84 to 1.14   0.98 0.84 to 1.14
  Lev+5FU 0.64 0.55 to 0.76   0.64 0.55 to 0.76
Sex
  Female 1 ref   1 ref
  Male 0.97 0.85 to 1.10   0.95 0.83 to 1.08
Colon perforation
  No 1 ref   1 ref
  Yes 1.30 0.92 to 1.85   1.30 0.91 to 1.85
Age (years) 1.00 0.99 to 1.00   1.00 0.99 to 1.00

In event data like this it is also nice to have some information regarding the variables raw distribution and we can easily combine the regression model with some descriptive data by adding the desc_column option:

?View Code RSPLUS
1
2
3
4
printCrudeAndAdjustedModel(fit, 
                           desc_column=TRUE,
                           add_references=TRUE, 
                           ctable=TRUE)
Variable     Crude   Adjusted
  Total Event   HR 2.5 % to 97.5 %   HR 2.5 % to 97.5 %
Treatment
  Obs 630 345 (54.76 %)   1 ref   1 ref
  Lev 620 333 (53.71 %)   0.98 0.84 to 1.14   0.98 0.84 to 1.14
  Lev+5FU 608 242 (39.80 %)   0.64 0.55 to 0.76   0.64 0.55 to 0.76
Sex
  Female 890 444 (49.89 %)   1 ref   1 ref
  Male 968 476 (49.17 %)   0.97 0.85 to 1.10   0.95 0.83 to 1.08
Colon perforation
  No 1804 888 (49.22 %)   1 ref   1 ref
  Yes 54 32 (59.26 %)   1.30 0.92 to 1.85   1.30 0.91 to 1.85
Age (years) 59.75 (± 11.95) 59.46 (± 12.35)   1.00 0.99 to 1.00   1.00 0.99 to 1.00

The function does not yet handle functions. Currently it skips these by removing any variable from the formula that matches the regular expression: “^(ns|rcs|bs|pspline)[(]“. In the future I image it could be possible to add something that uses the rms:::contrast() function, although that will probably wait until the need arises.

What about the update()

As I mentioned in the introduction the update function is at the heart of this function. It has the ability to on the fly add, remove, and change datasets and other nice options to your model.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
# Now lets keep everything the same and add the obstruct variable
update(fit, . ~ . + obstruct)
## 
## Cox Proportional Hazards Model
## 
## cph(formula = sn ~ rx + sex + perfor + age + obstruct, data = colon)
## 
##                      Model Tests       Discrimination    
##                                           Indexes        
## Obs       1858    LR chi2     44.77    R2       0.024    
## Events     920    d.f.            6    Dxy      0.126    
## Center -0.1992    Pr(> chi2) 0.0000    g        0.245    
##                   Score chi2  43.75    gr       1.277    
##                   Pr(> chi2) 0.0000                      
## 
##              Coef    S.E.   Wald Z Pr(>|Z|)
## rx=Lev       -0.0193 0.0769 -0.25  0.8022  
## rx=Lev+5FU   -0.4368 0.0840 -5.20  <0.0001 
## sex=Male     -0.0440 0.0662 -0.66  0.5066  
## perfor=Yes    0.1992 0.1819  1.10  0.2735  
## age          -0.0012 0.0028 -0.44  0.6606  
## obstruct=Yes  0.2120 0.0817  2.60  0.0094
 
# We can also remove a variable
update(fit, . ~ . - age)
## 
## Cox Proportional Hazards Model
## 
## cph(formula = sn ~ rx + sex + perfor, data = colon)
## 
##                      Model Tests       Discrimination    
##                                           Indexes        
## Obs       1858    LR chi2     37.85    R2       0.020    
## Events     920    d.f.            4    Dxy      0.096    
## Center -0.1712    Pr(> chi2) 0.0000    g        0.214    
##                   Score chi2  36.43    gr       1.238    
##                   Pr(> chi2) 0.0000                      
## 
##            Coef    S.E.   Wald Z Pr(>|Z|)
## rx=Lev     -0.0205 0.0769 -0.27  0.7895  
## rx=Lev+5FU -0.4430 0.0839 -5.28  <0.0001 
## sex=Male   -0.0522 0.0661 -0.79  0.4297  
## perfor=Yes  0.2674 0.1800  1.49  0.1374

When I do crude adjustments I simply remove all other variables:

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
update(fit, . ~ rx)
## 
## Cox Proportional Hazards Model
## 
## cph(formula = sn ~ rx, data = colon)
## 
##                      Model Tests       Discrimination    
##                                           Indexes        
## Obs       1858    LR chi2     35.23    R2       0.019    
## Events     920    d.f.            2    Dxy      0.089    
## Center -0.1512    Pr(> chi2) 0.0000    g        0.194    
##                   Score chi2  33.63    gr       1.215    
##                   Pr(> chi2) 0.0000                      
## 
##            Coef    S.E.   Wald Z Pr(>|Z|)
## rx=Lev     -0.0209 0.0768 -0.27  0.7859  
## rx=Lev+5FU -0.4408 0.0839 -5.25  <0.0001

Nice, isn’t it? I find it really useful as I’m comparing AIC/BIC-values, testing interactions etc.

Where is the monster?

While the update() function is appealing in its simplicity there is much more that needed to be in place for this to work. The basic function needs to figure out the type of regression, loop through the variables, and check for the intercept. It still needs some work, this was one of my first functions that I created in my package so the coding is a little clumpsy – feel free to suggest improvements. You can find the raw code under my GitHub-account.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
#' This function helps with printing regression models
#' 
#' This function is used for getting the adjusted and unadjusted values
#' for a regression model. It takes a full model and walks through each
#' variable, removes in the regression all variables except one then
#' reruns that variable to get the unadjusted value. This functions not
#' intended for direct use, it's better to use \code{\link{printCrudeAndAdjustedModel}}
#' that utilizes this function.
#' 
#' This function saves a lot of time creating tables since it compiles a fully
#' unadjusted list of all your used covariates.
#' 
#' If the model is an exponential poisson/logit/cox regression model then it automatically
#' reports the exp() values instead of the original values
#' 
#' The function skips by default all spline variables since this becomes very complicated
#' and there is no simple \deqn{\beta}{beta} to display. For the same reason it skips
#' any interaction variables since it's probably better to display these as a contrast table. 
#' 
#' @param fit The regression model
#' @return Returns a matrix with the columns c('Crude', '95\% CI', 'Adjusted', '95\% CI').
#'   The rows are not changed from the original fit.
#'
#' @seealso \code{\link{printCrudeAndAdjustedModel}}
#' 
#' @example examples/getCrudeAndAdjustedModelData_example.R
#' 
#' @import boot
#' 
#' @author max
#' @export
getCrudeAndAdjustedModelData <- function(fit) {
 
    if ("Glm" %in% class(fit)) 
        stop("The rms::Glm does not work properly with this function, please use regular glm instead")
 
    # Just a prettifier for the output an alternative could be:
    # paste(round(x[,1],1), ' (95% CI ', min(round(x[,2:3])), '-',
    # max(round(x[,2:3])), ')', sep='')
    get_coef_and_ci <- function(fit, skip_intercept = FALSE) {
        # Get the coefficients
        my_coefficients <- coef(fit)
        coef_names <- names(my_coefficients)
 
        # Confint isn't implemented for the rms package we therefore need to do a
        # work-around
        if ("rms" %in% class(fit) && "ols" %nin% class(fit)) {
            # The summary function has to be invoked to get this to work in the confint
            # version and the summary.rms doesn't return a profile.glm() compatible
            # list, it returns a matrix It works though OK for regular ordinary least
            # square regression
            if ("lrm" %in% class(fit)) {
                warning("Using the confint.default, i.e. Wald statistics, and it may not be optimal but profile CI is not implemented for the lrm, an option could be to change to the Glm with the family=binomial")
                ci <- confint.default(fit)
            } else if ("Glm" %in% class(fit)) {
                # The summary.rms is causing issues when profiling, it works fine when
                # summary.glm is issued, we need to change class for that
                class(fit) <- class(fit)["rms" != class(fit)]
                ci <- confint(fit)
            } else if ("cph" %in% class(fit)) {
                # I haven't found any large issues with this call
                ci <- confint(fit)
            } else {
                # Untested option try confint for better or worse
                warning("The function has not been tested with this type of regression object, may not work properly.")
                ci <- confint(fit)
            }
        } else {
            ci <- confint(fit)
        }
 
        if (skip_intercept) {
            intercept <- grep("Intercept", coef_names)
            if (length(intercept)) {
                my_coefficients <- my_coefficients[-intercept]
                ci <- ci[-intercept, ]
                coef_names <- coef_names[-intercept]
            }
        }
 
        # Use the exp() if logit or cox regression
        if ("coxph" %in% class(fit) || ("glm" %in% class(fit) && fit$family$link %in% 
            c("logit", "log"))) {
            my_coefficients <- exp(my_coefficients)
            ci <- exp(ci)
        }
 
        if (length(my_coefficients) > 1) 
            ret_val <- cbind(my_coefficients, ci) else ret_val <- matrix(c(my_coefficients, ci), nrow = 1)
 
        colnames(ret_val) <- c("", "2.5%", "97.5%")
        rownames(ret_val) <- coef_names
        return(ret_val)
    }
 
    all.terms <- terms(fit)
    var_names <- attr(all.terms, "term.labels")
 
    # Skip variables consisting of functions such as spline, strata variables
    regex_for_unwanted_vars <- "^(ns|rcs|bs|pspline)[(]"
    skip_variables <- grep(regex_for_unwanted_vars, var_names)
    skip_variables_names <- c()
    if (length(skip_variables) > 0) {
        for (vn in skip_variables) {
            match <- gregexpr("\\w+\\((?<var_name>\\w+)", var_names[vn], perl = TRUE)[[1]]
            if (match[1] >= 0) {
                st <- attr(match, "capture.start")[1, ]
                skip_variables_names <- append(skip_variables_names, substring(var_names[vn], 
                  st, st + attr(match, "capture.length")[1, ] - 1))
            }
        }
    }
 
    # Skips the interaction terms since they should be displayed in a different
    # fashion
    raw_int_terms <- grep("[[:alnum:]]+:[[:alnum:]]+", var_names)
    if (length(raw_int_terms) > 0) {
        require("stringr")
        for (i in raw_int_terms) {
            for (name in str_split(var_names[i], ":")[[1]]) {
                skip_variables_names <- append(skip_variables_names, name)
                skip_variables <- union(skip_variables, grep(name, var_names))
            }
        }
    }
 
    skip_variables <- union(skip_variables, grep("^strat[a]{0,1}\\(", var_names))
 
    # Get the adjusted variables
    adjusted <- get_coef_and_ci(fit)
    # When using splines, rcs in cox regression this shows a little different
 
    # Remove all the splines, rcs etc
    rn <- rownames(adjusted)
    remove <- grep("('{1,}|[[][0-9]+[]]|[)][0-9]+)$", rn)
    for (name in skip_variables_names) {
        m <- grep(name, rn)
        if (length(m) > 0) 
            remove <- union(remove, m)
    }
 
    # Remove the unwanted rows if any found
    if (length(remove) > 0) {
        adjusted <- adjusted[-remove, ]
    }
 
    unadjusted <- c()
    if (length(skip_variables) > 0) {
        variables_to_check <- var_names[-skip_variables]
    } else {
        variables_to_check <- var_names
    }
    if (length(variables_to_check) == 0) 
        stop("You have no variables that can be displayed as adjusted/unadjusted since they all are part of an interaction, spline or strata")
 
    for (variable in variables_to_check) {
        interaction_variable <- length(grep(":", variable)) > 0
 
        # If it's an interaction variable the interacting parts have to be included
        if (interaction_variable) {
            variable <- paste(paste(unlist(strsplit(variable, ":")), sep = "", 
                collapse = " + "), variable, sep = " + ")
        }
 
        # Run the same fit but with only one variable
        fit_only1 <- update(fit, paste(".~", variable))
        if ("boot.coef" %in% names(fit)) {
            # Redo bootstrap if this was a bootstrapped rms model by rerunning the
            # bootcov part
            fit_only1 <- bootcov(fit_only1, B = fit$B, coef.reps = "boot.Coef" %in% 
                names(fit))
        }
 
        # Get the coefficients processed with some advanced round part()
        new_vars <- get_coef_and_ci(fit_only1, skip_intercept = TRUE)
 
        # If interaction then we should only keep the interaction part - the other
        # variables are always included by default and need therefore to be removed
        if (interaction_variable) {
            new_vars <- new_vars[grep("[*:]", rownames(new_vars)), ]
        }
 
        # Add them to the previous
        unadjusted <- rbind(unadjusted, new_vars)
    }
 
    # If regression contains Intercept this is meaningless for the comparison of
    # adjusted and unadjusted values
    if (length(grep("Intercept", rownames(adjusted))) > 0) {
        if ("ols" %in% class(fit)) {
            # The simple .~1 doesn't work for me with the ols and I've therefore created
            # this workaround
            if ("boot.coef" %in% names(fit)) {
                if ("y" %nin% names(fit)) {
                  warning("If you want the intercept on a bootstrapped ols() model then you need to specify y=TRUE or else you get NA in the unadjusted estimates")
                  new_vars <- rep(NA, times = 3)
                } else {
                  require("boot") || stop("`boot' package not found - it's needed for the unadjusted estimate of the intercept")
                  ci <- boot.ci(boot(fit$y, function(x, i) mean(x[i], na.rm = TRUE), 
                    R = fit$B, stype = "i"), type = ifelse("boot.Coef" %in% 
                    names(fit), "perc", "norm"))
                  new_vars <- c(mean(fit$y, na.rm = TRUE), ci$percent[4:5])
                }
            } else {
                new_vars <- c(mean(fit$y, na.rm = TRUE), mean(fit$y) + c(-1, 
                  1) * sd(fit$y)/sqrt(sum(is.na(fit$y) == FALSE)) * qnorm(0.05/2))
            }
 
        } else {
            # Run the same fit but with only one variable
            fit_only1 <- update(fit, ".~ 1")
            if ("boot.coef" %in% names(fit)) {
                # Redo bootstrap if this was a bootstrapped rms model by rerunning the
                # bootcov part
                fit_only1 <- bootcov(fit_only1, B = fit$B, coef.reps = "boot.Coef" %in% 
                  names(fit))
            }
 
            # Get the coefficients processed with some advanced round part()
            new_vars <- get_coef_and_ci(fit_only1, skip_intercept = FALSE)
        }
 
        unadjusted <- rbind(new_vars, unadjusted)
        rownames(unadjusted)[1] <- "Intercept"
    }
 
    # If just one variable it's not a proper matrix
    if (is.null(dim(adjusted))) {
        both <- matrix(c(unadjusted, adjusted), nrow = 1)
    } else {
        both <- cbind(unadjusted, adjusted)
    }
    colnames(both) <- c("Crude", "2.5 %", "97.5 %", "Adjusted", "2.5 %", "97.5 %")
    return(both)
}

I hope you’ll find this as useful as I have.

flattr this!

This entry was posted in R, Tutorial. Bookmark the permalink.

6 Responses to Too crude to be true?

  1. Shige says:

    Excellent work. Very cool.

  2. Harold Baize says:

    Nice use of a photo of “Crude Awakening” from Burning Man 2007. Are you a burner?

  3. Tomas Karpati says:

    Wonderful work. The only thing I miss in the tables are p-values or the stars on the side… (or eventually, colouring those rows which are statistically significant)

    • Max Gordon says:

      Thanks. The main problem with p-values is that there is no pval() function that would correspond to the confint(). Every time I need raw p-values I look at the print/summary functions and it is difficult to generalize. I guess the vcov() could be used but I’m uncomfortable inventing my own pvalue calculator – I have for instance no idea how to calculate p-values for quantile regression.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>