# Too crude to be true?

The key to programming is being lazy; it has actually been called a virtue by some. When I discovered the update() function it blew me away. Within short I had created a monster based upon this tiny function, allowing quick and easy output of regression tables that contain crude and adjusted estimates. In this post I’ll show you how to tame the printCrudeAndAdjusted() function in my Gmisc-package and show a little behind the scenes.

Although not always present I think a regression table that shows both the crude and adjusted estimates conveys important information. The crude is also referred to as un-adjusted and if this sharply drops in the adjusted model we can suspect that there is a strong confounder that steals some of the effect that this variable has.

## A simple walk-through

We will start with basic usage of the function. For this I’m using the colon data set from the survival package. So first we prepare our data by generating a few factors:

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9 10 11  library(survival) data(colon) # Choose th ones wih deah as outcome colon <- subset(colon, etype = 2) # Change into years colon$time <- colon$time/365.25 # Factor these binary variables or the software will interpret them as # numeric even if the regression is correct colon$sex <- factor(colon$sex, labels = c("Female", "Male")) colon$perfor <- factor(colon$perfor, labels = c("No", "Yes")) colon$obstruct <- factor(colon$obstruct, labels = c("No", "Yes"))

If the names are a little to shorthand and we want prettier names it is easy to set the label of the name using the Hmisc label() function:

?View Code RSPLUS
 1 2 3 4 5  library(Hmisc, verbose = FALSE) label(colon$rx) <- "Treatment" label(colon$perfor) <- "Colon perforation" label(colon$age) <- "Age (years)" label(colon$sex) <- "Sex"

The regression model is setup in the regular way. I use the rms-package for a lot of my work but regular lm(), glm() and coxph() should work just as well. If you’re not familiar with the rms-package it likes to have a summary on the data set that is stored in the datadist.

?View Code RSPLUS
 1 2 3 4 5 6  library(rms, verbose = FALSE) dd <- datadist(colon[, c("time", "status", "rx", "sex", "obstruct", "perfor", "age")]) options(datadist = "dd") sn <- Surv(colon$time, colon$status) fit <- cph(sn ~ rx + sex + perfor + age, data = colon)

And now it is time for my monster function. I like having the reference category in the table so I always add the add_reference option:

?View Code RSPLUS
 1 2  library(Gmisc) printCrudeAndAdjustedModel(fit, add_references = TRUE, ctable = TRUE)
HR 2.5 % to 97.5 %   HR 2.5 % to 97.5 %
Treatment
Obs 1 ref   1 ref
Lev 0.98 0.84 to 1.14   0.98 0.84 to 1.14
Lev+5FU 0.64 0.55 to 0.76   0.64 0.55 to 0.76
Sex
Female 1 ref   1 ref
Male 0.97 0.85 to 1.10   0.95 0.83 to 1.08
Colon perforation
No 1 ref   1 ref
Yes 1.30 0.92 to 1.85   1.30 0.91 to 1.85
Age (years) 1.00 0.99 to 1.00   1.00 0.99 to 1.00

In event data like this it is also nice to have some information regarding the variables raw distribution and we can easily combine the regression model with some descriptive data by adding the desc_column option:

?View Code RSPLUS
 1 2 3 4  printCrudeAndAdjustedModel(fit, desc_column=TRUE, add_references=TRUE, ctable=TRUE)
Total Event   HR 2.5 % to 97.5 %   HR 2.5 % to 97.5 %
Treatment
Obs 630 345 (54.76 %)   1 ref   1 ref
Lev 620 333 (53.71 %)   0.98 0.84 to 1.14   0.98 0.84 to 1.14
Lev+5FU 608 242 (39.80 %)   0.64 0.55 to 0.76   0.64 0.55 to 0.76
Sex
Female 890 444 (49.89 %)   1 ref   1 ref
Male 968 476 (49.17 %)   0.97 0.85 to 1.10   0.95 0.83 to 1.08
Colon perforation
No 1804 888 (49.22 %)   1 ref   1 ref
Yes 54 32 (59.26 %)   1.30 0.92 to 1.85   1.30 0.91 to 1.85
Age (years) 59.75 (± 11.95) 59.46 (± 12.35)   1.00 0.99 to 1.00   1.00 0.99 to 1.00

The function does not yet handle functions. Currently it skips these by removing any variable from the formula that matches the regular expression: “^(ns|rcs|bs|pspline)[(]“. In the future I image it could be possible to add something that uses the rms:::contrast() function, although that will probably wait until the need arises.

As I mentioned in the introduction the update function is at the heart of this function. It has the ability to on the fly add, remove, and change datasets and other nice options to your model.

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43  # Now lets keep everything the same and add the obstruct variable update(fit, . ~ . + obstruct) ## ## Cox Proportional Hazards Model ## ## cph(formula = sn ~ rx + sex + perfor + age + obstruct, data = colon) ## ## Model Tests Discrimination ## Indexes ## Obs 1858 LR chi2 44.77 R2 0.024 ## Events 920 d.f. 6 Dxy 0.126 ## Center -0.1992 Pr(> chi2) 0.0000 g 0.245 ## Score chi2 43.75 gr 1.277 ## Pr(> chi2) 0.0000 ## ## Coef S.E. Wald Z Pr(>|Z|) ## rx=Lev -0.0193 0.0769 -0.25 0.8022 ## rx=Lev+5FU -0.4368 0.0840 -5.20 <0.0001 ## sex=Male -0.0440 0.0662 -0.66 0.5066 ## perfor=Yes 0.1992 0.1819 1.10 0.2735 ## age -0.0012 0.0028 -0.44 0.6606 ## obstruct=Yes 0.2120 0.0817 2.60 0.0094   # We can also remove a variable update(fit, . ~ . - age) ## ## Cox Proportional Hazards Model ## ## cph(formula = sn ~ rx + sex + perfor, data = colon) ## ## Model Tests Discrimination ## Indexes ## Obs 1858 LR chi2 37.85 R2 0.020 ## Events 920 d.f. 4 Dxy 0.096 ## Center -0.1712 Pr(> chi2) 0.0000 g 0.214 ## Score chi2 36.43 gr 1.238 ## Pr(> chi2) 0.0000 ## ## Coef S.E. Wald Z Pr(>|Z|) ## rx=Lev -0.0205 0.0769 -0.27 0.7895 ## rx=Lev+5FU -0.4430 0.0839 -5.28 <0.0001 ## sex=Male -0.0522 0.0661 -0.79 0.4297 ## perfor=Yes 0.2674 0.1800 1.49 0.1374

When I do crude adjustments I simply remove all other variables:

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17  update(fit, . ~ rx) ## ## Cox Proportional Hazards Model ## ## cph(formula = sn ~ rx, data = colon) ## ## Model Tests Discrimination ## Indexes ## Obs 1858 LR chi2 35.23 R2 0.019 ## Events 920 d.f. 2 Dxy 0.089 ## Center -0.1512 Pr(> chi2) 0.0000 g 0.194 ## Score chi2 33.63 gr 1.215 ## Pr(> chi2) 0.0000 ## ## Coef S.E. Wald Z Pr(>|Z|) ## rx=Lev -0.0209 0.0768 -0.27 0.7859 ## rx=Lev+5FU -0.4408 0.0839 -5.25 <0.0001

Nice, isn’t it? I find it really useful as I’m comparing AIC/BIC-values, testing interactions etc.

## Where is the monster?

While the update() function is appealing in its simplicity there is much more that needed to be in place for this to work. The basic function needs to figure out the type of regression, loop through the variables, and check for the intercept. It still needs some work, this was one of my first functions that I created in my package so the coding is a little clumpsy – feel free to suggest improvements. You can find the raw code under my GitHub-account.

?View Code RSPLUS
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235  #' This function helps with printing regression models #' #' This function is used for getting the adjusted and unadjusted values #' for a regression model. It takes a full model and walks through each #' variable, removes in the regression all variables except one then #' reruns that variable to get the unadjusted value. This functions not #' intended for direct use, it's better to use \code{\link{printCrudeAndAdjustedModel}} #' that utilizes this function. #' #' This function saves a lot of time creating tables since it compiles a fully #' unadjusted list of all your used covariates. #' #' If the model is an exponential poisson/logit/cox regression model then it automatically #' reports the exp() values instead of the original values #' #' The function skips by default all spline variables since this becomes very complicated #' and there is no simple \deqn{\beta}{beta} to display. For the same reason it skips #' any interaction variables since it's probably better to display these as a contrast table. #' #' @param fit The regression model #' @return Returns a matrix with the columns c('Crude', '95\% CI', 'Adjusted', '95\% CI'). #' The rows are not changed from the original fit. #' #' @seealso \code{\link{printCrudeAndAdjustedModel}} #' #' @example examples/getCrudeAndAdjustedModelData_example.R #' #' @import boot #' #' @author max #' @export getCrudeAndAdjustedModelData <- function(fit) {   if ("Glm" %in% class(fit)) stop("The rms::Glm does not work properly with this function, please use regular glm instead")   # Just a prettifier for the output an alternative could be: # paste(round(x[,1],1), ' (95% CI ', min(round(x[,2:3])), '-', # max(round(x[,2:3])), ')', sep='') get_coef_and_ci <- function(fit, skip_intercept = FALSE) { # Get the coefficients my_coefficients <- coef(fit) coef_names <- names(my_coefficients)   # Confint isn't implemented for the rms package we therefore need to do a # work-around if ("rms" %in% class(fit) && "ols" %nin% class(fit)) { # The summary function has to be invoked to get this to work in the confint # version and the summary.rms doesn't return a profile.glm() compatible # list, it returns a matrix It works though OK for regular ordinary least # square regression if ("lrm" %in% class(fit)) { warning("Using the confint.default, i.e. Wald statistics, and it may not be optimal but profile CI is not implemented for the lrm, an option could be to change to the Glm with the family=binomial") ci <- confint.default(fit) } else if ("Glm" %in% class(fit)) { # The summary.rms is causing issues when profiling, it works fine when # summary.glm is issued, we need to change class for that class(fit) <- class(fit)["rms" != class(fit)] ci <- confint(fit) } else if ("cph" %in% class(fit)) { # I haven't found any large issues with this call ci <- confint(fit) } else { # Untested option try confint for better or worse warning("The function has not been tested with this type of regression object, may not work properly.") ci <- confint(fit) } } else { ci <- confint(fit) }   if (skip_intercept) { intercept <- grep("Intercept", coef_names) if (length(intercept)) { my_coefficients <- my_coefficients[-intercept] ci <- ci[-intercept, ] coef_names <- coef_names[-intercept] } }   # Use the exp() if logit or cox regression if ("coxph" %in% class(fit) || ("glm" %in% class(fit) && fit$family$link %in% c("logit", "log"))) { my_coefficients <- exp(my_coefficients) ci <- exp(ci) }   if (length(my_coefficients) > 1) ret_val <- cbind(my_coefficients, ci) else ret_val <- matrix(c(my_coefficients, ci), nrow = 1)   colnames(ret_val) <- c("", "2.5%", "97.5%") rownames(ret_val) <- coef_names return(ret_val) }   all.terms <- terms(fit) var_names <- attr(all.terms, "term.labels")   # Skip variables consisting of functions such as spline, strata variables regex_for_unwanted_vars <- "^(ns|rcs|bs|pspline)[(]" skip_variables <- grep(regex_for_unwanted_vars, var_names) skip_variables_names <- c() if (length(skip_variables) > 0) { for (vn in skip_variables) { match <- gregexpr("\\w+\\((?\\w+)", var_names[vn], perl = TRUE)[[1]] if (match[1] >= 0) { st <- attr(match, "capture.start")[1, ] skip_variables_names <- append(skip_variables_names, substring(var_names[vn], st, st + attr(match, "capture.length")[1, ] - 1)) } } }   # Skips the interaction terms since they should be displayed in a different # fashion raw_int_terms <- grep("[[:alnum:]]+:[[:alnum:]]+", var_names) if (length(raw_int_terms) > 0) { require("stringr") for (i in raw_int_terms) { for (name in str_split(var_names[i], ":")[[1]]) { skip_variables_names <- append(skip_variables_names, name) skip_variables <- union(skip_variables, grep(name, var_names)) } } }   skip_variables <- union(skip_variables, grep("^strat[a]{0,1}\\(", var_names))   # Get the adjusted variables adjusted <- get_coef_and_ci(fit) # When using splines, rcs in cox regression this shows a little different   # Remove all the splines, rcs etc rn <- rownames(adjusted) remove <- grep("('{1,}|[[][0-9]+[]]|[)][0-9]+)$", rn) for (name in skip_variables_names) { m <- grep(name, rn) if (length(m) > 0) remove <- union(remove, m) } # Remove the unwanted rows if any found if (length(remove) > 0) { adjusted <- adjusted[-remove, ] } unadjusted <- c() if (length(skip_variables) > 0) { variables_to_check <- var_names[-skip_variables] } else { variables_to_check <- var_names } if (length(variables_to_check) == 0) stop("You have no variables that can be displayed as adjusted/unadjusted since they all are part of an interaction, spline or strata") for (variable in variables_to_check) { interaction_variable <- length(grep(":", variable)) > 0 # If it's an interaction variable the interacting parts have to be included if (interaction_variable) { variable <- paste(paste(unlist(strsplit(variable, ":")), sep = "", collapse = " + "), variable, sep = " + ") } # Run the same fit but with only one variable fit_only1 <- update(fit, paste(".~", variable)) if ("boot.coef" %in% names(fit)) { # Redo bootstrap if this was a bootstrapped rms model by rerunning the # bootcov part fit_only1 <- bootcov(fit_only1, B = fit$B, coef.reps = "boot.Coef" %in% names(fit)) }   # Get the coefficients processed with some advanced round part() new_vars <- get_coef_and_ci(fit_only1, skip_intercept = TRUE)   # If interaction then we should only keep the interaction part - the other # variables are always included by default and need therefore to be removed if (interaction_variable) { new_vars <- new_vars[grep("[*:]", rownames(new_vars)), ] }   # Add them to the previous unadjusted <- rbind(unadjusted, new_vars) }   # If regression contains Intercept this is meaningless for the comparison of # adjusted and unadjusted values if (length(grep("Intercept", rownames(adjusted))) > 0) { if ("ols" %in% class(fit)) { # The simple .~1 doesn't work for me with the ols and I've therefore created # this workaround if ("boot.coef" %in% names(fit)) { if ("y" %nin% names(fit)) { warning("If you want the intercept on a bootstrapped ols() model then you need to specify y=TRUE or else you get NA in the unadjusted estimates") new_vars <- rep(NA, times = 3) } else { require("boot") || stop("boot' package not found - it's needed for the unadjusted estimate of the intercept") ci <- boot.ci(boot(fit$y, function(x, i) mean(x[i], na.rm = TRUE), R = fit$B, stype = "i"), type = ifelse("boot.Coef" %in% names(fit), "perc", "norm")) new_vars <- c(mean(fit$y, na.rm = TRUE), ci$percent[4:5]) } } else { new_vars <- c(mean(fit$y, na.rm = TRUE), mean(fit$y) + c(-1, 1) * sd(fit$y)/sqrt(sum(is.na(fit$y) == FALSE)) * qnorm(0.05/2)) }   } else { # Run the same fit but with only one variable fit_only1 <- update(fit, ".~ 1") if ("boot.coef" %in% names(fit)) { # Redo bootstrap if this was a bootstrapped rms model by rerunning the # bootcov part fit_only1 <- bootcov(fit_only1, B = fit\$B, coef.reps = "boot.Coef" %in% names(fit)) }   # Get the coefficients processed with some advanced round part() new_vars <- get_coef_and_ci(fit_only1, skip_intercept = FALSE) }   unadjusted <- rbind(new_vars, unadjusted) rownames(unadjusted)[1] <- "Intercept" }   # If just one variable it's not a proper matrix if (is.null(dim(adjusted))) { both <- matrix(c(unadjusted, adjusted), nrow = 1) } else { both <- cbind(unadjusted, adjusted) } colnames(both) <- c("Crude", "2.5 %", "97.5 %", "Adjusted", "2.5 %", "97.5 %") return(both) }`

I hope you’ll find this as useful as I have.

This entry was posted in R, Tutorial. Bookmark the permalink.

### 6 Responses to Too crude to be true?

1. Shige says:

Excellent work. Very cool.

2. Harold Baize says:

Nice use of a photo of “Crude Awakening” from Burning Man 2007. Are you a burner?

3. Tomas Karpati says:

Wonderful work. The only thing I miss in the tables are p-values or the stars on the side… (or eventually, colouring those rows which are statistically significant)

• Max Gordon says:

Thanks. The main problem with p-values is that there is no pval() function that would correspond to the confint(). Every time I need raw p-values I look at the print/summary functions and it is difficult to generalize. I guess the vcov() could be used but I’m uncomfortable inventing my own pvalue calculator – I have for instance no idea how to calculate p-values for quantile regression.