The key to programming is being lazy; it has actually been called a virtue by some. When I discovered the update() function it blew me away. Within short I had created a monster based upon this tiny function, allowing quick and easy output of regression tables that contain crude and adjusted estimates. In this post I’ll show you how to tame the printCrudeAndAdjusted() function in my Greg-package and show a little behind the scenes.
Although not always present I think a regression table that shows both the crude and adjusted estimates conveys important information. The crude is also referred to as un-adjusted and if this sharply drops in the adjusted model we can suspect that there is a strong confounder that steals some of the effect that this variable has.
A simple walk-through
We will start with basic usage of the function. For this I’m using the colon data set from the survival package. So first we prepare our data by generating a few factors:
library(survival)
data(colon)
# Choose th ones wih deah as outcome
colon <- subset(colon, etype = 2)
# Change into years
colon$time <- colon$time/365.25
# Factor these binary variables or the software will interpret them as
# numeric even if the regression is correct
colon$sex <- factor(colon$sex, labels = c("Female", "Male"))
colon$perfor <- factor(colon$perfor, labels = c("No", "Yes"))
colon$obstruct <- factor(colon$obstruct, labels = c("No", "Yes"))
If the names are a little to shorthand and we want prettier names it is easy to set the label of the name using the Hmisc label() function:
library(Hmisc, verbose = FALSE)
label(colon$rx) <- "Treatment"
label(colon$perfor) <- "Colon perforation"
label(colon$age) <- "Age (years)"
label(colon$sex) <- "Sex"
The regression model is setup in the regular way. I use the rms-package for a lot of my work but regular lm(), glm() and coxph() should work just as well. If you're not familiar with the rms-package it likes to have a summary on the data set that is stored in the datadist.
library(rms, verbose = FALSE)
dd <- datadist(colon[, c("time", "status", "rx", "sex", "obstruct", "perfor",
"age")])
options(datadist = "dd")
sn <- Surv(colon$time, colon$status)
fit <- cph(sn ~ rx + sex + perfor + age, data = colon)
And now it is time for my monster function. I like having the reference category in the table so I always add the add_reference option:
library(Greg)
printCrudeAndAdjustedModel(fit, add_references = TRUE, ctable = TRUE)
Variable | Crude | Adjusted | |||
---|---|---|---|---|---|
HR | 2.5 % to 97.5 % | HR | 2.5 % to 97.5 % | ||
Treatment | |||||
Obs | 1 | ref | 1 | ref | |
Lev | 0.98 | 0.84 to 1.14 | 0.98 | 0.84 to 1.14 | |
Lev+5FU | 0.64 | 0.55 to 0.76 | 0.64 | 0.55 to 0.76 | |
Sex | |||||
Female | 1 | ref | 1 | ref | |
Male | 0.97 | 0.85 to 1.10 | 0.95 | 0.83 to 1.08 | |
Colon perforation | |||||
No | 1 | ref | 1 | ref | |
Yes | 1.30 | 0.92 to 1.85 | 1.30 | 0.91 to 1.85 | |
Age (years) | 1.00 | 0.99 to 1.00 | 1.00 | 0.99 to 1.00 |
In event data like this it is also nice to have some information regarding the variables raw distribution and we can easily combine the regression model with some descriptive data by adding the desc_column option:
printCrudeAndAdjustedModel(fit,
desc_column=TRUE,
add_references=TRUE,
ctable=TRUE)
Variable | Crude | Adjusted | ||||||
---|---|---|---|---|---|---|---|---|
Total | Event | HR | 2.5 % to 97.5 % | HR | 2.5 % to 97.5 % | |||
Treatment | ||||||||
Obs | 630 | 345 (54.76 %) | 1 | ref | 1 | ref | ||
Lev | 620 | 333 (53.71 %) | 0.98 | 0.84 to 1.14 | 0.98 | 0.84 to 1.14 | ||
Lev+5FU | 608 | 242 (39.80 %) | 0.64 | 0.55 to 0.76 | 0.64 | 0.55 to 0.76 | ||
Sex | ||||||||
Female | 890 | 444 (49.89 %) | 1 | ref | 1 | ref | ||
Male | 968 | 476 (49.17 %) | 0.97 | 0.85 to 1.10 | 0.95 | 0.83 to 1.08 | ||
Colon perforation | ||||||||
No | 1804 | 888 (49.22 %) | 1 | ref | 1 | ref | ||
Yes | 54 | 32 (59.26 %) | 1.30 | 0.92 to 1.85 | 1.30 | 0.91 to 1.85 | ||
Age (years) | 59.75 (± 11.95) | 59.46 (± 12.35) | 1.00 | 0.99 to 1.00 | 1.00 | 0.99 to 1.00 |
The function does not yet handle functions. Currently it skips these by removing any variable from the formula that matches the regular expression: "^(ns|rcs|bs|pspline)[(]". In the future I image it could be possible to add something that uses the rms:::contrast() function, although that will probably wait until the need arises.
What about the update()
As I mentioned in the introduction the update function is at the heart of this function. It has the ability to on the fly add, remove, and change datasets and other nice options to your model.
# Now lets keep everything the same and add the obstruct variable
update(fit, . ~ . + obstruct)
##
## Cox Proportional Hazards Model
##
## cph(formula = sn ~ rx + sex + perfor + age + obstruct, data = colon)
##
## Model Tests Discrimination
## Indexes
## Obs 1858 LR chi2 44.77 R2 0.024
## Events 920 d.f. 6 Dxy 0.126
## Center -0.1992 Pr(> chi2) 0.0000 g 0.245
## Score chi2 43.75 gr 1.277
## Pr(> chi2) 0.0000
##
## Coef S.E. Wald Z Pr(>|Z|)
## rx=Lev -0.0193 0.0769 -0.25 0.8022
## rx=Lev+5FU -0.4368 0.0840 -5.20 <0.0001
## sex=Male -0.0440 0.0662 -0.66 0.5066
## perfor=Yes 0.1992 0.1819 1.10 0.2735
## age -0.0012 0.0028 -0.44 0.6606
## obstruct=Yes 0.2120 0.0817 2.60 0.0094
# We can also remove a variable
update(fit, . ~ . - age)
##
## Cox Proportional Hazards Model
##
## cph(formula = sn ~ rx + sex + perfor, data = colon)
##
## Model Tests Discrimination
## Indexes
## Obs 1858 LR chi2 37.85 R2 0.020
## Events 920 d.f. 4 Dxy 0.096
## Center -0.1712 Pr(> chi2) 0.0000 g 0.214
## Score chi2 36.43 gr 1.238
## Pr(> chi2) 0.0000
##
## Coef S.E. Wald Z Pr(>|Z|)
## rx=Lev -0.0205 0.0769 -0.27 0.7895
## rx=Lev+5FU -0.4430 0.0839 -5.28 <0.0001
## sex=Male -0.0522 0.0661 -0.79 0.4297
## perfor=Yes 0.2674 0.1800 1.49 0.1374
When I do crude adjustments I simply remove all other variables:
update(fit, . ~ rx)
##
## Cox Proportional Hazards Model
##
## cph(formula = sn ~ rx, data = colon)
##
## Model Tests Discrimination
## Indexes
## Obs 1858 LR chi2 35.23 R2 0.019
## Events 920 d.f. 2 Dxy 0.089
## Center -0.1512 Pr(> chi2) 0.0000 g 0.194
## Score chi2 33.63 gr 1.215
## Pr(> chi2) 0.0000
##
## Coef S.E. Wald Z Pr(>|Z|)
## rx=Lev -0.0209 0.0768 -0.27 0.7859
## rx=Lev+5FU -0.4408 0.0839 -5.25 <0.0001
Nice, isn't it? I find it really useful as I'm comparing AIC/BIC-values, testing interactions etc.
Where is the monster?
While the update() function is appealing in its simplicity there is much more that needed to be in place for this to work. The basic function needs to figure out the type of regression, loop through the variables, and check for the intercept. It still needs some work, this was one of my first functions that I created in my package so the coding is a little clumpsy - feel free to suggest improvements. You can find the raw code under my GitHub-account.
#' This function helps with printing regression models
#'
#' This function is used for getting the adjusted and unadjusted values
#' for a regression model. It takes a full model and walks through each
#' variable, removes in the regression all variables except one then
#' reruns that variable to get the unadjusted value. This functions not
#' intended for direct use, it's better to use \code{\link{printCrudeAndAdjustedModel}}
#' that utilizes this function.
#'
#' This function saves a lot of time creating tables since it compiles a fully
#' unadjusted list of all your used covariates.
#'
#' If the model is an exponential poisson/logit/cox regression model then it automatically
#' reports the exp() values instead of the original values
#'
#' The function skips by default all spline variables since this becomes very complicated
#' and there is no simple \deqn{\beta}{beta} to display. For the same reason it skips
#' any interaction variables since it's probably better to display these as a contrast table.
#'
#' @param fit The regression model
#' @return Returns a matrix with the columns c('Crude', '95\% CI', 'Adjusted', '95\% CI').
#' The rows are not changed from the original fit.
#'
#' @seealso \code{\link{printCrudeAndAdjustedModel}}
#'
#' @example examples/getCrudeAndAdjustedModelData_example.R
#'
#' @import boot
#'
#' @author max
#' @export
getCrudeAndAdjustedModelData <- function(fit) {
if ("Glm" %in% class(fit))
stop("The rms::Glm does not work properly with this function, please use regular glm instead")
# Just a prettifier for the output an alternative could be:
# paste(round(x[,1],1), ' (95% CI ', min(round(x[,2:3])), '-',
# max(round(x[,2:3])), ')', sep='')
get_coef_and_ci <- function(fit, skip_intercept = FALSE) {
# Get the coefficients
my_coefficients <- coef(fit)
coef_names <- names(my_coefficients)
# Confint isn't implemented for the rms package we therefore need to do a
# work-around
if ("rms" %in% class(fit) && "ols" %nin% class(fit)) {
# The summary function has to be invoked to get this to work in the confint
# version and the summary.rms doesn't return a profile.glm() compatible
# list, it returns a matrix It works though OK for regular ordinary least
# square regression
if ("lrm" %in% class(fit)) {
warning("Using the confint.default, i.e. Wald statistics, and it may not be optimal but profile CI is not implemented for the lrm, an option could be to change to the Glm with the family=binomial")
ci <- confint.default(fit)
} else if ("Glm" %in% class(fit)) {
# The summary.rms is causing issues when profiling, it works fine when
# summary.glm is issued, we need to change class for that
class(fit) <- class(fit)["rms" != class(fit)]
ci <- confint(fit)
} else if ("cph" %in% class(fit)) {
# I haven't found any large issues with this call
ci <- confint(fit)
} else {
# Untested option try confint for better or worse
warning("The function has not been tested with this type of regression object, may not work properly.")
ci <- confint(fit)
}
} else {
ci <- confint(fit)
}
if (skip_intercept) {
intercept <- grep("Intercept", coef_names)
if (length(intercept)) {
my_coefficients <- my_coefficients[-intercept]
ci <- ci[-intercept, ]
coef_names <- coef_names[-intercept]
}
}
# Use the exp() if logit or cox regression
if ("coxph" %in% class(fit) || ("glm" %in% class(fit) && fit$family$link %in%
c("logit", "log"))) {
my_coefficients <- exp(my_coefficients)
ci <- exp(ci)
}
if (length(my_coefficients) > 1)
ret_val <- cbind(my_coefficients, ci) else ret_val <- matrix(c(my_coefficients, ci), nrow = 1)
colnames(ret_val) <- c("", "2.5%", "97.5%")
rownames(ret_val) <- coef_names
return(ret_val)
}
all.terms <- terms(fit)
var_names <- attr(all.terms, "term.labels")
# Skip variables consisting of functions such as spline, strata variables
regex_for_unwanted_vars <- "^(ns|rcs|bs|pspline)[(]"
skip_variables <- grep(regex_for_unwanted_vars, var_names)
skip_variables_names <- c()
if (length(skip_variables) > 0) {
for (vn in skip_variables) {
match <- gregexpr("\\w+\\((?\\w+)", var_names[vn], perl = TRUE)[[1]]
if (match[1] >= 0) {
st <- attr(match, "capture.start")[1, ]
skip_variables_names <- append(skip_variables_names, substring(var_names[vn],
st, st + attr(match, "capture.length")[1, ] - 1))
}
}
}
# Skips the interaction terms since they should be displayed in a different
# fashion
raw_int_terms <- grep("[[:alnum:]]+:[[:alnum:]]+", var_names)
if (length(raw_int_terms) > 0) {
require("stringr")
for (i in raw_int_terms) {
for (name in str_split(var_names[i], ":")[[1]]) {
skip_variables_names <- append(skip_variables_names, name)
skip_variables <- union(skip_variables, grep(name, var_names))
}
}
}
skip_variables <- union(skip_variables, grep("^strat[a]{0,1}\\(", var_names))
# Get the adjusted variables
adjusted <- get_coef_and_ci(fit)
# When using splines, rcs in cox regression this shows a little different
# Remove all the splines, rcs etc
rn <- rownames(adjusted)
remove <- grep("('{1,}|[[][0-9]+[]]|[)][0-9]+)$", rn)
for (name in skip_variables_names) {
m <- grep(name, rn)
if (length(m) > 0)
remove <- union(remove, m)
}
# Remove the unwanted rows if any found
if (length(remove) > 0) {
adjusted <- adjusted[-remove, ]
}
unadjusted <- c()
if (length(skip_variables) > 0) {
variables_to_check <- var_names[-skip_variables]
} else {
variables_to_check <- var_names
}
if (length(variables_to_check) == 0)
stop("You have no variables that can be displayed as adjusted/unadjusted since they all are part of an interaction, spline or strata")
for (variable in variables_to_check) {
interaction_variable <- length(grep(":", variable)) > 0
# If it's an interaction variable the interacting parts have to be included
if (interaction_variable) {
variable <- paste(paste(unlist(strsplit(variable, ":")), sep = "",
collapse = " + "), variable, sep = " + ")
}
# Run the same fit but with only one variable
fit_only1 <- update(fit, paste(".~", variable))
if ("boot.coef" %in% names(fit)) {
# Redo bootstrap if this was a bootstrapped rms model by rerunning the
# bootcov part
fit_only1 <- bootcov(fit_only1, B = fit$B, coef.reps = "boot.Coef" %in%
names(fit))
}
# Get the coefficients processed with some advanced round part()
new_vars <- get_coef_and_ci(fit_only1, skip_intercept = TRUE)
# If interaction then we should only keep the interaction part - the other
# variables are always included by default and need therefore to be removed
if (interaction_variable) {
new_vars <- new_vars[grep("[*:]", rownames(new_vars)), ]
}
# Add them to the previous
unadjusted <- rbind(unadjusted, new_vars)
}
# If regression contains Intercept this is meaningless for the comparison of
# adjusted and unadjusted values
if (length(grep("Intercept", rownames(adjusted))) > 0) {
if ("ols" %in% class(fit)) {
# The simple .~1 doesn't work for me with the ols and I've therefore created
# this workaround
if ("boot.coef" %in% names(fit)) {
if ("y" %nin% names(fit)) {
warning("If you want the intercept on a bootstrapped ols() model then you need to specify y=TRUE or else you get NA in the unadjusted estimates")
new_vars <- rep(NA, times = 3)
} else {
require("boot") || stop("`boot' package not found - it's needed for the unadjusted estimate of the intercept")
ci <- boot.ci(boot(fit$y, function(x, i) mean(x[i], na.rm = TRUE),
R = fit$B, stype = "i"), type = ifelse("boot.Coef" %in%
names(fit), "perc", "norm"))
new_vars <- c(mean(fit$y, na.rm = TRUE), ci$percent[4:5])
}
} else {
new_vars <- c(mean(fit$y, na.rm = TRUE), mean(fit$y) + c(-1,
1) * sd(fit$y)/sqrt(sum(is.na(fit$y) == FALSE)) * qnorm(0.05/2))
}
} else {
# Run the same fit but with only one variable
fit_only1 <- update(fit, ".~ 1")
if ("boot.coef" %in% names(fit)) {
# Redo bootstrap if this was a bootstrapped rms model by rerunning the
# bootcov part
fit_only1 <- bootcov(fit_only1, B = fit$B, coef.reps = "boot.Coef" %in%
names(fit))
}
# Get the coefficients processed with some advanced round part()
new_vars <- get_coef_and_ci(fit_only1, skip_intercept = FALSE)
}
unadjusted <- rbind(new_vars, unadjusted)
rownames(unadjusted)[1] <- "Intercept"
}
# If just one variable it's not a proper matrix
if (is.null(dim(adjusted))) {
both <- matrix(c(unadjusted, adjusted), nrow = 1)
} else {
both <- cbind(unadjusted, adjusted)
}
colnames(both) <- c("Crude", "2.5 %", "97.5 %", "Adjusted", "2.5 %", "97.5 %")
return(both)
}
I hope you'll find this as useful as I have.
Excellent work. Very cool.
Nice use of a photo of “Crude Awakening” from Burning Man 2007. Are you a burner?
No, heard about it. Mostly I just fell in love with the photo.
Check out my photos from the event 🙂
http://3dculture.com/bm3d
Wonderful work. The only thing I miss in the tables are p-values or the stars on the side… (or eventually, colouring those rows which are statistically significant)
Thanks. The main problem with p-values is that there is no pval() function that would correspond to the confint(). Every time I need raw p-values I look at the print/summary functions and it is difficult to generalize. I guess the vcov() could be used but I’m uncomfortable inventing my own pvalue calculator – I have for instance no idea how to calculate p-values for quantile regression.
One possible way is to take the full model, and then use the “update” command (on each of the variables in the formula) to remove them one at a time and then run an ANOVA to compare “full model” vs “full model minus one variable”. That is what I’ve been doing, and it works really nicely because it works seamlessly switching between categorical and single variables.
Here is the (bad) code I’ve been using:
fit <- lm(blah)
pvalues <- c()
for(i in variables){
smallFit <- eval(parse(text=paste("update(fit, . ~ . – ",i,")",sep="")))
test <- anova(fit, smallFit)
pvalues <- c(pvalues, test$p)
}
How Can I get package Greg? That would enable me get regression tables I need for my data but as is following your example, i am not able