Too crude to be true?

Harvesting the magical update() function. The image is CC by Waldemar Horwat.

The key to programming is being lazy; it has actually been called a virtue by some. When I discovered the update() function it blew me away. Before long I had created a monster based upon this tiny function, allowing quick and easy output of regression tables that contain crude and adjusted estimates. In this post I'll show you how to tame the printCrudeAndAdjustedModel() function in my Greg package and give a little peek behind the scenes.

Although not always included, I think a regression table that shows both the crude and adjusted estimates conveys important information. The crude estimate is also referred to as unadjusted; if it drops sharply in the adjusted model, we can suspect that a strong confounder steals some of the effect from that variable.
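
To make this concrete, here is a hypothetical toy simulation (not from the post's data set; the numbers are made up for illustration): a strong confounder z inflates the crude estimate for x, and adjusting for z recovers the true effect.

# Hypothetical illustration: z confounds the x-y relationship
set.seed(123)
n <- 1000
z <- rnorm(n)                      # the confounder
x <- z + rnorm(n)                  # exposure, correlated with z
y <- 0.2 * x + 0.8 * z + rnorm(n)  # z carries most of the effect on y
coef(lm(y ~ x))["x"]      # crude estimate, inflated to roughly 0.6
coef(lm(y ~ x + z))["x"]  # adjusted estimate, close to the true 0.2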

A simple walk-through

We will start with basic usage of the function. For this I’m using the colon data set from the survival package. So first we prepare our data by generating a few factors:

library(survival)
data(colon)
# Choose the ones with death as outcome
colon <- subset(colon, etype == 2)
# Change into years
colon$time <- colon$time/365.25
# Turn these binary variables into factors, otherwise the output will treat
# them as numeric (even though the regression itself is still correct)
colon$sex <- factor(colon$sex, labels = c("Female", "Male"))
colon$perfor <- factor(colon$perfor, labels = c("No", "Yes"))
colon$obstruct <- factor(colon$obstruct, labels = c("No", "Yes"))

If the names are a little too shorthand and we want prettier names, it is easy to set a label using the Hmisc label() function:

library(Hmisc, verbose = FALSE)
label(colon$rx) <- "Treatment"
label(colon$perfor) <- "Colon perforation"
label(colon$age) <- "Age (years)"
label(colon$sex) <- "Sex"

The regression model is set up in the regular way. I use the rms package for a lot of my work, but regular lm(), glm() and coxph() should work just as well. If you're not familiar with the rms package: it wants a summary of the data set stored in a datadist object.

library(rms, verbose = FALSE)
dd <- datadist(colon[, c("time", "status", "rx", "sex", "obstruct", "perfor", 
    "age")])
options(datadist = "dd")
sn <- Surv(colon$time, colon$status)
fit <- cph(sn ~ rx + sex + perfor + age, data = colon)

And now it is time for my monster function. I like having the reference category in the table, so I always set the add_references option:

library(Greg)
printCrudeAndAdjustedModel(fit, add_references = TRUE, ctable = TRUE)
Variable            Crude                     Adjusted
                    HR     2.5 % to 97.5 %    HR     2.5 % to 97.5 %
Treatment
  Obs               1      ref                1      ref
  Lev               0.98   0.84 to 1.14       0.98   0.84 to 1.14
  Lev+5FU           0.64   0.55 to 0.76       0.64   0.55 to 0.76
Sex
  Female            1      ref                1      ref
  Male              0.97   0.85 to 1.10       0.95   0.83 to 1.08
Colon perforation
  No                1      ref                1      ref
  Yes               1.30   0.92 to 1.85       1.30   0.91 to 1.85
Age (years)         1.00   0.99 to 1.00       1.00   0.99 to 1.00

In event data like this it is also nice to have some information about the variables' raw distribution, and we can easily combine the regression model with descriptive data by adding the desc_column option:

printCrudeAndAdjustedModel(fit, 
                           desc_column=TRUE,
                           add_references=TRUE, 
                           ctable=TRUE)
Variable            Total            Event            Crude                     Adjusted
                                                      HR     2.5 % to 97.5 %    HR     2.5 % to 97.5 %
Treatment
  Obs               630              345 (54.76 %)    1      ref                1      ref
  Lev               620              333 (53.71 %)    0.98   0.84 to 1.14       0.98   0.84 to 1.14
  Lev+5FU           608              242 (39.80 %)    0.64   0.55 to 0.76       0.64   0.55 to 0.76
Sex
  Female            890              444 (49.89 %)    1      ref                1      ref
  Male              968              476 (49.17 %)    0.97   0.85 to 1.10       0.95   0.83 to 1.08
Colon perforation
  No                1804             888 (49.22 %)    1      ref                1      ref
  Yes               54               32 (59.26 %)     1.30   0.92 to 1.85       1.30   0.91 to 1.85
Age (years)         59.75 (± 11.95)  59.46 (± 12.35)  1.00   0.99 to 1.00       1.00   0.99 to 1.00

The function does not yet handle variables wrapped in functions. Currently it skips these by removing from the formula any variable that matches the regular expression "^(ns|rcs|bs|pspline)[(]". In the future I imagine it could be possible to add something that uses the rms:::contrast() function, although that will probably wait until the need arises.
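
To see the skip rule in action, here is a quick illustrative check of which term labels that regular expression catches:

# Spline-wrapped terms match the skip pattern, plain variables do not
grepl("^(ns|rcs|bs|pspline)[(]", c("rcs(age, 3)", "pspline(age)", "age"))
## [1]  TRUE  TRUE FALSE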

What about the update()?

As I mentioned in the introduction, the update() function is at the heart of all this. It can, on the fly, add and remove variables, swap data sets, and change other options of your model.

# Now let's keep everything the same and add the obstruct variable
update(fit, . ~ . + obstruct)
## 
## Cox Proportional Hazards Model
## 
## cph(formula = sn ~ rx + sex + perfor + age + obstruct, data = colon)
## 
##                      Model Tests       Discrimination    
##                                           Indexes        
## Obs       1858    LR chi2     44.77    R2       0.024    
## Events     920    d.f.            6    Dxy      0.126    
## Center -0.1992    Pr(> chi2) 0.0000    g        0.245    
##                   Score chi2  43.75    gr       1.277    
##                   Pr(> chi2) 0.0000                      
## 
##              Coef    S.E.   Wald Z Pr(>|Z|)
## rx=Lev       -0.0193 0.0769 -0.25  0.8022  
## rx=Lev+5FU   -0.4368 0.0840 -5.20  <0.0001 
## sex=Male     -0.0440 0.0662 -0.66  0.5066  
## perfor=Yes    0.1992 0.1819  1.10  0.2735  
## age          -0.0012 0.0028 -0.44  0.6606  
## obstruct=Yes  0.2120 0.0817  2.60  0.0094

# We can also remove a variable
update(fit, . ~ . - age)
## 
## Cox Proportional Hazards Model
## 
## cph(formula = sn ~ rx + sex + perfor, data = colon)
## 
##                      Model Tests       Discrimination    
##                                           Indexes        
## Obs       1858    LR chi2     37.85    R2       0.020    
## Events     920    d.f.            4    Dxy      0.096    
## Center -0.1712    Pr(> chi2) 0.0000    g        0.214    
##                   Score chi2  36.43    gr       1.238    
##                   Pr(> chi2) 0.0000                      
## 
##            Coef    S.E.   Wald Z Pr(>|Z|)
## rx=Lev     -0.0205 0.0769 -0.27  0.7895  
## rx=Lev+5FU -0.4430 0.0839 -5.28  <0.0001 
## sex=Male   -0.0522 0.0661 -0.79  0.4297  
## perfor=Yes  0.2674 0.1800  1.49  0.1374

When I want the crude estimates I simply remove all the other variables:

update(fit, . ~ rx)
## 
## Cox Proportional Hazards Model
## 
## cph(formula = sn ~ rx, data = colon)
## 
##                      Model Tests       Discrimination    
##                                           Indexes        
## Obs       1858    LR chi2     35.23    R2       0.019    
## Events     920    d.f.            2    Dxy      0.089    
## Center -0.1512    Pr(> chi2) 0.0000    g        0.194    
##                   Score chi2  33.63    gr       1.215    
##                   Pr(> chi2) 0.0000                      
## 
##            Coef    S.E.   Wald Z Pr(>|Z|)
## rx=Lev     -0.0209 0.0768 -0.27  0.7859  
## rx=Lev+5FU -0.4408 0.0839 -5.25  <0.0001

Nice, isn't it? I find it really useful when I'm comparing AIC/BIC values, testing interactions, and so on.
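
As a small sketch of that workflow (the sex:perfor interaction is chosen arbitrarily for illustration), you can bolt an interaction onto the model with update() and compare the nested fits:

# Add an arbitrary interaction and compare the nested models
fit_int <- update(fit, . ~ . + sex:perfor)
AIC(fit)              # information criterion for the original model
AIC(fit_int)          # ... and for the extended model
lrtest(fit, fit_int)  # likelihood ratio test from the rms package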

Where is the monster?

While the update() function is appealing in its simplicity, much more needed to be in place for this to work. The underlying function has to figure out the type of regression, loop through the variables, and check for the intercept. It still needs some work; this was one of the first functions I created for my package, so the coding is a little clumsy - feel free to suggest improvements. You can find the raw code under my GitHub account.

#' This function helps with printing regression models
#' 
#' This function is used for getting the adjusted and unadjusted values
#' for a regression model. It takes a full model, then walks through each
#' variable, removing all other variables from the regression and
#' rerunning it to get the unadjusted value. This function is not
#' intended for direct use; it's better to use \code{\link{printCrudeAndAdjustedModel}},
#' which utilizes this function.
#' 
#' This function saves a lot of time creating tables since it compiles a fully
#' unadjusted list of all your used covariates.
#' 
#' If the model is a poisson/logit/cox regression (i.e. the coefficients are on
#' a log scale) then it automatically reports the exp() values instead of the
#' raw coefficients.
#' 
#' By default the function skips all spline variables since these become very
#' complicated and there is no simple \deqn{\beta}{beta} to display. For the same
#' reason it skips any interaction variables; it's probably better to display
#' these as a contrast table.
#' 
#' @param fit The regression model
#' @return Returns a matrix with the columns c('Crude', '95\% CI', 'Adjusted', '95\% CI').
#'   The rows are not changed from the original fit.
#'
#' @seealso \code{\link{printCrudeAndAdjustedModel}}
#' 
#' @example examples/getCrudeAndAdjustedModelData_example.R
#' 
#' @import boot
#' 
#' @author max
#' @export
getCrudeAndAdjustedModelData <- function(fit) {

    if ("Glm" %in% class(fit)) 
        stop("The rms::Glm does not work properly with this function, please use regular glm instead")

    # Just a prettifier for the output. An alternative could be:
    # paste(round(x[,1],1), ' (95% CI ', min(round(x[,2:3])), '-',
    # max(round(x[,2:3])), ')', sep='')
    get_coef_and_ci <- function(fit, skip_intercept = FALSE) {
        # Get the coefficients
        my_coefficients <- coef(fit)
        coef_names <- names(my_coefficients)

        # confint() isn't implemented for the rms package, so we need a
        # work-around
        if ("rms" %in% class(fit) && "ols" %nin% class(fit)) {
            # The summary function has to be invoked for confint() to work, but
            # summary.rms doesn't return a profile.glm()-compatible list; it
            # returns a matrix. It does, however, work OK for ordinary least
            # squares regression
            if ("lrm" %in% class(fit)) {
                warning("Using the confint.default, i.e. Wald statistics, and it may not be optimal but profile CI is not implemented for the lrm, an option could be to change to the Glm with the family=binomial")
                ci <- confint.default(fit)
            } else if ("Glm" %in% class(fit)) {
                # The summary.rms is causing issues when profiling, it works fine when
                # summary.glm is issued, we need to change class for that
                class(fit) <- class(fit)["rms" != class(fit)]
                ci <- confint(fit)
            } else if ("cph" %in% class(fit)) {
                # I haven't found any large issues with this call
                ci <- confint(fit)
            } else {
                # Untested option try confint for better or worse
                warning("The function has not been tested with this type of regression object, may not work properly.")
                ci <- confint(fit)
            }
        } else {
            ci <- confint(fit)
        }

        if (skip_intercept) {
            intercept <- grep("Intercept", coef_names)
            if (length(intercept)) {
                my_coefficients <- my_coefficients[-intercept]
                ci <- ci[-intercept, ]
                coef_names <- coef_names[-intercept]
            }
        }

        # Use the exp() if logit or cox regression
        if ("coxph" %in% class(fit) || ("glm" %in% class(fit) && fit$family$link %in% 
            c("logit", "log"))) {
            my_coefficients <- exp(my_coefficients)
            ci <- exp(ci)
        }

        if (length(my_coefficients) > 1) {
            ret_val <- cbind(my_coefficients, ci)
        } else {
            ret_val <- matrix(c(my_coefficients, ci), nrow = 1)
        }

        colnames(ret_val) <- c("", "2.5%", "97.5%")
        rownames(ret_val) <- coef_names
        return(ret_val)
    }

    all.terms <- terms(fit)
    var_names <- attr(all.terms, "term.labels")

    # Skip variables consisting of functions such as spline, strata variables
    regex_for_unwanted_vars <- "^(ns|rcs|bs|pspline)[(]"
    skip_variables <- grep(regex_for_unwanted_vars, var_names)
    skip_variables_names <- c()
    if (length(skip_variables) > 0) {
        for (vn in skip_variables) {
            # Extract the variable name inside the function call (the capture
            # group name is arbitrary as it is accessed by position below)
            match <- gregexpr("\\w+\\((?<var>\\w+)", var_names[vn], perl = TRUE)[[1]]
            if (match[1] >= 0) {
                st <- attr(match, "capture.start")[1, ]
                skip_variables_names <- append(skip_variables_names, substring(var_names[vn], 
                  st, st + attr(match, "capture.length")[1, ] - 1))
            }
        }
    }

    # Skips the interaction terms since they should be displayed in a different
    # fashion
    raw_int_terms <- grep("[[:alnum:]]+:[[:alnum:]]+", var_names)
    if (length(raw_int_terms) > 0) {
        require("stringr")
        for (i in raw_int_terms) {
            for (name in str_split(var_names[i], ":")[[1]]) {
                skip_variables_names <- append(skip_variables_names, name)
                skip_variables <- union(skip_variables, grep(name, var_names))
            }
        }
    }

    skip_variables <- union(skip_variables, grep("^strat[a]{0,1}\\(", var_names))

    # Get the adjusted variables
    adjusted <- get_coef_and_ci(fit)

    # When using splines, e.g. rcs() in a Cox regression, the coefficient rows
    # look a little different - remove all the spline, rcs etc. rows
    rn <- rownames(adjusted)
    remove <- grep("('{1,}|[[][0-9]+[]]|[)][0-9]+)$", rn)
    for (name in skip_variables_names) {
        m <- grep(name, rn)
        if (length(m) > 0) 
            remove <- union(remove, m)
    }

    # Remove the unwanted rows if any found
    if (length(remove) > 0) {
        adjusted <- adjusted[-remove, ]
    }

    unadjusted <- c()
    if (length(skip_variables) > 0) {
        variables_to_check <- var_names[-skip_variables]
    } else {
        variables_to_check <- var_names
    }
    if (length(variables_to_check) == 0) 
        stop("You have no variables that can be displayed as adjusted/unadjusted since they all are part of an interaction, spline or strata")

    for (variable in variables_to_check) {
        interaction_variable <- length(grep(":", variable)) > 0

        # If it's an interaction variable the interacting parts have to be included
        if (interaction_variable) {
            variable <- paste(paste(unlist(strsplit(variable, ":")), sep = "", 
                collapse = " + "), variable, sep = " + ")
        }

        # Run the same fit but with only one variable
        fit_only1 <- update(fit, paste(".~", variable))
        if ("boot.coef" %in% names(fit)) {
            # Redo bootstrap if this was a bootstrapped rms model by rerunning the
            # bootcov part
            fit_only1 <- bootcov(fit_only1, B = fit$B, coef.reps = "boot.Coef" %in% 
                names(fit))
        }

        # Get the coefficients and confidence intervals for the single-variable fit
        new_vars <- get_coef_and_ci(fit_only1, skip_intercept = TRUE)

        # If interaction then we should only keep the interaction part - the other
        # variables are always included by default and need therefore to be removed
        if (interaction_variable) {
            new_vars <- new_vars[grep("[*:]", rownames(new_vars)), ]
        }

        # Add them to the previous
        unadjusted <- rbind(unadjusted, new_vars)
    }

    # If regression contains Intercept this is meaningless for the comparison of
    # adjusted and unadjusted values
    if (length(grep("Intercept", rownames(adjusted))) > 0) {
        if ("ols" %in% class(fit)) {
            # The simple .~1 doesn't work for me with the ols and I've therefore created
            # this workaround
            if ("boot.coef" %in% names(fit)) {
                if ("y" %nin% names(fit)) {
                  warning("If you want the intercept on a bootstrapped ols() model then you need to specify y=TRUE or else you get NA in the unadjusted estimates")
                  new_vars <- rep(NA, times = 3)
                } else {
                  require("boot") || stop("`boot' package not found - it's needed for the unadjusted estimate of the intercept")
                  ci <- boot.ci(boot(fit$y, function(x, i) mean(x[i], na.rm = TRUE), 
                    R = fit$B, stype = "i"), type = ifelse("boot.Coef" %in% 
                    names(fit), "perc", "norm"))
                  new_vars <- c(mean(fit$y, na.rm = TRUE), ci$percent[4:5])
                }
            } else {
                # Use the upper-tail quantile so the interval comes out in
                # (lower, upper) order
                new_vars <- c(mean(fit$y, na.rm = TRUE), 
                  mean(fit$y, na.rm = TRUE) + c(-1, 1) * qnorm(1 - 0.05/2) * 
                    sd(fit$y, na.rm = TRUE)/sqrt(sum(!is.na(fit$y))))
            }

        } else {
            # Run the same fit but with only one variable
            fit_only1 <- update(fit, ".~ 1")
            if ("boot.coef" %in% names(fit)) {
                # Redo bootstrap if this was a bootstrapped rms model by rerunning the
                # bootcov part
                fit_only1 <- bootcov(fit_only1, B = fit$B, coef.reps = "boot.Coef" %in% 
                  names(fit))
            }

            # Get the coefficients and confidence intervals for the intercept-only fit
            new_vars <- get_coef_and_ci(fit_only1, skip_intercept = FALSE)
        }

        unadjusted <- rbind(new_vars, unadjusted)
        rownames(unadjusted)[1] <- "Intercept"
    }

    # If just one variable it's not a proper matrix
    if (is.null(dim(adjusted))) {
        both <- matrix(c(unadjusted, adjusted), nrow = 1)
    } else {
        both <- cbind(unadjusted, adjusted)
    }
    colnames(both) <- c("Crude", "2.5 %", "97.5 %", "Adjusted", "2.5 %", "97.5 %")
    return(both)
}

I hope you'll find this as useful as I have.

8 thoughts on “Too crude to be true?”

  1. Wonderful work. The only thing I miss in the tables is p-values or significance stars on the side… (or possibly colouring the rows that are statistically significant)

    • Thanks. The main problem with p-values is that there is no pval() function corresponding to confint(). Every time I need raw p-values I dig into the print/summary functions, and it is difficult to generalize. I guess vcov() could be used, but I'm uncomfortable inventing my own p-value calculator – I have, for instance, no idea how to calculate p-values for quantile regression.
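
      A minimal sketch of that vcov() idea (hypothetical, assuming plain Wald statistics are appropriate for the model at hand):

      # Hypothetical helper: Wald p-values from coef() and vcov()
      wald_p <- function(fit) {
          est <- coef(fit)
          se <- sqrt(diag(vcov(fit)))
          2 * pnorm(-abs(est/se))
      }
      wald_p(fit)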

      • One possible way is to take the full model, then use the update() command on each of the variables in the formula to remove them one at a time, and run an ANOVA comparing the full model vs the full model minus one variable. That is what I've been doing, and it works really nicely because it handles categorical and single variables seamlessly.

        Here is the (bad) code I’ve been using:

        fit <- lm(blah)  # the full model (placeholder)

        variables <- attr(terms(fit), "term.labels")
        pvalues <- c()
        for (i in variables) {
            # update() accepts a formula object; no eval(parse()) needed
            smallFit <- update(fit, as.formula(paste(". ~ . -", i)))
            test <- anova(fit, smallFit)
            pvalues <- c(pvalues, test[["Pr(>F)"]][2])
        }

  2. How can I get the Greg package? It would enable me to get the regression tables I need for my data, but when following your example as-is I am not able to.
