Cox Proportional Hazards Survival Regression

Revised 10/24/2007 -- Better convergence properties for ill-conditioned data -- Thank you, Rupendra Chulyadyo!


This page analyzes survival-time data by the method of Proportional Hazards regression (Cox). Given survival times, final status (alive or dead), and one or more covariates, it produces a baseline survival curve, covariate coefficient estimates with their standard errors, risk ratios, 95% confidence intervals, and significance levels.

Background Info: (just what is Proportional Hazards Survival Regression, anyway?)

Survival analysis takes the survival times of a group of subjects (usually with some kind of medical condition) and generates a survival curve, which shows how many of the members remain alive over time. Survival time is usually defined as the length of the interval between diagnosis and death, although other "start" events (such as surgery instead of diagnosis), and other "end" events (such as recurrence instead of death) are sometimes used.

The major mathematical complication with survival analysis is that you usually do not have the luxury of waiting until the very last subject has died of old age; you normally have to analyze the data while some subjects are still alive. Also, some subjects may have moved away, and may be lost to follow-up. In both cases, the subjects were known to have survived for some amount of time (up until the time you last saw them), but you don't know how much longer they might ultimately have survived. Several methods have been developed for using this "at least this long" information to prepare unbiased survival curve estimates, the most common being the Life Table method and the method of Kaplan and Meier.
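As an illustration of how "at least this long" information enters such an estimate, here is a minimal JavaScript sketch of the Kaplan-Meier product-limit calculation. It is not the code this page runs; the function name kaplanMeier and the { time, status } record layout are assumptions made for the example. The times and statuses are those of the small example data set pre-loaded on this page.

```javascript
// Kaplan-Meier product-limit estimate (illustrative sketch, not this page's code).
// Each subject is { time, status }, with status 1 = died, 0 = alive or lost to follow-up.
function kaplanMeier(subjects) {
  // Distinct times at which at least one death occurred, in ascending order.
  const eventTimes = [...new Set(
    subjects.filter(s => s.status === 1).map(s => s.time)
  )].sort((a, b) => a - b);

  let survival = 1.0;
  const curve = [];
  for (const t of eventTimes) {
    // Everyone last seen at or after t was still at risk just before t,
    // so censored subjects count in the denominator up to their last-seen time.
    const atRisk = subjects.filter(s => s.time >= t).length;
    const deaths = subjects.filter(s => s.time === t && s.status === 1).length;
    survival *= 1 - deaths / atRisk;
    curve.push({ time: t, atRisk, deaths, survival });
  }
  return curve;
}

// Times and statuses from the pre-loaded example (covariate column omitted).
const sample = [
  [1, 0], [2, 1], [3, 0], [5, 0], [7, 1], [11, 0],
  [4, 0], [6, 0], [8, 0], [9, 1], [10, 1],
].map(([time, status]) => ({ time, status }));

console.log(kaplanMeier(sample));
```

For these eleven subjects the estimated survival drops at times 2, 7, 9 and 10, to roughly 0.90, 0.72, 0.48 and 0.24. A censored subject never causes a drop; it only shrinks the at-risk count for later deaths.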

We often need to know whether survival is influenced by one or more factors, called "predictors" or "covariates", which may be categorical (such as the kind of treatment a patient received) or continuous (such as the patient's age, weight, or the dosage of a drug). For simple situations involving a single factor with just two values (such as drug vs placebo), there are methods for comparing the survival curves for the two groups of subjects. But for more complicated situations we need a special kind of regression that lets us assess the effect of each predictor on the shape of the survival curve.

To understand the method of proportional hazards, first consider a "baseline" survival curve. This can be thought of as the survival curve of a hypothetical "completely average" subject -- someone for whom each predictor variable is equal to the average value of that variable for the entire set of subjects in the study. This baseline survival curve doesn't have to have any particular formula representation; it can have any shape whatever, as long as it starts at 1.0 at time 0 and descends steadily with increasing survival time.

The baseline survival curve is then systematically "flexed" up or down by each of the predictor variables, while still keeping its general shape. The proportional hazards method computes a coefficient for each predictor variable that indicates the direction and degree of flexing that the predictor has on the survival curve. Zero means that a variable has no effect on the curve -- it is not a predictor at all; a positive coefficient indicates that larger values of the variable are associated with greater mortality, and a negative coefficient indicates the opposite. Knowing these coefficients, we could construct a "customized" survival curve for any particular combination of predictor values. More importantly, the method provides a measure of the sampling error associated with each predictor's coefficient. This lets us assess which variables' coefficients are significantly different from zero; that is: which variables are significantly related to survival.
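The "flexing" can be made concrete. Under the proportional hazards model, with the baseline defined at the covariate means, the customized curve is the baseline survival raised to a power: S(t) = S0(t) ^ exp(b1*(x1 - m1) + b2*(x2 - m2) + ...). The JavaScript sketch below applies that formula; the function name, argument layout, and all numbers in the usage lines are hypothetical and chosen only for illustration.

```javascript
// Customized survival curve under the proportional hazards model (sketch).
// baseline:     array of { time, survival } for the "completely average" subject
// coefficients: fitted coefficients, one per predictor
// means:        mean of each predictor over the study subjects
// x:            predictor values of the subject of interest
function customizedSurvival(baseline, coefficients, means, x) {
  let linearPredictor = 0;
  for (let j = 0; j < coefficients.length; j++) {
    linearPredictor += coefficients[j] * (x[j] - means[j]);
  }
  // Hazard of this subject relative to the average subject.
  const relativeHazard = Math.exp(linearPredictor);
  return baseline.map(point => ({
    time: point.time,
    survival: Math.pow(point.survival, relativeHazard),
  }));
}

// Hypothetical numbers, for illustration only.
const exampleBaseline = [
  { time: 2, survival: 0.95 },
  { time: 7, survival: 0.80 },
];
// One predictor (say, age) with coefficient 0.05 and mean 50; subject aged 60.
console.log(customizedSurvival(exampleBaseline, [0.05], [50], [60]));
```

A subject whose predictors all equal the means gets an exponent of exp(0) = 1 and therefore the baseline curve itself. A positive coefficient combined with an above-average value gives an exponent greater than 1, which pulls the whole curve downward.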

Techie-stuff: (for those who might be interested)

This page contains a straightforward JavaScript implementation of a standard iterative method for Cox Proportional Hazards Survival Regression.

The log-likelihood function is maximized (equivalently, its negative is minimized) by Newton's method, with a very simple elimination algorithm to invert and solve the simultaneous equations. Central-limit estimates of parameter standard errors are obtained from the diagonal terms of the inverse matrix. 95% confidence intervals around the parameter estimates are obtained by a normal approximation. Risk ratios (and their confidence limits) are computed as exponential functions of the parameters (and their confidence limits). The baseline survival function is generated for each time point at which an event (death) occurred.
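The last steps of that calculation are simple enough to sketch. Given one coefficient and its standard error, the JavaScript below forms the normal-approximation 95% confidence interval and exponentiates the estimate and both limits to get the risk ratio and its limits. The function name and the numbers in the usage line are hypothetical; this is an illustration of the arithmetic, not the page's own code.

```javascript
// Risk ratio and 95% confidence limits from a coefficient and its
// standard error, using the normal approximation (illustrative sketch).
function riskRatioSummary(coefficient, standardError) {
  const z975 = 1.959964; // 97.5th percentile of the standard normal distribution
  const lower = coefficient - z975 * standardError;
  const upper = coefficient + z975 * standardError;
  return {
    coefficient,
    lower,
    upper,
    riskRatio: Math.exp(coefficient),
    riskRatioLower: Math.exp(lower),
    riskRatioUpper: Math.exp(upper),
  };
}

// Hypothetical values, for illustration only.
console.log(riskRatioSummary(0.5, 0.2));
```

Because the limits are exponentiated after the interval is formed, the interval for the risk ratio is not symmetric around the risk ratio itself. A coefficient interval that includes 0 corresponds to a risk ratio interval that includes 1.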

To decrease the chances of the iterations diverging for ill-conditioned data, a very simple "ramped-up fractional-shifts" modification of Newton's method is used: for the first iteration, only 1/10th of the calculated parameter adjustments is applied. For the second iteration, the fraction is 2/10ths, then 3/10ths, etc., until the full amount of the calculated adjustments is applied at the 10th and all subsequent iterations. For improved precision, the independent variables are temporarily converted to "standard scores": (value - Mean) / StdDev. The Null Model (all parameters = 0) is used as the starting guess for the iterations. Convergence is not guaranteed, but this page should work properly with most real-world data.
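The ramping scheme can be sketched for the simplest case of a single covariate, where no matrix inversion is needed. The JavaScript below is an illustration, not this page's code. The function name, the Breslow-style handling of tied death times, the n - 1 divisor in the standard deviation, and the stopping rule are all assumptions made for the example.

```javascript
// Cox regression with ONE covariate, fitted by Newton's method with
// ramped-up fractional steps (illustrative sketch under stated assumptions).
// rows: array of { x, time, status }, status 1 = died, 0 = censored.
function fitCoxOneCovariate(rows, maxIterations = 50, tolerance = 1e-8) {
  // Temporarily convert the covariate to standard scores.
  const n = rows.length;
  const mean = rows.reduce((sum, r) => sum + r.x, 0) / n;
  const sd = Math.sqrt(
    rows.reduce((sum, r) => sum + (r.x - mean) ** 2, 0) / (n - 1) // assumed n - 1 divisor
  );
  const data = rows.map(r => ({ z: (r.x - mean) / sd, time: r.time, status: r.status }));

  let beta = 0; // Null Model as the starting guess
  let information = NaN;
  let iterations = 0;
  for (let iter = 1; iter <= maxIterations; iter++) {
    let score = 0; // first derivative of the partial log-likelihood
    information = 0; // negative second derivative
    for (const died of data) {
      if (died.status !== 1) continue;
      let s0 = 0, s1 = 0, s2 = 0;
      for (const other of data) {
        if (other.time < died.time) continue; // not in the risk set
        const w = Math.exp(beta * other.z);
        s0 += w;
        s1 += w * other.z;
        s2 += w * other.z * other.z;
      }
      score += died.z - s1 / s0;
      information += s2 / s0 - (s1 / s0) ** 2;
    }
    // Apply 1/10 of the Newton step at iteration 1, 2/10 at iteration 2, ...,
    // and the full step from iteration 10 onward.
    const fraction = Math.min(1, iter / 10);
    const step = fraction * score / information;
    beta += step;
    iterations = iter;
    if (iter >= 10 && Math.abs(step) < tolerance) break; // assumed stopping rule
  }
  // Convert back from standard-score units to the covariate's own units.
  // The information is the value from the final iteration.
  return {
    beta: beta / sd,
    standardError: 1 / (sd * Math.sqrt(information)),
    iterations,
  };
}

// Tiny check case: four subjects, all deaths, binary covariate.
const checkRows = [
  { x: 1, time: 1, status: 1 },
  { x: 0, time: 2, status: 1 },
  { x: 0, time: 3, status: 1 },
  { x: 1, time: 4, status: 1 },
];
console.log(fitCoxOneCovariate(checkRows));
```

The four-subject case was chosen because its partial likelihood can be maximized by hand: setting the derivative to zero gives exp(beta) = (sqrt(5) - 1) / 2, so the fitted beta should agree with the natural log of that value. With more than one covariate, the score becomes a vector and the information a matrix, and each step requires solving a set of simultaneous equations, as described above.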

There are no predefined limits to the number of variables or cases this page can handle. The actual limits are probably dependent on your browser's available memory. I have run this program successfully on data sets with 4 predictors and over 1,200 cases.

The fields below are pre-loaded with a very simple example.

Instructions:

Enter the number of data points:
Enter the number of covariates (predictors): Normally 1 or more. If 0 (no predictor variables at all) is specified, the baseline survival function will be the same as the Kaplan-Meier survival curve.
Type (or paste) the [x values, time, status] data:
50,1,0
70,2,1
45,3,0
35,5,0
62,7,1
50,11,0
45,4,0
57,6,0
32,8,0
57,9,1
60,10,1
Use a separate row for each data point. The covariates (predictors) should come first, followed by the survival time, followed by the last-seen-status variable (1 if died, 0 if still alive or lost to follow-up). Values should be separated by commas or tabs. You can copy data from another program, like a spreadsheet, and paste it into the window above. It may come in as tab-delimited text (without commas), but this will not be a problem; the program will convert tabs to commas during the computations.
Click the button. The results will (eventually) appear below:
To print out results, copy and paste the contents of the Output window above into a word processor or text editor, then Print. For best appearance, specify a fixed-width font like Courier.
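For readers preparing data outside this page, the row format described in the instructions (covariates first, then survival time, then status, separated by commas or tabs) can be parsed along the following lines. This is a hypothetical JavaScript sketch, not the page's own parsing code; the function name, the returned object layout, and the error messages are assumptions.

```javascript
// Parse pasted survival data: one row per subject, covariates first,
// then survival time, then status (1 = died, 0 = alive or lost to follow-up).
// Illustrative sketch only.
function parseSurvivalData(text, covariateCount) {
  return text
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line.length > 0) // skip blank lines
    .map((line, index) => {
      // Treat tabs as commas, then split into numeric fields.
      const fields = line
        .replace(/\t/g, ",")
        .split(",")
        .map(f => (f.trim() === "" ? NaN : Number(f)));
      if (fields.length !== covariateCount + 2 || fields.some(v => Number.isNaN(v))) {
        throw new Error(`Row ${index + 1}: expected ${covariateCount + 2} numeric values`);
      }
      return {
        covariates: fields.slice(0, covariateCount),
        time: fields[covariateCount],
        status: fields[covariateCount + 1],
      };
    });
}

// Two rows of the pre-loaded example, one comma-separated and one tab-separated.
console.log(parseSurvivalData("50,1,0\n70\t2\t1\n", 1));
```

Rejecting rows with the wrong number of fields catches the most common pasting mistake, a mismatch between the stated number of covariates and the columns actually supplied.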

Reference: Statistical Models and Methods for Lifetime Data, by J. F. Lawless. 1982, John Wiley & Sons, New York.


Send e-mail to John C. Pezzullo at statpages.org@gmail.com