Data Examples for Logistic Regression

by Kevin M. Sullivan

Version 2015-03-15


The examples below show how to format data for entry.  All examples are based on the Evans County data set described in Kleinbaum, Kupper, and Morgenstern, Epidemiologic Research: Principles and Quantitative Methods, New York: Van Nostrand Reinhold, 1982.  The Evans County study was a cohort study of men followed for 7 years.  The example data are also available as text files so that the user can cut and paste them into the Data Window.

Data can be in one of two formats: records at the individual level (one record for each individual or other unit of analysis) or summary data, such as the number of individuals at each exposure level without disease and the number with disease.  The values on one line must be separated by a tab or a comma; the examples below use commas.  These examples first describe data at the individual level, and then describe summary data.

Data at the individual level, one exposure variable

Enter or paste into the Data Window a dichotomous exposure variable (coded as 1 for exposed and 0 for unexposed) and the outcome variable (coded as 1 for those with the outcome and 0 for those without the outcome), with the two values separated by a "," or a tab.  For example, in assessing the relationship between an elevated catecholamine level (the exposure of interest, 1 = elevated and 0 = normal) and coronary heart disease (CHD, the outcome of interest), the records would be formatted as numeric values for:

exposure variable value, outcome variable value

For this example data the number of data points is 609 and the number of predictor variables is 1.  The first 10 records from the example data are shown below:

0, 0
0, 0
1, 1
1, 0
0, 0
0, 0
0, 1
0, 0
0, 0
0, 0 

... (plus 599 additional lines)
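As a rough illustration of this record layout, the following Python sketch reads comma- or tab-separated lines and treats the last value on each line as the outcome. The function name `parse_records` is hypothetical and is not part of the web program; the sample text is just the ten records listed above.

```python
import re

def parse_records(text):
    """Split each non-empty line on commas or tabs into numeric values.

    Returns a list of (predictors, outcome) pairs. The last value on each
    line is taken as the outcome and must be coded 0 or 1.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        values = [float(v) for v in re.split(r"[,\t]", line)]
        *predictors, outcome = values
        if outcome not in (0.0, 1.0):
            raise ValueError(f"outcome must be 0 or 1, got {outcome!r}")
        records.append((predictors, int(outcome)))
    return records

# The first ten catecholamine, CHD records shown above.
sample = """0, 0
0, 0
1, 1
1, 0
0, 0
0, 0
0, 1
0, 0
0, 0
0, 0"""

records = parse_records(sample)
print(len(records), "records;", sum(y for _, y in records), "with the outcome")
```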

 

The full data file as a text file can be found here.  The results of the analysis would be:

Odds Ratios and 95% Confidence Intervals...
Variable  O.R.    Low -- High
1         2.8615 1.6878 4.8514

 

The interpretation would be that individuals with elevated catecholamine levels have 2.8615 times the odds of developing CHD compared to individuals with normal catecholamine levels.

[A note on coding the exposure variable:  The above example coded the exposed as 1 and unexposed as 0, and the odds ratio was calculated by comparing the odds of disease for those coded as 1 to the odds for those coded as 0; those coded as 0 are the referent group.  If you code the exposure as 1 and 2, the smaller number will be treated as the referent group, which in this example is 1.  The odds ratio for a 2/1 coding scheme would be the odds of disease for those coded as 2 compared to the odds in those coded as 1.]

[A note on coding the outcome variable: The outcome variable must be coded as 1 for those with the outcome and 0 for those without the outcome.]

If the exposure variable is continuous, you can use the numeric value (which assumes the relationship is linear on the logit scale).  For example, in assessing the relationship between age and CHD, the number of data points is 609 and the number of predictor variables is 1, and the first ten records would look as shown below (the data as a text file can be found here):

56, 0
43, 0
56, 1
64, 0
49, 0
46, 0
52, 1
63, 0
42, 0
55, 0

... (plus 599 additional lines)

The results of the analysis would be:

Odds Ratios and 95% Confidence Intervals...
Variable  O.R.    Low -- High
1         1.0454 1.0189 1.0727

The interpretation would be that for every one year increase in age, the odds of CHD increased by a factor of 1.0454 (or by about 4.5%).
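Because the model is linear on the logit scale, the odds ratio for an increase of more than one year is the one-year odds ratio raised to that power. The short Python sketch below illustrates the arithmetic; the 10-year increase is an arbitrary choice for illustration and the function name is hypothetical.

```python
or_per_year = 1.0454  # odds ratio per one-year increase in age, from the output above

def odds_ratio_for_increase(or_per_unit, units):
    """Odds ratio implied by a linear-logit model for an increase of `units`."""
    return or_per_unit ** units

percent_per_year = (or_per_year - 1) * 100
or_per_decade = odds_ratio_for_increase(or_per_year, 10)
print(f"{percent_per_year:.2f}% per year; odds ratio for 10 years = {or_per_decade:.2f}")
```

Under this model a 10-year difference in age corresponds to an odds ratio of roughly 1.56, not 10 times the one-year percentage.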

Data at the individual level, two exposure variables - no interaction model

If there is more than one exposure variable, list the exposure variables first and the outcome variable last.  For example, if the investigator wants to determine the simultaneous effect of catecholamine and cigarette smoking (1=smoker, 0=nonsmoker) on CHD, the data would be:

first exposure variable value, second exposure variable value, outcome variable value

For this example data the number of data points is 609 and the number of predictor variables is 2.  The first 10 records from the example data are shown below, with the variables being catecholamine, smoking, and CHD (the data as a text file can be found here):

0, 0, 0
0, 1, 0
1, 1, 1
1, 1, 0
0, 1, 0
0, 1, 0
0, 1, 1
0, 0, 0
0, 1, 0
0, 0, 0

... (plus 599 additional lines)

The results of the analysis would be:

Odds Ratios and 95% Confidence Intervals...
Variable   O.R.   Low -- High
1         2.9074 1.7079 4.9492
2         2.0000 1.1206 3.5695

The interpretation would be that individuals with an elevated catecholamine level ("Variable 1" in the above output) have about 2.9 times the odds of CHD of those with normal catecholamine levels, controlling for cigarette smoking.  Cigarette smokers ("Variable 2" in the above output) have twice the odds (2.0) of CHD compared to nonsmokers, controlling for catecholamine (elevated vs. normal).

Data at the individual level, two exposure variables - interaction model

If you would like to assess the interaction between two variables, the model needs an interaction term.  Using the data from the previous example, the question might be whether cigarette smoking modifies the catecholamine->CHD relationship.  The interaction term is simply the value for catecholamine multiplied by the value for smoking; with these two dichotomous variables there are only four possibilities:

Catecholamine       Smoking       Interaction
      1         x      1      =        1
      1         x      0      =        0
      0         x      1      =        0
      0         x      0      =        0

The data would be in the following format:

first exposure variable value, second exposure variable value, interaction value, outcome variable value

For this example data the number of data points is 609 and the number of predictor variables is 3.  The first 10 records from the example data are shown below with the variables being catecholamine, smoking, the catecholamine-smoking interaction, and CHD and the data file as text can be found here:

0, 0, 0, 0
0, 1, 0, 0
1, 1, 1, 1
1, 1, 1, 0
0, 1, 0, 0
0, 1, 0, 0
0, 1, 0, 1
0, 0, 0, 0
0, 1, 0, 0
0, 0, 0, 0

... (plus 599 additional lines)
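If the data already exist as catecholamine, smoking, CHD records, the interaction column can be generated rather than typed by hand. The following is a minimal Python sketch; the helper name `add_interaction` is hypothetical, and the rows are the first four records of the two-exposure example.

```python
def add_interaction(rows):
    """Insert exposure1 * exposure2 before the outcome in each (e1, e2, outcome) row."""
    return [(e1, e2, e1 * e2, outcome) for e1, e2, outcome in rows]

# First four catecholamine, smoking, CHD records from the no-interaction example.
rows = [(0, 0, 0), (0, 1, 0), (1, 1, 1), (1, 1, 0)]

for row in add_interaction(rows):
    print(", ".join(str(value) for value in row))
```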

The results of the analysis would be:

Coefficients and Standard Errors...
Variable   Coeff. StdErr    p
1          1.3953 0.5187 0.0072
2          0.8653 0.3864 0.0251
3         -0.4498 0.6092 0.4603
Intercept -2.9267

Odds Ratios and 95% Confidence Intervals...
Variable    O.R.   Low -- High
1          4.0360 1.4601 11.1562
2          2.3758 1.1141 5.0661
3          0.6377 0.1932 2.1049

The interpretation would be that the interaction is not statistically significant (p-value for variable 3 = 0.4603) and could be removed from the model.  Another way to tell that the interaction is not significant is to look at the confidence interval for the interaction term's odds ratio: the null value (when there is no interaction) is 1, and the 95% confidence interval runs from 0.1932 to 2.1049, which includes the "null value" of 1.
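The two parts of the output are linked: each odds ratio is the exponential of its coefficient, and the interval and p-value follow from the coefficient and its standard error. The Python sketch below assumes Wald-type intervals and tests with a critical value of 1.96; that assumption is not stated in the output itself, but it reproduces the printed values to within rounding of the four-decimal inputs.

```python
import math

Z_95 = 1.96  # assumed normal critical value for a 95% interval

def wald_summary(coef, stderr):
    """Return (odds ratio, lower limit, upper limit, two-sided p) for one coefficient."""
    odds_ratio = math.exp(coef)
    lower = math.exp(coef - Z_95 * stderr)
    upper = math.exp(coef + Z_95 * stderr)
    z = coef / stderr
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return odds_ratio, lower, upper, p_value

# Coefficients and standard errors copied from the output above.
coefficients = {1: (1.3953, 0.5187), 2: (0.8653, 0.3864), 3: (-0.4498, 0.6092)}

for variable, (coef, stderr) in coefficients.items():
    odds_ratio, lower, upper, p_value = wald_summary(coef, stderr)
    print(f"{variable}  {odds_ratio:.4f}  {lower:.4f}  {upper:.4f}  p={p_value:.4f}")
```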

Summary data, one exposure variable

This program can also analyze summary data.  For example, the table below summarizes information on 609 individuals by exposure (catecholamine) and disease (CHD):

Elevated Catecholamine?       CHD (Disease variable)
(Exposure variable)           Yes (1)     No (0)
     Yes (1)                     27          95
     No (0)                      44         443

The data can be entered as summary data in two lines in the format:

exposure variable level, number without disease at this exposure level, number with disease at this exposure level

For this example, the number of data points is 2 and the number of predictor variables is 1; also check the summary data box.  The complete example data are shown below, with the variables being exposure category, number without CHD in the exposure category, and number with CHD in the exposure category.  You can copy these data and paste them into the Data Window.

1, 95, 27
0, 443, 44

The results of the analysis would be as follows, exactly the same as in the individual-level, one exposure variable example shown previously, which is based on the same data.

Odds Ratios and 95% Confidence Intervals...
Variable  O.R.    Low -- High
1         2.8615 1.6878 4.8514
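With a single dichotomous exposure, the same odds ratio can be checked by hand from the 2 x 2 table as the cross-product ratio. The Python sketch below does this and adds an interval based on the usual standard error of the log odds ratio; the critical value of 1.96 and the function name are assumptions made for illustration.

```python
import math

def odds_ratio_2x2(with_exposed, without_exposed, with_unexposed, without_unexposed, z=1.96):
    """Cross-product odds ratio with a confidence interval on the log scale."""
    odds_ratio = (with_exposed * without_unexposed) / (without_exposed * with_unexposed)
    se_log_or = math.sqrt(
        1 / with_exposed + 1 / without_exposed + 1 / with_unexposed + 1 / without_unexposed
    )
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or)

# Summary lines are: exposure level, number without disease, number with disease.
without_exposed, with_exposed = 95, 27        # line "1, 95, 27"
without_unexposed, with_unexposed = 443, 44   # line "0, 443, 44"

odds_ratio, lower, upper = odds_ratio_2x2(
    with_exposed, without_exposed, with_unexposed, without_unexposed
)
print(f"{odds_ratio:.4f}  {lower:.4f}  {upper:.4f}")
```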

Summary data, two exposure variables

This example describes a situation with two exposure variables, one considered the primary exposure of interest and the other a potential effect modifier, confounder, significant independent exposure, or none of these.  As an example, investigators are interested in the relationship between an elevated catecholamine level and CHD, but want to determine whether this relationship is affected by the smoking status of the individual.  The data are as follows:

Smoke = Yes (1)
Elevated Catecholamine?       CHD (Disease variable)
(Exposure variable)           Yes (1)     No (0)
     Yes (1)                     19          58
     No (0)                      35         275

Smoke = No (0)
Elevated Catecholamine?       CHD (Disease variable)
(Exposure variable)           Yes (1)     No (0)
     Yes (1)                      8          37
     No (0)                       9         168

First, to see if smoking modifies the catecholamine->CHD relationship, enter data to determine if the interaction between catecholamine and smoking is statistically significant.  The interaction level would be determined similarly to that described previously.

exposure variable 1 level, exposure variable 2 level, interaction level, number without disease at this level, number with disease at this level

For this example, the number of data points is 4 and the number of predictor variables is 3; also check the summary data box.  The complete example data are shown below, with the variables being catecholamine category, smoking category, interaction category, number without CHD at these levels, and number with CHD at these levels.  You can copy these data and paste them into the Data Window.

1, 1, 1, 58, 19
0, 1, 0, 275, 35
1, 0, 0, 37, 8
0, 0, 0, 168, 9

The results of the analysis would be:

Coefficients and Standard Errors...
Variable   Coeff. StdErr    p
1          1.3953 0.5187 0.0072
2          0.8653 0.3864 0.0251
3         -0.4498 0.6092 0.4603
Intercept -2.9267

Odds Ratios and 95% Confidence Intervals...
Variable    O.R.   Low -- High
1          4.0360 1.4601 11.1562
2          2.3758 1.1141 5.0661
3          0.6377 0.1932 2.1049

The interpretation would be that the interaction is not statistically significant (p-value for variable 3 = 0.4603) and could be removed from the model. 
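Because this model has one parameter for each of the four exposure combinations, its odds ratios can be related directly to the two stratified tables: the odds ratio for variable 1 is the catecholamine odds ratio among nonsmokers, and the odds ratio for variable 3 is the ratio of the catecholamine odds ratio in smokers to that in nonsmokers. The Python sketch below checks this arithmetic; the function name is hypothetical.

```python
def odds_ratio(with_exposed, without_exposed, with_unexposed, without_unexposed):
    """Cross-product odds ratio for one 2 x 2 table."""
    return (with_exposed * without_unexposed) / (without_exposed * with_unexposed)

# Counts taken from the two smoking strata above.
or_smokers = odds_ratio(19, 58, 35, 275)
or_nonsmokers = odds_ratio(8, 37, 9, 168)
interaction_ratio = or_smokers / or_nonsmokers

print(f"catecholamine OR, smokers:    {or_smokers:.4f}")
print(f"catecholamine OR, nonsmokers: {or_nonsmokers:.4f}")
print(f"ratio of the two:             {interaction_ratio:.4f}")
```

The nonsmoker odds ratio and the ratio of the two stratum-specific odds ratios agree with the values printed for variables 1 and 3.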

To determine whether smoking confounds the catecholamine->CHD association, two odds ratios are needed, a "crude" odds ratio from a logistic regression model with just catecholamine as a predictor of CHD which was 2.8615, and a logistic regression model with two predictors in the model, catecholamine and smoking.  The general format for the summary data is:

exposure variable 1 level, exposure variable 2 level, number without disease at this level, number with disease at this level

For this example, the number of data points is 4 and the number of predictor variables is 2; also check the summary data box.  The complete example data are shown below, with the variables being catecholamine category, smoking category, number without CHD at these levels, and number with CHD at these levels.  You can copy these data and paste them into the Data Window.

1, 1, 58, 19
0, 1, 275, 35
1, 0, 37, 8
0, 0, 168, 9

The results of the analysis would be:

Odds Ratios and 95% Confidence Intervals...
Variable   O.R.   Low -- High
1         2.9074 1.7079 4.9492
2         2.0000 1.1206 3.5695

The interpretation would be that individuals with an elevated catecholamine level ("Variable 1" in the above output) have 2.9074 times the odds of CHD of those with normal catecholamine levels, controlling for cigarette smoking.  Cigarette smokers ("Variable 2" in the above output) have twice the odds (2.0000) of CHD compared to nonsmokers, controlling for catecholamine (elevated vs. normal).  For the question of whether or not smoking confounds the catecholamine->CHD association, compare the crude odds ratio (2.8615) with the odds ratio adjusted for smoking (2.9074).  As a general rule, if these two differ by 10% or more, then confounding is present; if they differ by less than 10%, there is not an important amount of confounding.  (Note that some investigators may choose to define confounding differently, perhaps at a 5% difference.)  In this example, there is little evidence of confounding.  However, smoking does seem to be an important independent predictor of CHD when controlling for catecholamine.
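For readers who want to check the adjusted odds ratios outside the web program, the sketch below fits a logistic model to the four summary lines by Newton-Raphson in plain Python and then computes the crude-versus-adjusted change. This is only an illustration: the fitting method is a choice made here and not necessarily the algorithm the program uses, all function names are hypothetical, and expressing the change relative to the adjusted odds ratio is one common convention assumed for this example.

```python
import math

def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def score_and_information(rows, beta):
    """Score vector and information matrix for rows of (x1, ..., xk, n_without, n_with)."""
    size = len(beta)
    score = [0.0] * size
    info = [[0.0] * size for _ in range(size)]
    for *predictors, n_without, n_with in rows:
        x = [1.0] + [float(v) for v in predictors]  # leading 1 is the intercept
        total = n_without + n_with
        prob = 1.0 / (1.0 + math.exp(-sum(b * xi for b, xi in zip(beta, x))))
        for i in range(size):
            score[i] += x[i] * (n_with - total * prob)
            for j in range(size):
                info[i][j] += total * prob * (1.0 - prob) * x[i] * x[j]
    return score, info

def fit_summary_logistic(rows, iterations=25):
    """Return (coefficients, standard errors), intercept first."""
    size = len(rows[0]) - 1  # intercept plus one coefficient per predictor
    beta = [0.0] * size
    for _ in range(iterations):
        score, info = score_and_information(rows, beta)
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    _, info = score_and_information(rows, beta)
    std_errors = []
    for i in range(size):
        unit = [1.0 if j == i else 0.0 for j in range(size)]
        std_errors.append(math.sqrt(solve(info, unit)[i]))  # diagonal of the inverse
    return beta, std_errors

# catecholamine, smoking, number without CHD, number with CHD
adjusted_rows = [(1, 1, 58, 19), (0, 1, 275, 35), (1, 0, 37, 8), (0, 0, 168, 9)]
beta, std_errors = fit_summary_logistic(adjusted_rows)
adjusted_or = math.exp(beta[1])
smoking_or = math.exp(beta[2])

crude_or = 2.8615  # from the one-predictor model shown earlier
percent_change = abs(crude_or - adjusted_or) / adjusted_or * 100  # assumed convention

print(f"adjusted catecholamine OR: {adjusted_or:.4f}")
print(f"smoking OR:                {smoking_or:.4f}")
print(f"crude vs. adjusted change: {percent_change:.1f}%")
```

The change comes out well below the 10% rule of thumb, which matches the conclusion that there is little evidence of confounding by smoking.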


