## Logistic Regression

by John C. Pezzullo
Revised 2015-07-22: Apply fractional shifts for the first few iterations, to increase robustness for ill-conditioned data.

This page performs logistic regression, in which a dichotomous outcome is predicted by one or more variables. The program generates the coefficients of a prediction formula (and standard errors of estimate and significance levels), and odds ratios (with confidence intervals).

### Instructions:

1. Enter the number of data points (or, if entering summary data, the number of lines of data).

2. Enter the number of predictor variables:

3. Enter the confidence level (in percent).

4. If you're entering summary data, check the summary-data box.

5. Type or paste data in the window below.
Predictor variable(s) first, then outcome variable (1 if event occurred; 0 if it did not occur).
If summary data box checked (Step 4), enter outcome as 2 columns: # of non-occurrences, then # of occurrences.
Columns must be separated by commas or tabs.
See Kevin Sullivan's page for more examples of how to enter data.

6. Click the button; results will appear in the window below.

7. To print the results, copy (Ctrl-C) the contents of the results window and paste (Ctrl-V) them into a word processor or text editor, then print from that program. For best appearance, use a fixed-width font like Courier.

### Questions or Problems?

#### *** Not getting correct results, or getting blank results?

If you are getting blank output or an error message instead of numeric results, please verify the following:

• For each record or line of data, the data must be separated by a comma or tab; if there are just spaces between the data, you will get an error message or output with no calculated values.
• All data values must be numeric. Character data (such as "Y" or "Yes" or "+") will not work.
• The outcome variable must be coded as 1 or 0.
• There cannot be any blank lines in the data.
• All records must have values for every predictor variable.
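The checks above can be sketched as a small validation routine. This is a hypothetical Python helper (the page itself is implemented in JavaScript); the function name and messages are illustrative, not part of the actual page.

```python
# Hypothetical input checker mirroring the rules above; the function name
# and messages are illustrative, not taken from the page's JavaScript.

def check_data(text, n_predictors):
    """Return a list of problems found in comma/tab-delimited 0/1-outcome data."""
    problems = []
    for i, line in enumerate(text.strip().splitlines(), start=1):
        if not line.strip():
            problems.append(f"line {i}: blank line")
            continue
        # Fields must be separated by commas or tabs, not plain spaces
        fields = [f.strip() for f in line.replace("\t", ",").split(",")]
        if len(fields) != n_predictors + 1:
            problems.append(f"line {i}: expected {n_predictors + 1} fields, "
                            f"got {len(fields)}")
            continue
        try:
            values = [float(f) for f in fields]
        except ValueError:
            problems.append(f"line {i}: non-numeric value")
            continue
        if values[-1] not in (0.0, 1.0):
            problems.append(f"line {i}: outcome must be coded 1 or 0")
    return problems
```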

#### *** One (or more) of my coefficients came out very large (and the standard error is even larger!). Why did this happen?

This is probably due to what is called the "perfect predictor" or "complete separation" problem. It occurs when one of the predictor variables is perfectly divided into two distinct ranges for the two outcomes. For example, if you had an independent variable like Age, and everyone above age 50 had the outcome event while everyone 50 and below did not, then the logistic algorithm will not converge (the regression coefficient for Age will take off toward infinity). The same thing can happen with categorical predictors. It gets even more insidious when there's more than one predictor: none of the variables by themselves may look like "perfect predictors", but some subset of them taken together might form a pattern in n-dimensional space that can be sliced into two regions, where everyone in one region had outcome=1 and everyone in the other had outcome=0. This isn't a flaw in the web page; the logistic model is simply not appropriate for such data. The true relationship is a "step function", not the smooth "S-shaped" function of the logistic model.
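The divergence can be demonstrated with a small sketch: a one-variable model with no intercept, fit by Newton's method on perfectly separated made-up data. This is illustrative Python, not the page's actual code; each Newton step pushes the coefficient further toward infinity instead of converging.

```python
import math

# Illustrative sketch (not the page's code): everyone with x > 0 had the
# event, everyone with x < 0 did not -- a perfectly separated predictor.
x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

b = 0.0          # single slope, no intercept, for simplicity
history = []
for _ in range(20):
    p = [1 / (1 + math.exp(-b * xi)) for xi in x]
    grad = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
    hess = sum(pi * (1 - pi) * xi * xi for pi, xi in zip(p, x))
    b += grad / hess   # Newton step
    history.append(b)
# b grows at every step: the coefficient is "taking off toward infinity"
```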

#### *** How do I copy and paste data?

Copy data: In most programs, highlight the data you want to copy, then go to Edit->Copy.

Paste data: Open this logistic regression program, place the cursor in the data window, and highlight the example data; then, in Windows, simultaneously press the Ctrl and V keys (Mac users press the Command and V keys).

#### *** Can I copy and paste from Excel?

Yes. Highlight the columns with the data, Edit->Copy the data, and paste into the Logistic data window. Note that when you paste data from Excel into the data window, the columns will be separated by tabs. You cannot see a tab in the data window, but you can usually tell the difference between a tab and blank spaces by placing the cursor in a line of data and moving it to the right one space at a time: a tab makes the cursor jump many spaces.

### Background Info (just what is logistic regression, anyway?):

Ordinary regression deals with finding a function that relates a continuous outcome variable (dependent variable y) to one or more predictors (independent variables x1, x2, etc.). Simple linear regression assumes a function of the form:
y = c0 + c1 * x1 + c2 * x2 +...
and finds the values of c0, c1, c2, etc. (c0 is called the "intercept" or "constant term").
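As a quick illustration of this form, here is a minimal simple-linear-regression fit (one predictor) using the standard closed-form least-squares formulas, on made-up points that lie exactly on the line y = 2 + 3 * x1:

```python
# Closed-form simple linear regression, illustrating the form above with
# made-up points that lie exactly on the line y = 2 + 3 * x1.
x1 = [0.0, 1.0, 2.0, 3.0]
y = [2 + 3 * xi for xi in x1]

mx = sum(x1) / len(x1)
my = sum(y) / len(y)
c1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x1, y))
      / sum((xi - mx) ** 2 for xi in x1))   # slope
c0 = my - c1 * mx                           # the "intercept" or "constant term"
```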

Logistic regression is a variation of ordinary regression, useful when the observed outcome is restricted to two values, which usually represent the occurrence or non-occurrence of some outcome event (usually coded as 1 or 0, respectively). It produces a formula that predicts the probability of the occurrence as a function of the independent variables.

Logistic regression fits a special s-shaped curve by taking the linear regression (above), which could produce any y-value between minus infinity and plus infinity, and transforming it with the function:
p = Exp(y) / ( 1 + Exp(y) )
which produces p-values between 0 (as y approaches minus infinity) and 1 (as y approaches plus infinity). This now becomes a special kind of non-linear regression, which is what this page performs.
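The transform can be written directly; the sketch below uses the algebraically equivalent form 1 / (1 + exp(-y)), which avoids overflow for large positive y.

```python
import math

# The logistic transform above: p = exp(y) / (1 + exp(y)), written in the
# equivalent form 1 / (1 + exp(-y)). p is 0.5 at y = 0, and approaches
# 0 and 1 as y goes toward minus and plus infinity respectively.
def logistic(y):
    return 1 / (1 + math.exp(-y))
```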

Logistic regression also produces Odds Ratios (O.R.) associated with each predictor. The odds of an event is defined as the probability of the outcome event occurring divided by the probability of the event not occurring. The odds ratio for a predictor tells the relative amount by which the odds of the outcome increase (O.R. greater than 1.0) or decrease (O.R. less than 1.0) when the value of that predictor is increased by 1.0 units.
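These definitions can be checked numerically. In the sketch below (coefficient and predictor values are made up for illustration), raising a predictor by one unit multiplies the odds of the outcome by exp(b):

```python
import math

# Odds and the odds ratio as defined above. All numbers are illustrative.
def logistic(y):
    return 1 / (1 + math.exp(-y))

def odds(p):
    return p / (1 - p)   # probability of occurring / probability of not

b = math.log(2)            # example logistic coefficient for a predictor
odds_ratio = math.exp(b)   # = 2: each unit increase doubles the odds

# Check against the definition: ratio of the odds at x+1 versus at x
c0, x = -1.0, 0.7          # arbitrary intercept and predictor value
ratio = odds(logistic(c0 + b * (x + 1))) / odds(logistic(c0 + b * x))
```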

### Techie-stuff (for those who might be interested):

This page contains a straightforward JavaScript implementation of a standard iterative method to maximize the Log Likelihood Function (LLF), defined as the sum of the logarithms of the predicted probabilities of occurrence for those cases where the event occurred and the logarithms of the predicted probabilities of non-occurrence for those cases where the event did not occur.
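A direct transcription of that definition, assuming y holds the 0/1 outcomes and p the corresponding predicted probabilities of occurrence:

```python
import math

# The LLF exactly as defined above: log(p) for cases where the event
# occurred (y=1), plus log(1 - p) for cases where it did not (y=0).
def log_likelihood(y, p):
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi)
               for yi, pi in zip(y, p))
```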

Maximization is by Newton's method, with a very simple elimination algorithm to invert and solve the simultaneous equations. Central-limit estimates of parameter standard errors are obtained from the diagonal terms of the inverse matrix. Odds Ratios and their confidence limits are obtained by exponentiating the parameters and their lower and upper confidence limits, approximated by +/- 1.96 standard errors (for 95% limits).
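A compact sketch of this scheme in Python (not the page's actual JavaScript): Newton's method on the log likelihood, with a simple Gaussian-elimination solver for the Newton system. Standard errors would come from the square roots of the diagonal of the inverse of the final Hessian; that step is omitted here to keep the sketch short.

```python
import math

def solve(A, b):
    """Solve A x = b by simple Gaussian elimination (A and b are modified)."""
    n = len(b)
    for i in range(n):
        for j in range(i + 1, n):
            f = A[j][i] / A[i][i]
            for k in range(i, n):
                A[j][k] -= f * A[i][k]
            b[j] -= f * b[i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][k] * x[k] for k in range(i + 1, n))) / A[i][i]
    return x

def fit_logistic(X, y, iters=25):
    """X: rows of predictor values (no intercept column); y: 0/1 outcomes."""
    rows = [[1.0] + list(r) for r in X]   # prepend the intercept column
    m = len(rows[0])
    beta = [0.0] * m
    for _ in range(iters):
        p = [1 / (1 + math.exp(-sum(b * v for b, v in zip(beta, r))))
             for r in rows]
        # Gradient and Hessian of the log likelihood
        grad = [sum((yi - pi) * r[j] for yi, pi, r in zip(y, p, rows))
                for j in range(m)]
        hess = [[sum(pi * (1 - pi) * r[j] * r[k] for pi, r in zip(p, rows))
                 for k in range(m)] for j in range(m)]
        step = solve(hess, grad)          # Newton step
        beta = [b + s for b, s in zip(beta, step)]
    return beta
```

At convergence the score equations hold, so the fitted probabilities sum to the observed number of events.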

No special convergence-acceleration techniques are used. For improved precision, the independent variables are temporarily converted to "standard scores": ( value - Mean ) / StdDev. The Null Model is used as the starting guess for the iterations -- all parameter coefficients are zero, and the intercept is the logarithm of the ratio of the number of cases with y=1 to the number with y=0. The quantity -2*Ln(Likelihood) is displayed for the null model, for each step of the iteration, and for the final (converged) model. Convergence is not guaranteed, but this page should work properly with most practical problems that arise in real-world situations.
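The Null Model starting point has a simple closed form, sketched below with made-up outcome counts: the intercept is ln(n1/n0), every case gets the same predicted probability, and -2*Ln(Likelihood) follows directly.

```python
import math

# Null-model starting values as described above (outcomes are made up):
# all slopes zero, intercept = ln(n1 / n0).
y = [1, 1, 1, 0, 0, 0, 0, 0]            # n1 = 3 events, n0 = 5 non-events
n1, n0 = y.count(1), y.count(0)
intercept0 = math.log(n1 / n0)           # null-model intercept
p0 = n1 / (n1 + n0)                      # every case gets the same p
minus2ll = -2 * (n1 * math.log(p0) + n0 * math.log(1 - p0))
```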

This implementation has no predefined limits for the number of independent variables or cases. The actual limits are probably dependent on your web browser's available memory and other browser-specific restrictions.

Reference: Hosmer, D.W., and Lemeshow, S. (1989). Applied Logistic Regression. John Wiley & Sons, New York.