Understanding the Results of an Analysis
NLREG prints a variety of statistics at the end of each
analysis. For each variable, NLREG lists the minimum value, the maximum value,
the mean value, and the standard deviation. You should confirm that these
values are within the ranges you expect.
For each parameter, NLREG
displays the initial parameter estimate (which you specified on the PARAMETER
statement, or 1 by default), the final (maximum likelihood) estimate, the
standard error of the estimated parameter value, the "t" statistic
comparing the estimated parameter value with zero, and the significance of the
t statistic. Nine significant digits
are displayed for the parameter estimates.
If you need to determine the parameters to greater precision, use the
POUTPUT statement.
The final estimated parameter values are the results of the
analysis. By substituting these values in the equation you specified to be
fitted to the data, you will have a function that can be used to predict the
value of the dependent variable based on a set of values for the independent
variables. For example, if the equation
being fitted is
y = p0 + p1*x
and the final estimates are 1.5 for p0 and 3 for p1, then
the equation
y = 1.5 + 3*x
is the best equation of this form that will predict the
value of y based on the value of x.
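To use the fitted equation outside of NLREG, you simply substitute the final estimates into it. A minimal Python sketch using the values from this example:
    # Evaluate the fitted equation y = p0 + p1*x using the final
    # parameter estimates from the example above.
    p0, p1 = 1.5, 3.0

    def predict(x):
        """Predict y from x with the fitted linear equation."""
        return p0 + p1 * x

    print(predict(2.0))   # 1.5 + 3*2 = 7.5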
The "t'' statistic is
computed by dividing the estimated value of the parameter by its standard
error. This statistic is a measure of
the likelihood that the actual value of the parameter is not zero. The larger
the absolute value of t, the less likely that the actual value of the parameter
could be zero.
The "Prob(t)'' value is
the probability of obtaining the estimated value of the parameter if the actual
parameter value is zero. The smaller
the value of Prob(t), the more significant the parameter and the less likely
that the actual parameter value is zero.
For example, assume the estimated value of a parameter is 1.0 and its
standard error is 0.7. Then the t value
would be 1.43 (1.0/0.7). If the
computed Prob(t) value was 0.05 then this indicates that there is only a 0.05
(5%) chance that the actual value of the parameter could be zero. If Prob(t) was 0.001 this indicates there is
only 1 chance in 1000 that the parameter could be zero. If Prob(t) was 0.92 this indicates that
there is a 92% probability that the actual value of the parameter could be
zero; this implies that the term of the regression equation containing the
parameter can be eliminated without significantly affecting the accuracy of the
regression.
One thing that can cause Prob(t) to be 1.00 (or near 1.00)
is having redundant parameters. If at
the end of an analysis several parameters have Prob(t) values of 1.00, check
the function carefully to see if one or more of the parameters can be removed.
Also try using a DOUBLE statement to set one or more of the parameters to a
reasonable fixed value; if the other parameters suddenly become significant
(i.e., Prob(t) much less than 1.00) then the parameters are mutually dependent
and one or more should be removed.
The t statistic probability is
computed using a two-sided test. The
CONFIDENCE statement can be used to cause NLREG to print confidence intervals
for parameter values. The SQUARE.NLR
example regression includes an extraneous parameter (p0) whose estimated value
is much smaller than its standard error; the Prob(t) value is 0.99982
indicating that there is a high probability that the value is zero.
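As a rough illustration (this is not NLREG's own code), the t statistic and a two-sided Prob(t) can be reproduced in Python with scipy; the estimate and standard error are taken from the example above, and the 20 degrees of freedom are an assumed value for the sketch:
    from scipy import stats

    estimate = 1.0    # estimated parameter value (from the example above)
    std_error = 0.7   # standard error of the estimate
    dof = 20          # degrees of freedom; assumed here (observations minus parameters)

    t = estimate / std_error                 # 1.43 in the example
    prob_t = 2.0 * stats.t.sf(abs(t), dof)   # two-sided tail probability
    print(t, prob_t)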
In addition to the variable and parameter values, NLREG
displays several statistics that indicate how well the equation fits the
data. The "Final sum of squared
deviations" is the sum of the squared differences between the actual value of
the dependent variable for each observation and the value predicted by the
function, using the final parameter estimates.
The "Average deviation''
is the average over all observations of the absolute value of the difference
between the actual value of the dependent variable and its predicted value.
The "Maximum deviation
for any observation" is the maximum difference (ignoring sign) between the
actual and predicted value of the dependent variable for any observation.
The "Proportion of
variance explained (R2)" indicates how much better the function predicts the
dependent variable than just using the mean value of the dependent
variable. This is also known as the
"coefficient of multiple determination.''
It is computed as follows: Suppose that we did not fit an equation to
the data and ignored all information about the independent variables in each
observation. Then, the best prediction
for the dependent variable value for any observation would be the mean value of
the dependent variable over all observations.
The "variance'' is the sum of the squared differences between the
mean value and the value of the dependent variable for each observation. Now, if we use our fitted function to
predict the value of the dependent variable, rather than using the mean value,
a second kind of variance can be computed by taking the sum of the squared
difference between the value of the dependent variable predicted by the function
and the actual value. Hopefully, the
variance computed by using the values predicted by the function is better
(i.e., a smaller value) than the variance computed using the mean value. The "Proportion of variance explained"
is computed as 1 – (variance using predicted value / variance using mean). If the function perfectly predicts the
observed data, the value of this statistic will be 1.00 (100%). If the function does no better at
predicting the dependent variable than using the mean, the value will be 0.00.
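The following Python sketch shows one way these goodness-of-fit statistics can be computed from the actual and predicted values; the short arrays are made-up illustration data, not NLREG output:
    import numpy as np

    # Made-up illustration data: actual dependent-variable values and
    # the values predicted by the fitted function.
    actual    = np.array([2.1, 4.0, 6.2, 7.9, 10.1])
    predicted = np.array([2.0, 4.1, 6.0, 8.0, 10.0])

    residuals = actual - predicted
    sse = np.sum(residuals ** 2)            # final sum of squared deviations
    avg_dev = np.mean(np.abs(residuals))    # average deviation
    max_dev = np.max(np.abs(residuals))     # maximum deviation for any observation

    # Proportion of variance explained (R2): compare the squared deviations
    # from the fitted function with the squared deviations from the mean.
    ss_mean = np.sum((actual - actual.mean()) ** 2)
    r2 = 1.0 - sse / ss_mean

    print(sse, avg_dev, max_dev, r2)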
The "adjusted coefficient
of multiple determination (Ra2)" is an R2 statistic
adjusted for the number of parameters in the equation and the number of data
observations. It is a more conservative
estimate of the percent of variance explained, especially when the sample size
is small compared to the number of parameters.
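A common form of this adjustment (the exact formula NLREG uses may differ in detail) is
Ra2 = 1 - (1 - R2) * (n - 1) / (n - p)
where n is the number of observations and p is the number of parameters in the equation.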
The "Durbin-Watson test
for autocorrelation" is a statistic that indicates the likelihood that the
deviation (error) values for the regression have a first-order autoregression
component. The regression models assume
that the error deviations are uncorrelated.
In business and economics,
many regression applications involve time series data. If a non-periodic function, such as a
straight line, is fitted to periodic data, the deviations have a periodic form
and are positively correlated over time; these deviations are said to be
"autocorrelated'' or "serially correlated.'' Autocorrelated deviations may also indicate
that the form (shape) of the function being fitted is inappropriate for the
data values (e.g., a linear equation fitted to quadratic data).
If the deviations are autocorrelated, there may be a number
of consequences for the computed results: 1) The estimated regression
coefficients no longer have the minimum variance property; 2) the mean square
error (MSE) may seriously underestimate the variance of the error terms; 3) the
computed standard error of the estimated parameter values may underestimate the
true standard error, in which case the t values and confidence intervals may be
incorrect. Note that if an appropriate
periodic function is fitted to periodic data, the deviations from the
regression will be uncorrelated because the cycle of the data values is
accounted for by the fitted function.
Small values of the Durbin-Watson statistic indicate the
presence of autocorrelation. Consult
significance tables in a good statistics book for exact interpretations;
however, a value less than 0.80 usually indicates that autocorrelation is
likely. If the Durbin-Watson statistic
indicates that the residual values are autocorrelated, it is recommended that
you use the RPLOT and/or NPLOT statements to display a plot of the residual
values.
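For reference, the Durbin-Watson statistic is the sum of the squared differences between successive residuals divided by the sum of the squared residuals. A minimal Python sketch with made-up residual values:
    import numpy as np

    # Made-up residuals, listed in observation order (order matters
    # when testing for autocorrelation).
    residuals = np.array([0.5, 0.6, 0.4, -0.2, -0.5, -0.4, 0.1, 0.3])

    # Durbin-Watson: values near 2 indicate no first-order autocorrelation;
    # small values indicate positive autocorrelation.
    dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
    print(dw)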
If the data has a regular,
periodic component you can try including a sin term in your function. The TREND.NLR example fits a function with a
sin term to data that has a linear growth with a superimposed sin
component. With the sin term the
function has a residual value of 29.39 and a Durbin-Watson value of 2.001;
without the sin term (i.e., fitting only a linear function) the residual value
is 119.16 and the Durbin-Watson value is 0.624 indicating strong
autocorrelation. The general form of a
sin term is
amplitude * sin(2*pi*(x-phase)/period)
where amplitude is a parameter that
determines the magnitude of the sin component, period determines the period of the oscillation, and phase determines the phase relative to
the starting value. If you know the
period (e.g., 12 for monthly data with an annual cycle) you should specify it
rather than having NLREG attempt to determine it.
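A comparable model can be sketched outside of NLREG with Python and scipy's curve_fit; the data below is synthetic and the parameter names simply mirror the general form above:
    import numpy as np
    from scipy.optimize import curve_fit

    # Synthetic data: linear growth with a superimposed sine component
    # (period 12, e.g. monthly data with an annual cycle).
    x = np.arange(48, dtype=float)
    y = 2.0 + 0.5 * x + 3.0 * np.sin(2 * np.pi * (x - 2.0) / 12.0)

    # Model: linear trend plus a sin term with a known period of 12.
    def model(x, intercept, slope, amplitude, phase):
        return intercept + slope * x + amplitude * np.sin(2 * np.pi * (x - phase) / 12.0)

    params, _ = curve_fit(model, x, y, p0=[1.0, 1.0, 1.0, 0.0])
    print(params)   # estimates of intercept, slope, amplitude, phase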
If an NPLOT statement is used
to produce a normal probability plot of the residuals, the correlation between
the residuals and their expected values (assuming they are normally
distributed) is printed in the listing.
If the residuals are normally distributed, the correlation should be
close to 1.00. A correlation less than
0.94 suggests that the residuals are not normally distributed.
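A similar correlation can be computed outside of NLREG with scipy's probplot, which pairs the ordered residuals with the quantiles expected under a normal distribution; the residual values below are made up for illustration:
    import numpy as np
    from scipy import stats

    # Made-up residuals from a regression.
    residuals = np.array([-0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.6, 0.9])

    # probplot fits a line to the ordered residuals versus their expected
    # normal quantiles; r is the correlation of that fit.
    (osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
    print(r)   # close to 1.00 if the residuals look normally distributed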
An "Analysis of Variance''
table provides statistics about the overall significance of the model being
fitted.
The "F value'' and
"Prob(F)'' statistics test the overall significance of the regression
model. Specifically, they test the null
hypothesis that all of the regression
coefficients are equal to zero. This
tests the full model against a model with no variables and with the estimate of
the dependent variable being the mean of the values of the dependent variable. The F value is the ratio of the mean
regression sum of squares to the mean error sum of squares. Its value will range from zero to an
arbitrarily large number.
The value of Prob(F) is the probability that the
null hypothesis for the full model is true (i.e., that all of the regression
coefficients are zero). For example, if
Prob(F) has a value of 0.01000 then there is 1 chance in 100 that all of the
regression parameters are zero. This
low a value would imply that at least some of the regression parameters are
nonzero and that the regression equation does have some validity in fitting the
data (i.e., the independent variables are not purely random with respect to the
dependent variable).
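As an illustration (not NLREG output), the F value and Prob(F) can be computed from the regression and error sums of squares and their degrees of freedom; the numbers below are assumed values:
    from scipy import stats

    # Illustration values: regression and error sums of squares with
    # their degrees of freedom.
    ss_regression, df_regression = 250.0, 2    # e.g. 2 fitted terms besides the mean
    ss_error, df_error = 50.0, 17              # e.g. 20 observations - 3 parameters

    ms_regression = ss_regression / df_regression   # mean regression sum of squares
    ms_error = ss_error / df_error                  # mean error sum of squares

    f_value = ms_regression / ms_error
    prob_f = stats.f.sf(f_value, df_regression, df_error)   # upper-tail probability
    print(f_value, prob_f)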
The CORRELATE statement can be
used to cause NLREG to print a correlation matrix. A "correlation coefficient" is a value that indicates
whether there is a linear relationship between two variables. The absolute value of the correlation
coefficient will be in the range 0 to 1.
A value of 0 indicates that there is no relationship whereas a value of
1 indicates that there is a perfect correlation and the two variables vary
together. The sign of the correlation
coefficient will be negative if there is an inverse relationship between the
variables (i.e., as one increases the other decreases).
For example, consider a study measuring the height and
weight of a group of individuals. The
correlation coefficient between height and weight will likely have a positive
value somewhat less than one because tall people tend to weigh more than short
people. A study comparing number of
cigarettes smoked with age at death will probably have a negative correlation
value.
A correlation matrix shows the correlation between each pair
of variables. The diagonal of the
matrix has values of 1.00 because a variable always has a perfect correlation
with itself. The matrix is symmetric
about the diagonal because X correlated with Y is the same as Y correlated with
X.
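A correlation matrix of this kind can be sketched in Python with numpy; the height, weight, and age values below are made up for illustration:
    import numpy as np

    # Made-up data: each array holds one variable's values across the
    # same group of individuals.
    height = np.array([160.0, 172.0, 168.0, 181.0, 175.0])
    weight = np.array([ 55.0,  70.0,  65.0,  85.0,  74.0])
    age    = np.array([ 23.0,  31.0,  45.0,  29.0,  52.0])

    # np.corrcoef treats each row as a variable and returns the symmetric
    # correlation matrix; the diagonal is 1.00 because a variable always
    # correlates perfectly with itself.
    corr = np.corrcoef(np.vstack([height, weight, age]))
    print(np.round(corr, 2))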
Problems occur in regression analysis when a function is
specified that has multiple independent variables that are highly
correlated. The common interpretation
of the computed regression parameters as measuring the change in the expected
value of the dependent variable when the corresponding independent variable is
varied while all other independent variables are held constant is not fully
applicable when a high degree of correlation exists. This is because with highly correlated independent
variables it is difficult to attribute changes in the dependent variable to one
of the independent variables rather than another. The following are effects of fitting a function with highly
correlated independent variables:
1. Large changes in the estimated regression parameters may occur when a variable is added or deleted, or when an observation is added or deleted.
2. Individual tests on the regression parameters may show the parameters to be nonsignificant.
3. Regression parameters may have the opposite algebraic sign to that expected from theoretical or practical considerations.
4. The confidence intervals for important regression parameters may be much wider than would otherwise be the case.
The solution to these problems may be to select the most significant of the correlated variables and use only it in the function.
Note: the correlation coefficients indicate the degree of linear association between variables.
Variables may be highly related in a nonlinear fashion and still have a
correlation coefficient near 0.
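A small illustration of this point: if x is symmetric about zero and y is the square of x, then y is completely determined by x, yet the linear correlation coefficient is essentially zero:
    import numpy as np

    x = np.linspace(-3.0, 3.0, 101)
    y = x ** 2   # y depends entirely on x, but not linearly

    # The linear correlation coefficient is essentially zero because the
    # relationship is symmetric about x = 0.
    print(np.corrcoef(x, y)[0, 1])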