Education 257 SOLUTIONS HW3 3/08/05

1.
This problem is concerned with using hospital stay (X) to estimate
prognosis (Y).
15 ROWS READ
ROW C1 C2
1 2 54
2 5 50
3 7 45
4 10 37
. . .
****** Straightline Model ******
MTB > name c1 'Days'
MTB > name c2 'PIndex'
MTB > plot c2 c1
[Character plot of PIndex (y-axis, 0 to 45+) vs Days (x-axis, 0 to 60): a decreasing, curved band of points; spacing lost in transcription.]
MTB > name c22 'Resids'
MTB > regress c2 1 c1 c20 c21;
SUBC> residuals c22.
The regression equation is
PIndex = 46.5 - 0.753 Days
Predictor Coef Stdev t-ratio
Constant 46.460 2.762 16.82
Days -0.75251 0.07502 -10.03
s = 5.891 R-sq = 88.6% R-sq(adj) = 87.7%
Analysis of Variance
SOURCE DF SS MS
Regression 1 3492.1 3492.1
Error 13 451.2 34.7
Total 14 3943.3
MTB > plot c22 c1
[Character plot of Resids vs Days (0 to 60): the residuals trace a systematic curved pattern rather than a random band.]
Straightline Model Prediction
SUBC> predict 10;
SUBC> predict 60.
Fit Stdev.Fit 95% C.I. 95% P.I.
38.94 2.18 ( 34.23, 43.64) ( 25.36, 52.51)
1.31 2.67 ( -4.46, 7.08) ( -12.67, 15.29)
From the plot of Prognosis Index on Days and the plot of the residuals, a
straight-line model is not your best bet (even though the R-squared for
the straight-line fit is healthy).
TRANSFORMATION TO STRAIGHTEN
From the plot of Prognosis Index vs. Days it appears that a log transformation
or a square-root transformation on X may "straighten" the relationship
between the two. Recall that the Rule of the Bulge would indicate X down or Y down.
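As a sanity check on the transformed prediction points used below, here is the same arithmetic the let commands perform (Python standing in for Minitab):

```python
# Hospital-stay values of 10 and 60 days, expressed in the log and
# square-root metrics used in the transformed regressions below.
import math

ln_points = (math.log(10), math.log(60))        # natural logs, as loge() does
sqrt_points = (math.sqrt(10), math.sqrt(60))

print(tuple(round(v, 4) for v in ln_points))    # (2.3026, 4.0943)
print(tuple(round(v, 4) for v in sqrt_points))  # (3.1623, 7.746)
```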
****** Log Transformation ******
MTB > let c4 = loge(c1)
Here, for use in the predict subcommand, we need to express our
stated X values (hospital stay) in the new metric.
MTB > let k1 = loge(10)
MTB > let k2 = loge(60)
MTB > name c4 'lnDays'
MTB > plot c2 c4
[Character plot of PIndex vs lnDays (0.70 to 4.20): a decreasing, nearly linear band of points.]
MTB > regress c2 1 c4 c20 c21;
SUBC> residuals c22.
The regression equation is
PIndex = 72.3 - 16.0 lnDays
Predictor Coef Stdev t-ratio
Constant 72.283 2.704 26.73
lnDays -15.9662 0.8386 -19.04
s = 3.241 R-sq = 96.5% R-sq(adj) = 96.3%
Analysis of Variance
SOURCE DF SS MS
Regression 1 3806.8 3806.8
Error 13 136.5 10.5
Total 14 3943.3
Unusual Observations
Obs. lnDays PIndex Fit Stdev.Fit Residual St.Resid
1 0.69 54.000 61.216 2.159 -7.216 -2.99RX
R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.
MTB > plot c22 c4
[Character plot of Resids vs lnDays (0.70 to 4.20): one large negative residual near -7 (the flagged observation); some pattern remains.]
SUBC> predict k1;
SUBC> predict k2.
Fit Stdev.Fit 95% C.I. 95% P.I.
35.520 1.054 ( 33.243, 37.796) ( 28.156, 42.883)
6.912 1.202 ( 4.315, 9.509) ( -0.557, 14.381)
The plot is closer to a straight line, but the residual plot still
shows a pattern. Nonetheless this model works better than the straight-line
fit to the original data.
****** Square-root Transformation ******
Just for illustration we'll also show you the same output for a square-root
transformation of hospital stay (also a reasonable choice).
MTB > let c5 = sqrt(c1)
MTB > let k3 = sqrt(10)
MTB > let k4 = sqrt(60)
MTB > name c5 'Days^.5'
MTB > plot c2 c5
[Character plot of PIndex vs Days^.5 (1.5 to 7.5): a decreasing, nearly linear band of points.]
MTB > regress c2 1 c5 c20 c21;
SUBC> residuals c22.
The regression equation is
PIndex = 63.2 - 7.75 Days^.5
Predictor Coef Stdev t-ratio
Constant 63.225 2.271 27.84
Days^.5 -7.7481 0.4097 -18.91
s = 3.262 R-sq = 96.5% R-sq(adj) = 96.2%
Analysis of Variance
SOURCE DF SS MS
Regression 1 3805.0 3805.0
Error 13 138.3 10.6
Total 14 3943.3
MTB > plot c22 c5
[Character plot of Resids vs Days^.5 (1.5 to 7.5): a curvilinear pattern remains.]
SUBC> predict k3;
SUBC> predict k4.
Fit Stdev.Fit 95% C.I. 95% P.I.
38.723 1.171 ( 36.193, 41.254) ( 31.235, 46.212)
3.209 1.357 ( 0.276, 6.141) ( -4.425, 10.842)
The residual plot still shows a curvilinear pattern.
======================================
b.
Try adding in a quadratic term in Days:
******** Quadratic Model ******
MTB > let c3 = c1*c1
MTB > name c3 'Days^2'
MTB > regress c2 2 c1 c3 c20 c21;
SUBC> residuals c22.
The regression equation is
PIndex = 55.8 - 1.71 Days + 0.0148 Days^2
Predictor Coef Stdev t-ratio
Constant 55.822 1.649 33.85
Days -1.7103 0.1248 -13.70
Days^2 0.014807 0.001868 7.93
s = 2.455 R-sq = 98.2% R-sq(adj) = 97.9%
Analysis of Variance
SOURCE DF SS MS
Regression 2 3871.0 1935.5
Error 12 72.3 6.0
Total 14 3943.3
SOURCE DF SEQ SS
Days 1 3492.1
Days^2 1 378.9
MTB > plot c22 c1

[Character plot of Resids vs Days (0 to 60) for the quadratic fit: no obvious systematic pattern remains.]
To obtain predictions for 10 and 60 days, use the PREDICT subcommand in regress.
Note that here you need to enter both x and x-squared for predict to work.
SUBC> predict 10 100;
SUBC> predict 60 3600.
Fit Stdev.Fit 95% C.I. 95% P.I.
40.200 0.921 ( 38.194, 42.207) ( 34.485, 45.915)
6.511 1.292 ( 3.695, 9.328) ( 0.465, 12.558)
Examination of the residuals and the squared multiple correlation shows
that this model fits the data much better.
Compare the predictions and interval estimates with those from the
straight-line fit to the original data.
The interval estimates for the quadratic model are about one half the
width of those from the straight-line model.
Moreover, for an extreme X value (e.g. 60),
the point estimate from the straight-line model is too small, as the model
does not account for the curvature in the relationship.
Fitting the quadratic seems to work slightly better than the transformations.
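The comparison at Days = 60 can be made concrete with the fitted equations above (a sketch in Python; coefficients copied from the Minitab output):

```python
# Point predictions from the straight-line and quadratic fits reported above.
def straight_line(days):
    return 46.46 - 0.75251 * days   # PIndex = 46.46 - 0.75251*Days

def quadratic(days):
    return 55.822 - 1.7103 * days + 0.014807 * days ** 2

print(round(straight_line(60), 2))  # 1.31 -- too small at the extreme
print(round(quadratic(60), 2))      # 6.51 -- accounts for the curvature
```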

2.
Bodyfat data revisited
I redid the analyses for this, which are fully
available from the text example or from bodyfat.out.
Just to take up more bytes, we recreate the relevant parts here.
MTB > read 'bodyfat.dat' c1-c4
20 ROWS READ
ROW C1 C2 C3 C4
1 19.5 43.1 29.1 11.9
2 24.7 49.8 28.2 22.8
3 30.7 51.9 37.0 18.7
4 29.8 54.3 31.1 20.1
. . .
MTB > name c1 'triceps' c2 'thigh' c3 'midarm' c4 'bodyfat'
To find the single best predictor of bodyfat, look at the correlations
of each predictor separately with bodyfat:
MTB > corr c1-c4
triceps thigh midarm
thigh 0.924
midarm 0.458 0.085
bodyfat 0.843 0.878 0.142
The correlations show that thigh circumference is the best single
predictor (i.e., the regression of bodyfat on thigh will have a higher
R-squared than will the regression of bodyfat on either of the remaining
predictors). Triceps skinfold thickness is a close second.
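This ranking can be checked numerically: for a single predictor, R-sq equals the squared simple correlation, so ordering predictors by |r| orders them by R-sq (correlations copied from the output above):

```python
# Squared correlations with bodyfat, from the corr output above.
corr_with_bodyfat = {"triceps": 0.843, "thigh": 0.878, "midarm": 0.142}

r_squared = {x: r ** 2 for x, r in corr_with_bodyfat.items()}
best = max(r_squared, key=r_squared.get)
print(best, round(r_squared[best], 3))  # thigh 0.771 -- matches R-sq = 77.1%
```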
Regression of bodyfat on thigh:
MTB > regress 'bodyfat' 1 'thigh'
The regression equation is
bodyfat = -23.6 + 0.857 thigh
Predictor Coef Stdev t-ratio p
Constant -23.634 5.657 -4.18 0.001
thigh 0.8565 0.1100 7.79 0.000
s = 2.510 R-sq = 77.1% R-sq(adj) = 75.8%
The regression coefficient for thigh in the single-predictor equation is
0.8565, with a standard error of 0.1100. The corresponding t-statistic
is 7.79 (which the p-value of 0.000 shows is significant at any Type I
error rate we might choose).
Here's the regression of bodyfat on triceps:
MTB > regress 'bodyfat' 1 'triceps'
The regression equation is
bodyfat = -1.50 + 0.857 triceps
Predictor Coef Stdev t-ratio p
Constant -1.496 3.319 -0.45 0.658
triceps 0.8572 0.1288 6.66 0.000
s = 2.820 R-sq = 71.1% R-sq(adj) = 69.5%
Triceps also looks like a good predictor when it is used by itself.
Regression using both triceps and thigh as predictors:
MTB > regress 'bodyfat' 2 'triceps' 'thigh'
The regression equation is
bodyfat = -19.2 + 0.222 triceps + 0.659 thigh
Predictor Coef Stdev t-ratio p
Constant -19.174 8.361 -2.29 0.035
triceps 0.2224 0.3034 0.73 0.474
thigh 0.6594 0.2912 2.26 0.037
s = 2.543 R-sq = 77.8% R-sq(adj) = 75.2%
Now the coefficient for thigh is 0.6594, which is fairly close to the
value of 0.8565 from the one-predictor equation. The standard error has
increased from 0.1100 to 0.2912 (the variance increased by a factor of 7),
resulting in a decrease in the t-statistic from 7.79 to 2.26. Thus the
t-statistic from the 2-predictor equation is almost 3.5 times smaller
than that from the 1-predictor equation. The p-value reveals that the
coefficient for thigh would not be considered significantly different
from 0 if we used a familywise Type I error rate of .05.
The coefficient for triceps changes substantially when we move from the
1-predictor to the 2-predictor equation. It decreases from 0.8572 to
0.2224, and its standard error increases from 0.1288 to 0.3034 (the
variance increases by a factor of about 5.5). The t-statistic decreases
from 6.66 to 0.73, so the new t-statistic is 9 times smaller than that
from the single-predictor equation. Triceps becomes nonsignificant
(p=0.474).
Here's the regression using all three predictors.
MTB > regress 'bodyfat' 3 'triceps' 'thigh' 'midarm'
* NOTE * triceps is highly correlated with other predictor variables
* NOTE * thigh is highly correlated with other predictor variables
* NOTE * midarm is highly correlated with other predictor variables
The regression equation is
bodyfat = 117 + 4.33 triceps - 2.86 thigh - 2.19 midarm
Predictor Coef Stdev t-ratio p
Constant 117.08 99.78 1.17 0.258
triceps 4.334 3.016 1.44 0.170
thigh -2.857 2.582 -1.11 0.285
midarm -2.186 1.595 -1.37 0.190
s = 2.480 R-sq = 80.1% R-sq(adj) = 76.4%
Now the coefficient for thigh circumference has become negative: -2.857,
as compared with 0.8565 from the single-predictor equation and 0.6594
from the 2-predictor equation. Its standard error is much larger than
before (2.582), resulting in a t-statistic that is quite small: -1.11,
as compared with 7.79 and 2.26 from the previous two equations. So even
though the coefficient is negative, we can't say that it's significantly
different from zero (p=0.285). In other words, thigh appears to be a
useless predictor once we include midarm and triceps, even though it was
the single best predictor of bodyfat when used alone.
The coefficient for triceps remained positive (4.334), and is
substantially larger than the coefficients for triceps in the 1- and
2-predictor equations (0.8572 and 0.2224, respectively). Its standard
error increased quite a bit as well (3.016, compared with 0.1288 and
0.3034 in the previous two equations). Its t-statistic is 1.44, which is
larger than that for the 2-predictor equation (0.73) but still not
significant (p=0.170), despite the increase in the size of the
coefficient.

3. Patient Satisfaction
for reference (cf NWK ):
C1 = Y (patient satisfaction)
C2 = X1 (patient's age)
C3 = X2 (severity of illness)
C4 = X3 (anxiety level)
6.15 part a
MTB > read '/usr/class/ed257/patient.dat' into c1-c4
23 ROWS READ
ROW C1 C2 C3 C4
1 48 50 51 2.3
2 57 36 46 2.3
3 66 40 48 2.2
4 70 41 44 1.8
. . .
MTB > stem c2-c4
Stem-and-leaf of C2 N = 23
Leaf Unit = 1.0
5 2 89999
5 3
7 3 33
8 3 4
10 3 66
11 3 8
(2) 4 01
10 4 233
7 4 45
5 4
5 4 9
4 5 0
3 5 23
1 5 5
Stem-and-leaf of C3 N = 23
Leaf Unit = 1.0
1 4 3
2 4 4
4 4 66
9 4 88899
(6) 5 000111
8 5 23
6 5 445
3 5 6
2 5 8
1 6
1 6 2
Stem-and-leaf of C4 N = 23
Leaf Unit = 0.010
2 18 00
3 19 0
4 20 0
6 21 00
10 22 0000
(5) 23 00000
8 24 0000
4 25 0
3 26
3 27
3 28
3 29 000
Looking at these stem-and-leaf plots we see that X2 (severity of
illness) is approximately normally distributed, X1 (patient's
age) is approximately uniformly distributed, and X3 (anxiety level)
has three large extreme points at 2.9 (stem 29, leaf unit 0.010).
We should remain aware of how these points influence the regression
equation.
6.15c
Fit the multiple regression with all linear terms:
MTB > regress c1 3 c2 c3 c4 c5 c6;
SUBC> residuals c7.
The regression equation is
C1 = 163 - 1.21 C2 - 0.666 C3 - 8.6 C4
Predictor Coef Stdev t-ratio p
Constant 162.88 25.78 6.32 0.000
C2 -1.2103 0.3015 -4.01 0.001
C3 -0.6659 0.8210 -0.81 0.427
C4 -8.61 12.24 -0.70 0.490
s = 10.29 R-sq = 67.3% R-sq(adj) = 62.1%
Analysis of Variance
SOURCE DF SS MS F p
Regression 3 4133.6 1377.9 13.01 0.000
Error 19 2011.6 105.9
Total 22 6145.2
SOURCE DF SEQ SS
C2 1 3678.4
C3 1 402.8
C4 1 52.4
The estimate for beta(2) in this equation (i.e. -.666) is the sample
coefficient for X2 when X1, X2, X3 are used to predict Y.
It can be interpreted (through partial regression plots, adjusted variables)
as the slope of the regression line produced by the regression of the outcome
variable (Y) on the variable X2 dot X1 X3,
i.e. the adjusted variable X2 with the associations between X2
and X1 and X3 removed.
In order to understand this statement more clearly, let's derive
the partial regression coefficient for X2, using the
procedure shown in class (and the marks examples).
The partial regression slope
we are looking for in this case is c1 c3 dot c2 c4 (read: Y on X2 dot
X1 X3). The first step in this procedure is to regress the outcome variable
in c1 on the predictor variables in c2 and c4.
MTB > regress c1 on 2 c2 c4;
SUBC> residuals c8.
The regression equation is
C1 = 147 - 1.24 C2 - 15.9 C4
Predictor Coef Stdev t-ratio p
Constant 147.08 16.73 8.79 0.000
C2 -1.2434 0.2961 -4.20 0.000
C4 -15.891 8.256 -1.92 0.069
s = 10.20 R-sq = 66.1% R-sq(adj) = 62.7%
Our next step is to regress X2 onto X1 and X3. The residuals from
this regression represent the part of the X2 values not
predictable from the X1 and X3 values.
MTB > regress c3 2 c2 c4;
SUBC> residuals c9.
The regression equation is
C3 = 23.7 + 0.0496 C2 + 10.9 C4
Predictor Coef Stdev t-ratio p
Constant 23.728 4.597 5.16 0.000
C2 0.04962 0.08135 0.61 0.549
C4 10.929 2.268 4.82 0.000
s = 2.802 R-sq = 63.8% R-sq(adj) = 60.2%
In order to get the partial regression slope, c1 c3 dot c2 c4, we
simply regress the residuals from the first regression onto the
residuals from the second regression.
A plot of c8 vs c9 is the scatterplot for this regression coefficient.
MTB > regress c8 1 c9
The regression equation is
C8 = 0.00 - 0.666 C9
Predictor Coef Stdev t-ratio p
Constant 0.000 2.041 0.00 1.000
C9 -0.6659 0.7809 -0.85 0.403
s = 9.787 R-sq = 3.3% R-sq(adj) = 0.0%
Note that the coefficient of c9 in this equation is identical
to the coefficient of c3 in the original full regression equation.
The small R-sq for this regression (i.e. .033) suggests that, in
combination with X1 and X3, X2 is pretty useless in providing additional
information about Y.
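The two-stage derivation above can be sketched in Python (standing in for Minitab). The data below are invented, just to exercise the algebra; by the adjusted-variable (Frisch-Waugh) result, the residual-on-residual slope reproduces the full-regression coefficient exactly:

```python
# Tiny OLS helper via normal equations (Gauss-Jordan elimination), used to
# demonstrate the adjusted-variable procedure on INVENTED data -- these are
# not the patient data, just numbers of the same general shape.

def ols(y, *xs):
    """Return [intercept, slope1, slope2, ...] for least squares with intercept."""
    cols = [[1.0] * len(y)] + [list(x) for x in xs]
    k = len(cols)
    # Augmented normal-equations matrix (X'X | X'y).
    a = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(u * v for u, v in zip(cols[i], y))] for i in range(k)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(a[r][i]))   # partial pivoting
        a[i], a[p] = a[p], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [a[r][c] - f * a[i][c] for c in range(k + 1)]
    return [a[i][k] / a[i][i] for i in range(k)]

def residuals(y, *xs):
    b = ols(y, *xs)
    return [yi - b[0] - sum(bj * x[i] for bj, x in zip(b[1:], xs))
            for i, yi in enumerate(y)]

x1 = [50, 36, 40, 41, 28, 49, 42, 45, 52, 29]
x2 = [51, 46, 48, 44, 43, 54, 50, 48, 62, 50]
x3 = [2.3, 2.3, 2.2, 1.8, 1.8, 2.9, 2.2, 2.4, 2.9, 2.1]
y  = [48, 57, 66, 70, 89, 36, 46, 54, 26, 77]

full = ols(y, x1, x2, x3)            # coefficient of x2 is full[2]
e_y  = residuals(y, x1, x3)          # y with x1, x3 regressed out
e_x2 = residuals(x2, x1, x3)         # x2 with x1, x3 regressed out
partial = ols(e_y, e_x2)[1]          # slope of residuals on residuals

assert abs(full[2] - partial) < 1e-6  # identical, as claimed above
```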
6.15d
In NWK 6.15d, we are asked to produce a boxplot of the residuals from
the full regression model. Here we use the simple residuals e(i).
Recall that we placed these residuals in c7.
MTB > boxp c7

[Character boxplot of C7: roughly symmetric about 0, spanning about -16 to 14.]
MTB > stem c7
Stem-and-leaf of C7 N = 23
Leaf Unit = 1.0
1 -1 6
5 -1 4321
7 -0 76
10 -0 310
(6) 0 012234
7 0 577
4 1 2334
Both the boxplot and the stem-and-leaf plot suggest that there are no
outliers.
6.16a
NWK 6.16a asks us to examine whether or not there is a general
regression relation between the three predictor variables (X1, X2,
and X3) and the outcome variable (Y). Our null hypothesis in this
case is that the parameters beta(1), beta(2), and beta(3) are all
equal to zero. The alternative hypothesis is that at least one of these
three coefficients is not equal to zero. The test statistic for this
hypothesis test is calculated as follows:
F = MSReg/MSE = 1377.9/105.9 = 13.01
As you may have noticed, Minitab calculates this test statistic for
you. It should be compared with the critical value F(.90; 3, 19). You
can get an exact value for this using the Minitab invcdf command.
MTB > invcdf .90;
SUBC> f 3, 19.
0.9000 2.3970
Since 13.01 > 2.397, we can reject the null hypothesis of no regression
relation. As was pointed out in class, this isn't telling us
anything very exciting: the test says nothing about
the individual predictor variables (X1, X2, X3).
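The arithmetic of the overall F test, in code form (sums of squares copied from the ANOVA table above):

```python
# Overall F statistic: MSReg / MSE from the analysis-of-variance table.
ms_regression = 4133.6 / 3   # SS_regression / df_regression
ms_error = 2011.6 / 19       # SS_error / df_error

f_stat = ms_regression / ms_error
print(round(f_stat, 2))  # 13.01
```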
6.16b
NWK 6.16b asks us to calculate interval estimates for beta(1),
beta(2), and beta(3) using an overall confidence coefficient .90.
Using the Bonferroni approximation, the confidence coefficient for each
interval would be 1 - .10/3 = 1 - .0333 = .9667. Once again we can use
the invcdf command to obtain the appropriate critical value: t(.0167; 19).
MTB > invcdf .9833;
SUBC> t 19.
0.9833 2.293
The interval estimates for each coefficient would be of the form:
beta(hat)(k) +/- (2.293) * s.e.(beta(hat)(k))
The estimates for beta(k) and the values of s.e.(beta(hat)(k)) can be read
directly from the first section of the Minitab regression output. The
interval estimates are as follows (endpoints rounded to 3 significant digits):
For beta(1): -1.21 +/- (2.293)(.3015) = (-1.90, -.520)
For beta(2): -.666 +/- (2.293)(.8210) = (-2.55, 1.22)
For beta(3): -8.61 +/- (2.293)(12.24) = (-36.7, 19.5)
The only CI that does not contain 0 is the one for beta(1)
(as also seen from the t-statistics in the regression output).
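The Bonferroni intervals can be assembled mechanically from the coefficients, standard errors, and the critical value t(.0167; 19) = 2.293 (a sketch; numbers copied from the output above):

```python
# Bonferroni simultaneous 90% intervals for the three slope coefficients.
t_crit = 2.293  # t(.0167; 19), from invcdf above
estimates = {"beta1": (-1.2103, 0.3015),
             "beta2": (-0.6659, 0.8210),
             "beta3": (-8.61, 12.24)}

for name, (b, se) in estimates.items():
    lo, hi = b - t_crit * se, b + t_crit * se
    print(f"{name}: ({lo:.3g}, {hi:.3g})  contains 0: {lo < 0 < hi}")
```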
6.16c
NWK 6.16c asks you to calculate the coefficient of multiple
correlation. Once again, we can get this information from the
Minitab output. Minitab gives us the squared multiple correlation
(i.e. R-sq); for this regression, R-sq is .673.
What NWK call the coefficient of multiple correlation is the positive
square root of this value, so the
coefficient of multiple correlation is R = .82. This is the
simple correlation between Y and Y-hat, the value you would
get by correlating c1 and c6 above.
6.17a
NWK 6.17a asks us to calculate a 90% confidence interval for the mean
response of Y when X1 = 35, X2 = 45, and X3 = 2.2. A point estimate can
be obtained by plugging the specified values for X1, X2, and X3 into the
regression equation.
Y(hat) = 162.88 - (1.2103)(35) - (.6659)(45) - (8.61)(2.2)
= 162.88 - 42.3605 - 29.9655 - 18.942 = 71.612
The easiest way to compute the standard error of the fit (and also the
point estimate; compare the numerical accuracy above with the value below)
for this set of X values is to use the Minitab 'predict' subcommand.
MTB > regress c1 3 c2 c3 c4;
SUBC> predict 35 45 2.2.
The regression equation is
C1 = 163 - 1.21 C2 - 0.666 C3 - 8.6 C4
Predictor Coef Stdev t-ratio p
Constant 162.88 25.78 6.32 0.000
C2 -1.2103 0.3015 -4.01 0.001
C3 -0.6659 0.8210 -0.81 0.427
C4 -8.61 12.24 -0.70 0.490
s = 10.29 R-sq = 67.3% R-sq(adj) = 62.1%
Fit Stdev.Fit 95% C.I. 95% P.I.
71.60 4.44 ( 62.30, 80.90) ( 48.14, 95.06)
This automatically gives you a 95% confidence interval for the mean value
of Y when X1 = 35, X2 = 45, and X3 = 2.2. However, we wanted a 90%
confidence interval.
For the 90% confidence interval, we use as the critical value the 95th
percentile of the t-distribution with 19 df, i.e. t(.10/2; 19) or t(.05; 19).
This value is 1.729. The interval estimate is as follows:
Y(hat) +/- t(crit) * s.e.(fit) = 71.6 +/- (1.729)(4.444)
= (63.92, 79.28)
With 90% confidence, we can say that the mean response of Y when X1 = 35,
X2 = 45, and X3 = 2.2 falls between 63.92 and 79.28.
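The same interval, assembled from the fit and Stdev.Fit reported by the predict subcommand:

```python
# 90% CI for the mean response, built from the predict output above.
fit, se_fit, t_crit = 71.60, 4.44, 1.729   # t(.05; 19) = 1.729

lo, hi = fit - t_crit * se_fit, fit + t_crit * se_fit
print(round(lo, 2), round(hi, 2))  # 63.92 79.28
```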
*******************************************
NWK 9.7 v4 asks us to prepare a partial regression plot for each
of the independent variables (i.e. X1, X2, X3) in the regression equation.
Since we already have the two sets of residuals needed for the
partial regression plot for X2 as a predictor of Y
(using X1, X2, X3), we can simply give the plot command.
MTB > plot c8 c9
[Character plot of C8 vs C9 (-4.0 to 6.0): the partial regression plot for X2, showing a weak negative scatter.]
We can obtain the partial regression plots for the other two independent
variables in a similar way. For X1, we first need to regress
the outcome variable (Y) on X2 and X3 to obtain one set of residuals.
MTB > regress c1 2 c3 c4;
SUBC> residuals c10.
The regression equation is
C1 = 164 - 1.11 C3 - 20.2 C4
Predictor Coef Stdev t-ratio p
Constant 164.23 34.15 4.81 0.000
C3 -1.111 1.078 -1.03 0.315
C4 -20.23 15.76 -1.28 0.214
s = 13.63 R-sq = 39.5% R-sq(adj) = 33.4%
Unusual Observations
Obs. C3 C1 Fit Stdev.Fit Residual St.Resid
11 48.0 89.00 62.33 5.26 26.67 2.12R
17 56.0 79.00 51.42 4.62 27.58 2.15R
R denotes an obs. with a large st. resid.
Next we need to regress X1 onto X2 and X3.
MTB > regress c2 2 c3 c4;
SUBC> residuals c11.
The regression equation is
C2 = -1.1 + 0.368 C3 + 9.60 C4
Predictor Coef Stdev t-ratio p
Constant -1.12 19.12 -0.06 0.954
C3 0.3681 0.6034 0.61 0.549
C4 9.599 8.823 1.09 0.290
s = 7.632 R-sq = 26.1% R-sq(adj) = 18.8%
We can get the partial regression plot for X1 as a predictor
by plotting these two sets of residuals.
MTB > plot c10 c11
[Character plot of C10 vs C11 (-10.0 to 15.0): the partial regression plot for X1, showing a clear negative trend.]
This looks like a reasonable scatterplot with a strong negative slope.
Similarly, for X3 we first regress Y on X1 and X2.
MTB > regress c1 2 c2 c3;
SUBC> residuals c12.
The regression equation is
C1 = 167 - 1.26 C2 - 1.09 C3
Predictor Coef Stdev t-ratio p
Constant 166.59 24.91 6.69 0.000
C2 -1.2605 0.2892 -4.36 0.000
C3 -1.0893 0.5514 -1.98 0.062
s = 10.16 R-sq = 66.4% R-sq(adj) = 63.1%
Next, we regress X3 onto X1 and X2.
MTB > regress c4 2 c2 c3;
SUBC> residuals c13.
The regression equation is
C4 = -0.431 + 0.00582 C2 + 0.0492 C3
Predictor Coef Stdev t-ratio p
Constant -0.4314 0.4608 -0.94 0.360
C2 0.005821 0.005350 1.09 0.290
C3 0.04916 0.01020 4.82 0.000
s = 0.1880 R-sq = 65.2% R-sq(adj) = 61.7%
Unusual Observations
Obs. C2 C4 Fit Stdev.Fit Residual St.Resid
6 49.0 2.9000 2.5085 0.0600 0.3915 2.20R
R denotes an obs. with a large st. resid.
MTB > plot c12 c13
[Character plot of C12 vs C13 (-0.30 to 0.45): the partial regression plot for X3, showing a weak negative scatter.]
Of the three partial regression plots, only the one for X1 reveals
a somewhat strong negative association. The other two reveal weak
negative associations. This suggests that either or both of X2
and X3 may not be very useful in improving prediction of Y.
From the full multiple regression, note that X1 is the only
predictor with a large t-statistic.

4.
a)
MTB > read 'readiq.dat' c1-c4
60 ROWS READ
ROW C1 C2 C3 C4
1 1 16 41 90
2 1 15 36 84
3 1 15 40 86
4 1 15 35 85
. . .
MTB > name c1 'readabil' c2 'attn' c3 'spatial' c4 'wisciq'
MTB > lplot c2 c3 c1
[Letter plot of attn (y-axis, up to 40+) vs spatial (24.5 to 42.0), with letters A, B, C marking the three reading groups coded in c1; the C's sit highest on attn.]
This plot shows no strong linear relationship between the two subscale
scores for all three reading groups combined. There is some visual
evidence of a slight negative linear relationship, but it is not strong.
Within groups there is also little evidence of a strong linear
relationship between the subscales. Normal readers tend to perform better on
the attention/concentration scale than do the other two groups, but may be at
a slight disadvantage on the spatial scale.
b)
First, select observations for students classified as normal.
MTB > copy c2-c4 c5-c7;
SUBC> use c1=3.
MTB > name c5 'attnnorm' c6 'spatnorm' c7 'iqnorm'
MTB > regress c7 2 c5 c6 c8 c9;
SUBC> resids c10;
SUBC> predict 32 30.
The regression equation is
iqnorm = 41.6 + 0.748 attnnorm + 1.18 spatnorm
Predictor Coef Stdev t-ratio p
Constant 41.61 12.87 3.23 0.003
attnnorm 0.7479 0.2878 2.60 0.015
spatnorm 1.1822 0.2602 4.54 0.000
s = 4.541 R-sq = 47.9% R-sq(adj) = 44.1%
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 512.69 256.34 12.43 0.000
Error 27 556.68 20.62
Total 29 1069.37
SOURCE DF SEQ SS
attnnorm 1 87.11
spatnorm 1 425.57
Unusual Observations
Obs. attnnorm iqnorm Fit Stdev.Fit Residual St.Resid
4 29.0 92.000 101.128 1.182 -9.128 -2.08R
21 24.0 95.000 92.659 2.561 2.341 0.62 X
R denotes an obs. with a large st. resid.
X denotes an obs. whose X value gives it large influence.
Fit Stdev.Fit 95% C.I. 95% P.I.
101.007 0.844 ( 99.274,102.740) ( 91.528,110.486)
Coefficient for attention is 0.7479, and for spatial is 1.1822. Squared
multiple correlation is 0.479.
MTB > plot c10 c9
[Character plot of C10 (residuals) vs C9 (fitted values, about 91 to 108.5): random scatter about 0.]
This looks like random scatter, indicating that the linear fit was
probably appropriate. The only point that seems to be an outlier is the
one that Minitab flagged, with fitted value near 101. The variability
appears somewhat greater near the middle of the data, but with so few data
points it is difficult to tell whether heteroscedasticity is a problem.
The 95% prediction interval is obtained from the predict subcommand above.
The fit is 101.007, and the 95% P.I. is (91.528, 110.486).

5.
part a.
Below you will find a summary table of group indicator values for each of
the four different treatments.
Treatment G1 G2 G3
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1
We know that the expected value for any given treatment is simply the
population mean for that treatment. Therefore,
E(Y | G1=0, G2=0, G3=0) = mu(1)
E(Y | G1=1, G2=0, G3=0) = mu(2)
E(Y | G1=0, G2=1, G3=0) = mu(3)
E(Y | G1=0, G2=0, G3=1) = mu(4)
Furthermore, we can represent the expected value for each treatment in
terms of general linear model parameters, by plugging the appropriate
values for G1, G2, and G3 into the given model. Namely,
E(Y | G1, G2, G3) = beta0 + beta1*G1 + beta2*G2 + beta3*G3
Treatment 1
E(Y | G1=0, G2=0, G3=0) = beta0 + beta1*0 + beta2*0 + beta3*0
= beta0
Treatment 2
E(Y | G1=1, G2=0, G3=0) = beta0 + beta1*1 + beta2*0 + beta3*0
= beta0 + beta1
Treatment 3
E(Y | G1=0, G2=1, G3=0) = beta0 + beta1*0 + beta2*1 + beta3*0
= beta0 + beta2
Treatment 4
E(Y | G1=0, G2=0, G3=1) = beta0 + beta1*0 + beta2*0 + beta3*1
= beta0 + beta3
Hence,
mu(1) = beta0
mu(2) = beta0 + beta1
mu(3) = beta0 + beta2
mu(4) = beta0 + beta3
Using the information that mu(1) = 7, mu(2) = 9, mu(3) = 6, mu(4) = 15, we
conclude:
beta0 = mu(1) = 7
beta1 = mu(2) - mu(1) = 9 - 7 = 2
beta2 = mu(3) - mu(1) = 6 - 7 = -1
beta3 = mu(4) - mu(1) = 15 - 7 = 8
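A quick check of this correspondence in code: recover the betas from the cell means, then rebuild each treatment mean from the betas:

```python
# Cell means given in the problem, indexed by treatment.
mu = {1: 7, 2: 9, 3: 6, 4: 15}

beta0 = mu[1]
beta1, beta2, beta3 = mu[2] - mu[1], mu[3] - mu[1], mu[4] - mu[1]

def expected(g1, g2, g3):
    """E(Y | G1, G2, G3) under the dummy-coded linear model."""
    return beta0 + beta1 * g1 + beta2 * g2 + beta3 * g3

assert (beta0, beta1, beta2, beta3) == (7, 2, -1, 8)
assert [expected(0, 0, 0), expected(1, 0, 0),
        expected(0, 1, 0), expected(0, 0, 1)] == [mu[1], mu[2], mu[3], mu[4]]
```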

part b.
mu(3) - mu(2)
= beta0 + beta2 - (beta0 + beta1)
= beta2 - beta1
= -1 - 2 = -3
Double-checking this answer using our population cell means:
mu(3) - mu(2) = 6 - 9 = -3
It checks out.


PROBLEM 6
MTB > read 'salary.dat' c1-c4
46 ROWS READ
ROW C1 C2 C3 C4
1 1 1 1 13876
2 1 3 0 11608
3 1 3 1 18701
4 1 2 0 11283
. . .
MTB > name c1 'exp' c2 'ed' c3 'man' c4 'salary'
Note: to look at categorical variables, the MTB table or tally commands
will give you relevant counts, relative frequencies, etc. -- roughly
the equivalent of the describe command for measured variables.
MTB > table c2
ROWS: ed
COUNT
1 14
2 19
3 13
ALL 46
MTB > # We need to dummy code the education categories so as to represent
MTB > # three different groups (not levels of an interval variable);
MTB > # make HS the "base" category.
MTB > code (1 3) 0 (2) 1 c2 c5
MTB > # BS = 1, otherwise = 0
MTB > name c5 'ed BS'
MTB > table c5
ROWS: ed BS
COUNT
0 27
1 19
ALL 46
MTB > code (1 2) 0 (3) 1 c2 c6
MTB > name c6 'ed Ad Deg'
MTB > # Adv = 1, 0 otherwise
MTB > name c6 'ed Adv'
To consider the single best predictor of salary, we look at
the correlations
MTB > corr c1 c3 c4
exp man
man 0.051
salary 0.539 0.730
From the above, corr(salary, management) = 0.730.
Note that corr(salary, management) is a point-biserial
correlation (a measured variable, salary, correlated with a truly
dichotomous variable, management or not).
And for the dummy variable, education, we look at
MTB > regress c4 2 c5 c6
The regression equation is
salary = 14942 + 3345 ed BS + 3351 ed Adv
Predictor Coef Stdev t-ratio
Constant 14942 1217 12.27
ed BS 3345 1604 2.09
ed Adv 3351 1754 1.91
s = 4554 R-sq = 10.9% R-sq(adj) = 6.8%
Analysis of Variance
SOURCE DF SS MS
Regression 2 109134646 54567323
Error 43 891962928 20743324
Total 45 1001097576
SOURCE DF SEQ SS
ed BS 1 33425874
ed Adv 1 75708771
Unusual Observations
Obs. ed BS salary Fit Stdev.Fit Residual St.Resid
42 1.00 27837 18286 1045 9551 2.15R
R denotes an obs. with a large st. resid.
The correlation here is sqrt(0.109) = 0.330.
Thus it looks like management status is the best predictor of salary,
at least among these variables.
Predicting salary from all three variables,
MTB > regress c4 4 c1 c3 c5 c6
The regression equation is
salary = 8036 + 546 exp + 6884 man + 3144 ed BS + 2996 ed Adv
Predictor Coef Stdev t-ratio
Constant 8035.6 386.7 20.78
exp 546.18 30.52 17.90
man 6883.5 313.9 21.93
ed BS 3144.0 362.0 8.69
ed Adv 2996.2 411.8 7.28
s = 1027 R-sq = 95.7% R-sq(adj) = 95.3%
Analysis of Variance
SOURCE DF SS MS
Regression 4 957816856 239454214
Error 41 43280720 1055627
Total 45 1001097576
And adding two interaction terms for management by education
MTB > let c7=c5*c3
MTB > let c8=c6*c3
MTB > name c7 'BS*man'
MTB > name c8 'Adv*man'
MTB > regress c4 6 c1 c3 c5 c6 c7 c8
The regression equation is
salary = 9473 + 497 exp + 3981 man + 1382 ed BS + 1731 ed Adv + 4903 BS*man
+ 3066 Adv*man
Predictor Coef Stdev t-ratio
Constant 9472.69 80.34 117.90
exp 496.987 5.566 89.28
man 3981.4 101.2 39.35
ed BS 1381.67 77.32 17.87
ed Adv 1730.7 105.3 16.43
BS*man 4902.5 131.4 37.32
Adv*man 3066.0 149.3 20.53
s = 173.8 R-sq = 99.9% R-sq(adj) = 99.9%
Analysis of Variance
SOURCE DF SS MS
Regression 6 999919408 166653234
Error 39 1178168 30209
Total 45 1001097576
The model with 4 predictors has R-sq = 0.957.
The model with 6 predictors has R-sq = 0.999.
The change in R-sq = 0.042.
The test statistic to use is
F = [delta(R-sq)/(# variables added)] / [(1 - R-sq(full))/(N - #variables in full - 1)]
= (0.042/2) / [(1 - 0.999)/(46 - 6 - 1)]
= 819
The critical value is F(.95; 2, 39) = 3.24,
so we conclude that the interaction terms add significantly to the
prediction of salary.
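The incremental F computation in code form (R-sq values copied from the two fits above):

```python
# Incremental F test: did the 2 interaction terms improve the 6-predictor
# full model over the 4-predictor reduced model? N = 46 observations.
r2_reduced, r2_full = 0.957, 0.999
added, n, k_full = 2, 46, 6

f_stat = ((r2_full - r2_reduced) / added) / ((1 - r2_full) / (n - k_full - 1))
print(round(f_stat))  # 819 -- far beyond F(.95; 2, 39) = 3.24
```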
The regression of salary on experience, management and education is
MTB > regress c4 4 c1 c3 c5 c6
The regression equation is
salary = 8036 + 546 exp + 6884 man + 3144 ed BS + 2996 ed Adv
Predictor Coef Stdev t-ratio
Constant 8035.6 386.7 20.78
exp 546.18 30.52 17.90
man 6883.5 313.9 21.93
ed BS 3144.0 362.0 8.69
ed Adv 2996.2 411.8 7.28
s = 1027 R-sq = 95.7% R-sq(adj) = 95.3%
Analysis of Variance
SOURCE DF SS MS
Regression 4 957816856 239454214
Error 41 43280720 1055627
Total 45 1001097576
We answer the last two parts using the model without the interaction terms,
which is basically an arbitrary choice of convenience. It would also be
reasonable to use the full model and calculate this increment separately
for managers and nonmanagers, etc.
From the regression coefficient, we see that one more year of
experience is worth, on average, $546.18
A 95% confidence interval is thus given by
546.18 +/- t(.975, 41) * 30.52
===> (484.50, 607.86)
The "value" of an advanced degree is $2996.2, so
a 95% confidence interval for this estimate is
2996.2 +/- t(.975, 41) * 411.8
===> (2163.95, 3828.46)
Note: the "value" is calibrated according to the coding of the educational
groups. The phrasing of the final question, "Repeat for an advanced
degree in addition to the BS," can easily be given a different
reading than I had intended. The comparison between advanced degree and H.S.
is what is done above (and what I had in mind).

Prob 7.
The prediction equation to work from is
S = 1900 + 230*B + 18*A + 100*E + 490*D + 190*Y + 50*T - 2400*X
The coefficient of T in the regression equation is 50. Note that
salary is in an arbitrary 'funny-money' unit. The coefficient tells us that
there will be a 50-unit decrease in salary for someone whose student
evaluations change from very good (T=1) to poor (T=0). This is true no
matter what the values of the other variables are (of course this conclusion
applies only to this model, with its set of predictors).
The partial regression coefficient of T here can be interpreted as the
average change in salary between the two levels of student evaluation,
which is 50.

To do this problem the HARD way, you can plug in the values of the other
variables that the question provides for us (B=A=E=D=X=1 and Y=5) into
the regression equation:
S = 1900 + 230(1) + 18(1) + 100(1) + 490(1) + 190(5) + 50(T) - 2400(1)
= 1288 + 50(T)
For a professor with good student evaluations (T = 1): S = 1288 + 50(1) = 1338
For a professor with bad student evaluations (T = 0): S = 1288
So the expected change in salary is 1288 - 1338 = -50 (that is, a
decrease of 50).

Part b.
The model for the S on X regression is
E(S | X) = beta(0) + beta(1)*X
To calculate the value of the estimate of the slope (beta(hat)(1)),
first look at the expected values of S within each group and realize
that they are equal to the means of S within each group.
The sample values are
E(S | X=0) = beta(hat)(0) = mu(hat)(0)
E(S | X=1) = beta(hat)(0) + beta(hat)(1) = mu(hat)(1)
Where:
mu(hat)(0) = the mean salary for males = 16,100
mu(hat)(1) = the mean salary for females = 11,200
To calculate the slope, beta(hat)(1), we subtract the two group
means:
beta(hat)(1) = mu(hat)(1) - mu(hat)(0) = 11,200 - 16,100 = -4,900
Another way to do this problem is to realize that the line for the
regression of S on X will pass through the points (0, 16100) and
(1, 11200).
The slope = (change in S)/(change in X) = (16100 - 11200)/(0 - 1) = -4900.
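The two-point slope computation is easy to verify; here is a quick Python sketch using the group means from the problem:

```python
# Group means of salary S: males coded X = 0, females coded X = 1.
mu0, mu1 = 16100, 11200

# Slope of the S-on-X line through (0, mu0) and (1, mu1).
slope = (mu1 - mu0) / (1 - 0)
print(slope)  # -4900.0
```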

PROBLEM 8
To calculate the slope for the C1 on C2 regression for the incentive
group, we examine the general model:
E(C1|C2,C3) = beta(0) + beta(1)*C3 + beta(2)*C2 +
beta(3)*(C2*C3)
We have the results of this regression in the output for the regression
of C1 on C2, C3, and C4 (i.e., C2*C3).
From substituting 0 and 1 for C3, we know that:
E(C1|C2, C3=0) = beta(0) + beta(2)*C2
E(C1|C2, C3=1) = [beta(0) + beta(1)]
+ [beta(2) + beta(3)]*C2
Thus the sample value of the slope for the incentive group (C3=1) is
beta(hat)(2) + beta(hat)(3).
slope = .267 + (-.06) = .207
To calculate the correlation between C1 and C2 for the incentive group,
we use the equation:
slope(C1 on C2) = r(C1,C2)*(s(C1)/s(C2))
You've seen this equation written as: b(yx) = r(xy)*(s(y)/s(x)).
The question tells us that the standard deviations of C1 and C2 for the
incentive group are 1.782 and 5.84, respectively. Solving for r, we
get:
.207 = r(C1,C2)*(1.782/5.84)
or r(C1,C2) = .207*(5.84/1.782) = .678
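The slope and correlation arithmetic for the incentive group can be checked in a short Python sketch (values lifted from the regression output and problem statement):

```python
# Incentive-group slope: beta2_hat + beta3_hat from the C1 on C2, C3, C4 regression.
slope = 0.267 + (-0.06)

# Solve slope = r * (s_C1 / s_C2) for r, using the given standard deviations.
s_c1, s_c2 = 1.782, 5.84
r = slope * (s_c2 / s_c1)
print(round(slope, 3), round(r, 3))  # 0.207 0.678
```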
Treatment Effect Confidence Interval
There are a couple of ways to do this problem.
(i)
The treatment effect is represented by the coefficient of C3 (the group
membership variable) in the regression of C1 on C3 and C2. However, it
is blocked out in the output and so all we know is the standard error of
the treatment effect (.4926).
However, we also know that the treatment effect is the difference
between the adjusted means in the ancova output. This is
9.9167 - 5.8333 = 4.0834.
A 99% confidence interval is
(4.0834) +/- (.4926)*t(.995, df = #obs - #parameters)
In NWK we look up t(.995, 21) = 2.831.
The 99% confidence interval has endpoints 4.0834 +/- 1.395 or
(2.689, 5.478).
(ii)
Another way to calculate the value of the treatment effect is to notice
that the Fvalue for C3 in the ancova output is the square of the
tratio for C3 in the regression of C1 on C3 and C2.
F = MS(C3)/MS(Error) = 100.042/1.456 = 68.71
t = (68.71)**.5 = 8.29
Since we know the standard error of the coefficient of C3 in the
regression of C1 on C3 and C2 and its tratio, we can solve for the
coefficient:
coefficient of C3 = treatment effect = 8.29*.4926 = 4.0836
Now we can calculate the confidence interval as we had done above.
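Method (ii)'s back-calculation and the resulting interval can be sketched in Python (t(.995, 21) = 2.831 is taken from the table, as in the text):

```python
se = 0.4926              # standard error of the C3 coefficient
t_ratio = 68.71 ** 0.5   # t = sqrt(F) for the one-df C3 term
effect = t_ratio * se    # recovered treatment effect (coefficient of C3)
t_crit = 2.831           # t(.995, df = 21)
lo, hi = effect - t_crit * se, effect + t_crit * se
print(round(effect, 3), round(lo, 3), round(hi, 3))  # 4.083 2.689 5.478
```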
(iii) use formula for adjusted mean difference:
Ybar(1) - Ybar(0) - betahat(pooled)*(Xbar(1) - Xbar(0))
betahat(pooled), the pooled slope from the ancova routine printout, is .2367.
Ybar(.) and Xbar(.) are the within-group means on the
outcome and covariate from describe. Substitute the values.
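Method (iii)'s substitution, using the within-group means from the describe output and the pooled slope .2367, can be checked in Python:

```python
# Within-group means (from the describe output) and pooled ancova slope.
ybar1, ybar0 = 9.917, 5.833    # outcome C1 means, incentive and control
xbar1, xbar0 = 12.5, 12.5      # covariate C2 means (identical here)
beta_pooled = 0.2367

adj_diff = ybar1 - ybar0 - beta_pooled * (xbar1 - xbar0)
print(round(adj_diff, 3))  # 4.084
```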
PART II Comparing Nonparallel Regression Lines Analysis
MTB > read 'retention.dat' c1c3
24 ROWS READ
ROW C1 C2 C3
1 3 5 0
2 4 5 0
3 5 5 0
4 4 10 0
. . .
Do a simple description. Note that the covariate means are identical across
groups; they should be close under random assignment of subjects.
MTB > describe c1c2;
SUBC> by c3.
C3 N MEAN MEDIAN TRMEAN STDEV SEMEAN
C1 0 12 5.833 5.500 5.800 1.850 0.534
1 12 9.917 10.000 9.900 1.782 0.514
C2 0 12 12.50 12.50 12.50 5.84 1.69
1 12 12.50 12.50 12.50 5.84 1.69
C3 MIN MAX Q1 Q3
C1 0 3.000 9.000 4.250 7.750
1 7.000 13.000 8.250 11.000
C2 0 5.00 20.00 6.25 18.75
1 5.00 20.00 6.25 18.75
Form interaction term for regression
MTB > let c4 = c2*c3
Do the general regression allowing different within-group slopes.
The subcommands pick up useful quantities for the procedures below.
Note that the within-group slopes are nearly identical; therefore the
treatment effect is only weakly a function of the covariate.
MTB > brief 3
MTB > regress c1 3 c3 c2 c4;
SUBC> mse k1;
SUBC> coefficients c10;
SUBC> xpxinv m1.
The regression equation is
C1 = 2.50 + 4.83 C3 + 0.267 C2 - 0.0600 C4
Predictor Coef Stdev tratio p
Constant 2.5000 0.8646 2.89 0.009
C3 4.833 1.223 3.95 0.001
C2 0.26667 0.06314 4.22 0.000
C4 -0.06000 0.08929 -0.67 0.509
s = 1.223 Rsq = 82.7% Rsq(adj) = 80.1%
Analysis of Variance
SOURCE DF SS MS F p
Regression 3 142.725 47.575 31.82 0.000
Error 20 29.900 1.495
Total 23 172.625
SOURCE DF SEQ SS
C3 1 100.042
C2 1 42.008
C4 1 0.675
Obs. C3 C1 Fit Stdev.Fit Residual St.Resid
1 0.00 3.000 3.833 0.591 -0.833 -0.78
2 0.00 4.000 3.833 0.591 0.167 0.16
3 0.00 5.000 3.833 0.591 1.167 1.09
4 0.00 4.000 5.167 0.387 -1.167 -1.01
5 0.00 5.000 5.167 0.387 -0.167 -0.14
6 0.00 6.000 5.167 0.387 0.833 0.72
7 0.00 5.000 6.500 0.387 -1.500 -1.29
8 0.00 6.000 6.500 0.387 -0.500 -0.43
9 0.00 8.000 6.500 0.387 1.500 1.29
10 0.00 7.000 7.833 0.591 -0.833 -0.78
11 0.00 8.000 7.833 0.591 0.167 0.16
12 0.00 9.000 7.833 0.591 1.167 1.09
13 1.00 7.000 8.367 0.591 -1.367 -1.28
14 1.00 8.000 8.367 0.591 -0.367 -0.34
15 1.00 9.000 8.367 0.591 0.633 0.59
16 1.00 9.000 9.400 0.387 -0.400 -0.34
17 1.00 10.000 9.400 0.387 0.600 0.52
18 1.00 11.000 9.400 0.387 1.600 1.38
19 1.00 8.000 10.433 0.387 -2.433 -2.10R
20 1.00 11.000 10.433 0.387 0.567 0.49
21 1.00 12.000 10.433 0.387 1.567 1.35
22 1.00 10.000 11.467 0.591 -1.467 -1.37
23 1.00 11.000 11.467 0.591 -0.467 -0.44
24 1.00 13.000 11.467 0.591 1.533 1.43
R denotes an obs. with a large st. resid.
MTB > print k1
K1 1.49500
Here's the variancecovariance matrix for the betahat
MTB > multiply k1 m1 m2
MTB > print m2
MATRIX M2
0.74750 -0.74750 -0.04983 0.04983
-0.74750 1.49500 0.04983 -0.09967
-0.04983 0.04983 0.00399 -0.00399
0.04983 -0.09967 -0.00399 0.00797
Now D(X), the sample estimate of the treatment effect as a function of the
covariate, is
D(X) = 4.83 - 0.0600*X
which you can see does not change strongly over the observed values of X
(which range between 5 and 20): D(X) runs from about 4.5 down to 3.6.
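A small Python check of D(X) at the two ends of the covariate range (using the full-precision intercept 4.833 from the regression output):

```python
# Sample treatment-effect function from the nonparallel-slopes regression.
def D(x):
    return 4.833 - 0.06 * x

print(round(D(5), 3), round(D(20), 3))  # 4.533 3.633
```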

Inferences for D(X)
Pick-a-point for X=12.5
(note that this also corresponds to the ancova estimate, as
Ca = 12.5, i.e. .09967/.00797)
I did the arithmetic and substitutions in Mathematica
(rather than a calculator) so I'd have a record to include
in the solutions. (Here d[X] is our D(X), etc.)
In[3]:= d[X]
Out[3]= 4.833 - 0.06 X
In[13]:= d[12.5]
Out[13]= 4.083
In[4]:= vard[X]
Out[4]= 0.248562 + 0.00797 (-12.5056 + X)^2
In[14]:= Sqrt[vard[12.5]]
Out[14]= 0.49856
So 99% confidence interval for Delta(12.5) has endpoints
4.083 +/- 2.845*.49856
4.083 +/- 1.418
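The pick-a-point computation can also be checked in Python, building Var(D(X)) from the variance and covariance entries of matrix M2 (t(.995, 20) = 2.845 from the table):

```python
import math

var_b1, cov_b1b3, var_b3 = 1.495, -0.09967, 0.00797  # entries of matrix M2

def d(x):
    return 4.833 - 0.06 * x

def var_d(x):
    # Var(b1 + b3*x) = Var(b1) + 2x*Cov(b1,b3) + x^2*Var(b3)
    return var_b1 + 2 * cov_b1b3 * x + var_b3 * x * x

x = 12.5
se = math.sqrt(var_d(x))
t_crit = 2.845  # t(.995, 20)
print(round(d(x), 3), round(se, 5), round(t_crit * se, 3))  # 4.083 0.49856 1.418
```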

Construct Simultaneous confidence interval for Delta(X)
Confidence bands have the form D(X) +/- Sqrt[2*fCrit220*varD[X]]
In[4]:= vard[X]
Out[4]= 0.248562 + 0.00797 (-12.5056 + X)^2
In[7]:= fCrit220 = 3.49
Out[7]= 3.49
I used these Table commands to get the values of the
confidence band functions for X values from 0 to 30.
In[9]:= Table[{ix, d[ix] + Sqrt[2*fCrit220*vard[ix]]}, {ix, 0, 30, 5.}]
In[11]:= Table[{ix, d[ix] - Sqrt[2*fCrit220*vard[ix]]}, {ix, 0, 30, 5.}]
X D(X) Lower CI Upper CI
0 4.833 1.60266 8.06334
5. 4.533 2.32644 6.73956
10. 4.233 2.78931 5.67669
15. 3.933 2.4904 5.3756
20. 3.633 1.42858 5.83742
25. 3.333 0.105091 6.56091
30. 3.033 -1.29838 7.36438
Throughout the range of these data the confidence bands for Delta(X)
do not include 0. I.e. all observed values of X (5 to 20) are in the
Region of Significance.
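The simultaneous-band table can be reproduced in Python from the same quantities (fCrit = 3.49 = F(.95; 2, 20), and the M2 entries Var(b1), Cov(b1,b3), Var(b3)):

```python
import math

def d(x):
    return 4.833 - 0.06 * x

def var_d(x):
    # Var(D(X)) built from matrix M2: Var(b1) + 2x*Cov(b1,b3) + x^2*Var(b3)
    return 1.495 - 2 * 0.09967 * x + 0.00797 * x * x

f_crit = 3.49  # F(.95; 2, 20)
for x in range(0, 31, 5):
    half = math.sqrt(2 * f_crit * var_d(x))
    print(x, round(d(x) - half, 3), round(d(x) + half, 3))
```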
As we saw in Part I, ancova properly reports a strong treatment effect.
For reference the ancova routine for these data
MTB > ancova c1 = c3;
SUBC> covariates c2.
Factor Levels Values
C3 2 0 1
Analysis of Covariance for C1
Source DF ADJ SS MS F P
Covariates 1 42.008 42.008 28.85 0.000
C3 1 100.042 100.042 68.71 0.000
Error 21 30.575 1.456
Total 23 172.625
Covariate Coeff Stdev tvalue P
C2 0.2367 0.0441 5.371 0.000


END HW3 solutions