Education 257 HW3 Feb 15 2005 ("Due" March 7 2005)
Part I. Multiple Regression
1. The file 'hospital.dat'
contains data on days hospitalized (X in C1) and
a prognosis index (Y in C2) for 15 severely injured patients.
A hospital administrator wants to develop a prediction equation
for the long term prognosis using the length of the hospital stay.
(a) Develop a prediction equation by straightening the
scatterplot and using a straight-line fit.
Give the fit and an interval estimate for
a patient hospitalized 10 days.
Repeat for 60 days hospitalization.
(b) For this same problem develop a prediction equation for the
long term prognosis by fitting a polynomial.
Compare the fits and interval estimates for expected prognosis
for a patient hospitalized 10 days from the
two approaches-- polynomial fit vs straightening the scatterplot
and using a straight-line fit in part a. Repeat the comparison for
60 days hospitalization.
===============================================================
2.
Bodyfat data revisited
By referring to file bodyfat.out and/or to the output in
NWK (or by redoing the analyses), let's use this
example to once again illustrate the vagaries of multiple
regression coefficients (and improper attempts to interpret
them).
Which of the three predictors--triceps X1, thigh X2 or
midarm X3-- is the best single predictor of bodyfat?
What is the regression coefficient for that predictor in
a single predictor equation? What is the corresponding
t-statistic for that coefficient?
Now consider the regression using both triceps and thigh as
predictors. Compare the coefficients (and their t-statistics)
from this multiple regression with the corresponding single
predictor equations.
Now consider the multiple regression using all three predictors.
For triceps and thigh, compare the coefficients (and their
t-statistics) from this multiple regression with the results from
the previous regression equations. To decrease bodyfat does one puff
up one's thighs?
------------------------------------------------------------
3. Patient Satisfaction Data are described and listed in
Problem NWK 6.15, p.254-5 ver4 (p.251 ver 5).
The data reside in file patient.dat
From NWK 6.15: A hospital administrator wished to study the relation
between patient satisfaction Y (in C1) and X1 patient's age (in C2),
X2 an index of severity of illness (in C3), and X3
anxiety level (in c4), where larger values of Y, X2, X3 indicate
more satisfaction, more severe illness, and more anxiety, respectively.
Do the following parts of problems:
6.15 a c d
6.16 a b c
6.17 a
9.7 p.393 ver4 or 10.7 p.414 ver5, parts a b
Also, for the fit from 6.15c, verify that the regression
coefficients can be obtained from straight line fits to
the corresponding partial regression plots. Use the
coefficient for X2 as your example.
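
As a hedged illustration of the partial regression plot check described above (not part of the original assignment), the sketch below uses small made-up data, NOT the patient.dat values. It regresses Y on X1 and X3, regresses X2 on X1 and X3, and then fits a straight line to the two sets of residuals. The helper names (solve, ols, residuals) are hypothetical, and the fit uses plain normal equations rather than any particular package.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)

def residuals(columns, y):
    """Residuals of y after regressing on the given predictor columns."""
    coef = ols(columns, y)
    return [yi - coef[0] - sum(c * col[i] for c, col in zip(coef[1:], columns))
            for i, yi in enumerate(y)]

# Made-up illustrative data (assumption), not the patient satisfaction file.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 9]
x3 = [5, 3, 6, 2, 7, 1, 8, 4]
y = [3, 4, 8, 7, 12, 9, 15, 16]

full = ols([x1, x2, x3], y)          # full[2] is the coefficient for X2
e_y = residuals([x1, x3], y)         # Y adjusted for X1 and X3
e_x2 = residuals([x1, x3], x2)       # X2 adjusted for X1 and X3
partial_slope = ols([e_x2], e_y)[1]  # straight-line slope in the partial plot
print(full[2], partial_slope)
```

The two printed values should agree up to rounding error; the same comparison can be made in Minitab by storing residuals from the two auxiliary regressions.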
--------------------------------------------------------------
4.
IQ scores and reading ability
The file readiq.dat contains data (from a text) on 60
elementary school boys, 30 of whom were rated as poor or
very poor readers--at least 2 years below grade level. The
remaining 30 boys read normally, but otherwise resembled
the poor readers in terms of schools, age, family background,
and other variables. The 30 boys with reading problems
consisted of 11 "very poor" readers and 19 who were merely "poor"
readers. In the data file c1= 1 for very poor; c1 = 2 for poor;
c1 = 3 for normal.
The relation of reading disability to IQ measures is currently seen
not to be as simple as "poor readers have lower intelligence".
We have in column c4 the full-scale WISC-R IQ score. In c2 we
have the attention/concentration sub-scale score (composed of
arithmetic, digit-span, coding subtests). In c3 we have the
spatial ability sub-scale score (composed of picture completion,
block design, object assembly subtests).
a) Obtain a scatterplot
for the attention/concentration and spatial ability scores
with the reading ability level (1,2,3) in c1 used to identify
each individual (e.g. c1 = 1 gets an "A" label etc)
b)
For the normal readers, use the subscale scores in c2 and
c3 to form a prediction equation for the full-scale WISC-R
scores in c4. What are the coefficients and squared multiple
correlation for this regression fit? Plot the residuals
versus the fits for this regression. Obtain a 95% prediction
interval for the full-scale score for an individual having
attention/concentration score of 32 and spatial ability score
of 30.
----------------------------------------------------------------
Part II HW3 after next lecture cycle (2/28-)
Regression with Group Membership Variables
------------------------------------------
5. Consider a one-way classification with four levels (I = 4).
We are given the population cell means (mu(1) through mu(4))
as: 7, 9, 6, 15.
Consider the general linear model setup (with 3 group membership
indicators)
E(Y|G1,G2,G3) = beta0 + beta1*G1 + beta2*G2 + beta3*G3
where
G1 = 1 if treatment 2;  G1 = 0 otherwise
G2 = 1 if treatment 3;  G2 = 0 otherwise
G3 = 1 if treatment 4;  G3 = 0 otherwise
a. Determine the values for the 4 betas in the regression model
b. Express mu(3) - mu(2) in terms of the betas. Check by numerical
substitution.
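
As a hedged sketch of the indicator coding above (not part of the original assignment), the code below converts a list of cell means into betas and back again. It assumes treatment 1 is the reference level, as the G1-G3 definitions imply; the function names are made up for this illustration.

```python
def betas_from_cell_means(mu):
    """Reference-cell coding: beta0 is the mean of treatment 1, and each
    later beta is that treatment's mean minus the reference mean."""
    beta0 = mu[0]
    return [beta0] + [m - beta0 for m in mu[1:]]

def cell_mean(betas, treatment):
    """E(Y | G1, G2, G3) for treatment 1..I under the same coding."""
    indicators = [1 if treatment == k else 0 for k in range(2, len(betas) + 1)]
    return betas[0] + sum(b * g for b, g in zip(betas[1:], indicators))

mu = [7, 9, 6, 15]                 # population cell means from the problem
betas = betas_from_cell_means(mu)

# Round trip: the betas reproduce every cell mean.
recovered = [cell_mean(betas, t) for t in range(1, 5)]

# A contrast between two non-reference groups is a difference of two betas.
contrast_3_vs_2 = betas[2] - betas[1]
print(betas, recovered, contrast_3_vs_2)
```

The round trip is the "check by numerical substitution" step: plugging the indicator values for each treatment into the model should return the original cell mean.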
---------------------------------------------------------------
6. File salary.dat contains data from a salary survey discussed
in lecture: C1 is experience,
c2 is education level (1 for HS, 2 for BS, 3 for advanced degree),
c3 indicates management position (=1) or not,
and c4 is the outcome measure salary.
First, code the 3 levels of education using 2 group membership
indicators (so that education is not used as an interval scale).
In the solutions we use HS as the base --0 0 code.
What is the single best predictor of salary?
Predict salary using experience, education, and management.
Add to the model two management-education interaction terms. Do
these terms add significantly to the prediction?
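
A minimal sketch of the indicator coding and interaction terms described above, assuming HS is the 0 0 base as stated. The function names are hypothetical; in Minitab the same columns could be built with let commands.

```python
def education_indicators(level):
    """Map education level (1 = HS, 2 = BS, 3 = advanced) to two 0/1
    indicators (bs, adv), with HS as the 0 0 base category."""
    if level not in (1, 2, 3):
        raise ValueError("education level must be 1, 2, or 3")
    return (1 if level == 2 else 0, 1 if level == 3 else 0)

def interaction_terms(level, management):
    """Management-by-education products used in the interaction model."""
    bs, adv = education_indicators(level)
    return (bs * management, adv * management)

codes = [education_indicators(lv) for lv in (1, 2, 3)]
products = [interaction_terms(lv, 1) for lv in (1, 2, 3)]
print(codes, products)
```

With this coding, each education coefficient is a contrast against HS, and each interaction coefficient is the extra management differential at that education level relative to HS.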
Give an interval estimate of the value of an additional year of
experience.
Repeat for an advanced degree in addition to the BS
(i.e., the comparison asked for here is between advanced degree
and H.S.; I am *not* asking for the differential between advanced
degree and B.S., which is a harder thing to do in this coding,
although it can be done).
--------------------------------------------------------------------
7. (former quiz question)
A study of several hundred professors' salaries in a large
American university in 1969 (AER, 1973, p.469) yielded the following
prediction equation: S = 1900 + 230*B + 18*A + 100*E + 490*D + 190*Y
+ 50*T - 2400*X where S is annual salary, B is number of books
written, A number of ordinary articles, E number of excellent
articles, D number of Ph.D.'s supervised, Y years experience, T = 1
if student evaluations above median, 0 otherwise, X = 1 if female, 0
otherwise.
For a prof with B=A=E=D=X=1 and Y=5, what's the
expected change in salary if she goes from very good to poor student
evaluations?
Mean salaries were $16,100 for males and $11,200 for females.
What is the value of the slope from a simple S on X regression?
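
As a hedged illustration, the prediction equation quoted above can be written as a function and evaluated for the stated profile. The sketch assumes that "very good" versus "poor" evaluations correspond to T = 1 versus T = 0. It also uses the fact that the slope in a simple regression on a 0/1 indicator equals the difference between the two group means.

```python
def predicted_salary(B, A, E, D, Y, T, X):
    """Prediction equation quoted in the problem (AER, 1973, p.469)."""
    return 1900 + 230*B + 18*A + 100*E + 490*D + 190*Y + 50*T - 2400*X

profile = dict(B=1, A=1, E=1, D=1, Y=5, X=1)
above = predicted_salary(T=1, **profile)   # evaluations above the median
below = predicted_salary(T=0, **profile)   # evaluations not above the median
change = below - above                     # all other predictors held fixed

# Simple S on X regression: with a 0/1 predictor, the fitted slope is
# mean(S | X = 1) - mean(S | X = 0).
mean_male, mean_female = 16100, 11200
simple_slope = mean_female - mean_male
print(above, below, change, simple_slope)
```

Comparing simple_slope with the coefficient of X in the multiple regression shows how much the female differential changes once the other predictors are held fixed.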
-------------------------------------------------------------------
Analysis of Covariance and Extension
--------------------------------------------------------------------
8. A researcher is studying the effect of an incentive on the
retention of subject matter and is also interested in the role of
time devoted to study. Subjects are randomly assigned to two groups,
one receiving (C3 = 1) and the other not receiving (C3 = 0) an
incentive. Within these groups, subjects are randomly assigned to 5,
10, 15, or 20 minutes of study (C2) of a passage specifically
prepared for the experiment. At the end of the study period, a test
of retention (C1) is administered. We treat the study time as a
covariate for investigating the differential effects of the
incentive.
Part I: ANCOVA
Use the Minitab output below to answer the following questions.
(This is a quiz question from prior year)
(for reference raw data are in file retention.dat)
What is the slope of the C1 on C2
regression line for the 12 subjects in the incentive group?
What is the correlation between C1 and C2 for the incentive group?
Construct a 99% confidence interval for the analysis of covariance treatment
effect.
MTB > ancova c1 = c3;
SUBC> covariates c2;
SUBC> means c3.
Analysis of Covariance for C1
Source DF ADJ SS MS
Covariates 1 42.008 42.008
C3 1 100.042 100.042
Error 21 30.575 1.456
Total 23 172.625
Covariate Coeff Stdev t-value
C2 0.2367 0.0441 5.371
ADJUSTED MEANS
C3 N C1
0 12 5.8333
1 12 9.9167
MTB > describe c1-c2;
SUBC> by c3.
          C3     N     MEAN   MEDIAN   STDEV
C1         0    12    5.833    5.500   1.850
           1    12    9.917   10.000   1.782
C2         0    12    12.50    12.50    5.84
           1    12    12.50    12.50    5.84
MTB > let c4 = c2*c3
MTB > regress c1 3 c3 c2 c4
The regression equation is
C1 = 2.50 + 4.83 C3 + 0.267 C2
- 0.0600 C4
Predictor Coef Stdev
Constant 2.5000 0.8646
C3 4.833 1.223
C2 0.26667 0.06314
C4 -0.06000 0.08929
MTB > regress c1 2 c3 c2
The regression equation is
C1 = 2.87 + ???? C3 + ????? C2
Predictor Coef Stdev
Constant 2.8750 0.6517
C3 ?????? 0.4926
C2 ??????? 0.04406
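
A hedged sketch of the interval arithmetic for the ANCOVA treatment effect, using only numbers printed in the output above. The critical value 2.831 is an assumed table lookup for the 0.995 quantile of t with 21 error degrees of freedom, and the function name is hypothetical.

```python
def confidence_interval(estimate, std_error, t_crit):
    """Two-sided interval: estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return (estimate - half_width, estimate + half_width)

adj_mean_control = 5.8333     # ADJUSTED MEANS row for C3 = 0
adj_mean_incentive = 9.9167   # ADJUSTED MEANS row for C3 = 1
effect = adj_mean_incentive - adj_mean_control

se_effect = 0.4926            # Stdev of the C3 coefficient, parallel-lines fit
t_crit = 2.831                # assumed table value: t(0.995), df = 21
low, high = confidence_interval(effect, se_effect, t_crit)
print(effect, low, high)
```

Because both groups have the same mean study time (12.50), the adjusted means coincide with the raw group means here.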
----------------------------------------
Part II CNRL analysis (optional, more next cycle)
Now let's look at these data from scratch.
The full data are in file retention.dat (as described above)
Carry out a full comparing nonparallel regression lines (CNRL) analysis.
The CNRL paper is linked on the course outline.
Obtain a 99% confidence interval for the effect of the incentive for
12.5 minutes of study. ("pick-a-point" procedure)
Obtain a 95% simultaneous interval for the effect of the incentive over
the entire range of study times. (simultaneous J-N procedure)
================================
END HW3