Education 257A  FINAL PROBLEMS, Winter 2005, 
      March 10, 2005

Solutions for these problems are to be submitted in hard-copy
form. Given that these problems are untimed, some care should be
taken in presentation, clarity, and format.  Especially important is
to give full and clear answers to questions, not just to submit
unannotated computer output, although relevant output should
be included.
You may use any inanimate resources--no collaboration.  This
work is done under Stanford's Honor Code.
Please read the questions carefully and answer the question that
is asked.  
Papers will be scored into 3 categories: "Excellent" indicates
successful completion of all parts of all questions (within
perhaps one or two very trivial arithmetic errors);
"Satisfactory" indicates a good attempt was made at all parts of
all problems, but there were some serious errors or omissions;
"Incomplete" indicates inadequate effort or performance. 

Place completed hard copy in Rogosa's Cubberley or Sequoia Hall mailbox 
by 5PM Friday 3/18

Data sets: I took the extra effort to link the data directly
from this assignment document. The data reside in the class HW
directory; the URL is http://statistics.stanford.edu/~rag/ed257/hw/[file]


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 1

A few years back a large national study called High School and
Beyond (HSB) was undertaken.  We consider here a small subset
of the data gathered.

THE HIGH SCHOOL AND BEYOND DATA SET
A nationally representative sample of 200 high school seniors
contained in file 
hsb.dat
in the course HW directory.
hsb.dat contains 10 columns:
c1: SEX    1=MALE    2=FEMALE
c2: RACE   1=HISPANIC 2=ASIAN 3=BLACK 4=WHITE
c3:  SES   1=LOWER   2=MIDDLE    3=UPPER
   (socio-economic status)
c4: SCHOOL TYPE   1=PUBLIC    2=PRIVATE               
c5: HIGH SCHOOL PROGRAM   1=GENERAL  2=ACADEMIC   3=VOCATIONAL
c6:   RDG       READING T-SCORE             
c7:   WRTG      WRITING T-SCORE             
c8:   MATH      MATH T-SCORE                
c9:   SCI       SCIENCE T-SCORE             
c10:  CIV       CIVICS T-SCORE 

The three parts below ask you to carry out an assortment of
statistical tasks.

Part 1   
a) Obtain mean and variance of the science test score for each 
   type of high school program 
   Carry out an anova for this one-way classification (3 levels)
   Test the omnibus null hypothesis of no differences between the 
   group means using Type I error rate .01.
b) Obtain the power of the test in part a; assume the population
   group means are the rounded-to-the-nearest integer values of the 
   sample means and that the experimental error variance (sigma)^2 is 
   the rounded-to-the-nearest integer value of MSW.
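
A minimal sketch of one way to compute the power in part b, assuming
Python with NumPy and SciPy are available (the course does not specify
software for this step). The function name anova_power and all numeric
inputs are hypothetical placeholders, not values taken from hsb.dat.
The calculation uses the noncentral F distribution with noncentrality
parameter sum of n_i * (mu_i - mu)^2 / sigma^2.

```python
import numpy as np
from scipy import stats


def anova_power(means, sizes, sigma2, alpha):
    """Power of the one-way ANOVA F test for assumed population means."""
    means = np.asarray(means, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    k = len(means)
    n_total = sizes.sum()
    # Size-weighted grand mean of the assumed population group means
    grand = np.sum(sizes * means) / n_total
    # Noncentrality parameter of the F statistic under the alternative
    ncp = np.sum(sizes * (means - grand) ** 2) / sigma2
    df1, df2 = k - 1, n_total - k
    # Critical value of the central F at the chosen Type I error rate
    f_crit = stats.f.ppf(1 - alpha, df1, df2)
    # Power = P(noncentral F exceeds the critical value)
    return stats.ncf.sf(f_crit, df1, df2, ncp)


# Hypothetical inputs for illustration only (NOT from hsb.dat)
power = anova_power(means=[50, 54, 47], sizes=[40, 100, 60],
                    sigma2=90, alpha=0.01)
print(round(power, 3))
```

Power charts (e.g. those in NWK) are indexed by phi rather than the
noncentrality parameter; the two are related by ncp = k * phi^2 for k
groups.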
--------------
Part 2
Use c7 (writing score) as the outcome measure.
Consider a two-way cross-classification defined by gender in c1
crossed with private vs public school-type in c4 to create a 2x2
design. 
a. Obtain cell sizes, cell means and construct profile plot
   for this two factor design
b. Carry out a two-way anova using glm
   Test main effects and interaction
   Give conclusions, keep overall Type I error
   rate <= .05
c. Compare the results from the unweighted means procedure
   (e.g. Miller text) with the results from part b.
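
A sketch of the unweighted means calculation for a two-way layout with
unequal cell sizes, assuming Python with NumPy and SciPy. It follows
the usual textbook recipe (sums of squares computed from cell means
and scaled by the harmonic mean of the cell sizes, error term pooled
within cells); check it against the version in the Miller text before
relying on it. The function name and the simulated data are
hypothetical, not from hsb.dat.

```python
import numpy as np
from scipy import stats


def unweighted_means_anova(cells):
    """cells[i][j]: 1-D array of outcomes for level i of A, level j of B."""
    a, b = len(cells), len(cells[0])
    n = np.array([[len(c) for c in row] for row in cells], dtype=float)
    means = np.array([[np.mean(c) for c in row] for row in cells])
    # Pooled within-cell error sum of squares and mean square
    sse = sum(np.sum((np.asarray(c) - np.mean(c)) ** 2)
              for row in cells for c in row)
    df_err = n.sum() - a * b
    mse = sse / df_err
    # Harmonic mean of the cell sizes
    n_h = a * b / np.sum(1.0 / n)
    # Unweighted (simple) averages of the cell means
    row_m, col_m, grand = means.mean(axis=1), means.mean(axis=0), means.mean()
    ss_a = n_h * b * np.sum((row_m - grand) ** 2)
    ss_b = n_h * a * np.sum((col_m - grand) ** 2)
    inter = means - row_m[:, None] - col_m[None, :] + grand
    ss_ab = n_h * np.sum(inter ** 2)
    out = {}
    for name, ss, df in [("A", ss_a, a - 1), ("B", ss_b, b - 1),
                         ("AxB", ss_ab, (a - 1) * (b - 1))]:
        f = (ss / df) / mse
        out[name] = (f, stats.f.sf(f, df, df_err))
    return out


# Simulated 2x2 data with unequal cell sizes (illustration only)
rng = np.random.default_rng(1)
sizes = [[30, 8], [45, 12]]
cells = [[rng.normal(50 + 2 * i + 3 * j, 9, size=sizes[i][j])
          for j in range(2)] for i in range(2)]
results = unweighted_means_anova(cells)
for name, (f, p) in results.items():
    print(f"{name}: F = {f:.2f}, p = {p:.4f}")
```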
--------------
Part 3
a. Now add SES at two levels (middle, c3=2, and upper, c3=3) as a
   third factor in this cross-classification to create a 2x2x2 design
   (gender X school-type X SES (middle/upper)). Note: we are setting
   aside the cases with lower SES (c3 = 1).
   Obtain cell sizes, cell means and construct profile plots for 
   this three factor design
   Carry out a three-way anova using glm
   Test main effects and interactions.
   Give conclusions, keep overall Type I error rate <= .10.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Problem 2.

Can pleasant aromas help a student learn better? 

Hirsch and Johnston, of the Smell & Taste Treatment and Research 
Foundation, believe that the presence of a floral scent can improve a 
person's learning ability in certain situations. In their experiment, 
22 people worked through a set of two pencil and paper mazes six 
times, three times while wearing a floral-scented mask and three times 
wearing an unscented mask. Individuals were randomly assigned to wear 
the floral mask on either their first three tries or their last three 
tries. Participants put on their masks one minute before starting the 
first trial in each group to minimize any distracting effect. Subjects 
recorded whether they found the scent inherently positive, inherently 
negative, or if they were indifferent to it. Testers measured the 
length of time it took subjects to complete each of the six trials.

In file  scent.dat
C1 ID:
C2 Sex: M=male, F=female
C3 Age: Age in years
C4 Smoker: Y if subject smoked, N if did not
C5 Opinion: "pos" if subject found the odor inherently positive, 
            "indiff" if indifferent, "neg" if inherently negative
C6 Order: 1 if did unscented trials first, 2 if did scented trials first
C7 U-Trial 1: length of time required for first unscented trial
C8 U-Trial 2 : length of time required for second unscented trial
C9 U-Trial 3: length of time required for third unscented trial
C10 S-Trial 1 : length of time required for first scented trial
C11 S-Trial 2 : length of time required for second scented trial
C12 S-Trial 3: length of time required for third scented trial 

There are various structures of these data to investigate in 
understanding the possible effect of aroma. Give some thought and 
effort to find a good analysis for these (repeated measures) 
data. One way to use the outcome measures is to look at 
improvement, which could be expressed as the percentage change in 
speed of completion from the first trial to the third trial for 
each maze. Are there better approaches?

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 3

Worth it at any price?  

We have data on Educational expenditures for 35 Massachusetts
towns in the early 1970's; data are a subset from M. S. Feldstein
(American Economic Review, March 1975). 

File school.dat
has columns: 
Educational Expenditure per public school Pupil (EEP) is in c5; 
Median Family Income (MFI) is in c1; 
percent of local tax base in Residential property (RES) is in c2; 
Public school Students per Capita (PSC) is in c3;
Taxable property Value per public school Pupil (TVP) is in c4. We
are interested in using these data to describe factors that
predict EEP.

a. Carry out a multiple regression fit predicting EEP from all 4 
possible predictors (c1-c4). Is collinearity a concern for these 
data in fitting this prediction equation? Explain (cite evidence) 
briefly.

b. For a town with MFI = 9756, RES = 29, PSC = .181, TVP= 22569,
construct a 99% confidence interval for expected EEP.
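
A sketch of the confidence interval for the expected response at a
given predictor vector x0, assuming Python with NumPy and SciPy. It
uses the standard result that the fitted value x0'b has estimated
variance MSE * x0'(X'X)^(-1)x0. The function name and the simulated
data are hypothetical; the real predictors would come from school.dat.

```python
import numpy as np
from scipy import stats


def mean_response_ci(X, y, x0, level):
    """Return (lower, fit, upper) for E[y | x0] in an OLS fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    x0 = np.concatenate([[1.0], np.asarray(x0, dtype=float)])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    df = len(y) - X1.shape[1]
    mse = resid @ resid / df
    # Standard error of the estimated mean response at x0
    se = np.sqrt(mse * (x0 @ np.linalg.inv(X1.T @ X1) @ x0))
    fit = x0 @ beta
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df)
    return fit - t_crit * se, fit, fit + t_crit * se


# Simulated data: 35 cases, 4 predictors (illustration only)
rng = np.random.default_rng(2)
X = rng.normal(size=(35, 4))
y = 3 + X @ np.array([1.0, -0.5, 0.25, 2.0]) + rng.normal(scale=0.8, size=35)
lo, fit, hi = mean_response_ci(X, y, x0=[0.2, -0.1, 0.0, 0.5], level=0.99)
print(f"fit = {fit:.3f}, 99% CI = ({lo:.3f}, {hi:.3f})")
```

Note that this is the interval for the mean response, which is
narrower than a prediction interval for a single new town.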

c. To get a closer look at diagnostics etc. for the multiple
regression in part (a), construct the partial regression plot for
the coefficient of TVP in the prediction of EEP. (Hint: In NWK notation
this plot has on the vertical axis e(EEP | MFI, RES, PSC) and on
the horizontal axis e(TVP | MFI, RES, PSC).) Give the equation of
a straight-line fit, using least-squares, to the points in that
plot.
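
A sketch of how the two sets of residuals in the hint can be formed,
assuming Python with NumPy. The simulated variables (others, target,
y) are hypothetical stand-ins for (MFI, RES, PSC), TVP, and EEP. The
sketch also illustrates a useful check: the least-squares line through
the residual-versus-residual points has intercept zero and slope equal
to the coefficient of the target predictor in the full regression.

```python
import numpy as np


def residuals(X, y):
    """Residuals from an OLS fit of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ beta


# Simulated data (illustration only)
rng = np.random.default_rng(3)
others = rng.normal(size=(35, 3))                    # stand-in for MFI, RES, PSC
target = 0.6 * others[:, 0] + rng.normal(size=35)    # stand-in for TVP
y = (1 + others @ np.array([0.5, -1.0, 0.3]) + 2.0 * target
     + rng.normal(size=35))                          # stand-in for EEP

e_y = residuals(others, y)        # vertical axis: e(y | others)
e_x = residuals(others, target)   # horizontal axis: e(target | others)
slope, intercept = np.polyfit(e_x, e_y, 1)

# Coefficient of the target predictor in the full four-predictor fit
X_full = np.column_stack([np.ones(35), others, target])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
print(slope, intercept, beta_full[-1])
```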

d. Carry out a test with Type I error rate .01 of the null
hypothesis that the coefficients of MFI, RES, and PSC are all
zero in the regression model of part (a) against the alternative
that not all are zero.
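
A sketch of the general linear test for dropping a subset of
predictors, assuming Python with NumPy and SciPy. The statistic is
F = [(SSE_R - SSE_F) / q] / [SSE_F / (n - p)], where the reduced model
omits q coefficients and the full model has p parameters. Function
names and simulated data are hypothetical, not from school.dat.

```python
import numpy as np
from scipy import stats


def sse(X, y):
    """Error sum of squares from an OLS fit (X already includes the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r


def general_linear_test(X_full, X_reduced, y):
    """F statistic and p-value comparing nested full and reduced models."""
    n, p_full = X_full.shape
    q = p_full - X_reduced.shape[1]      # number of coefficients tested
    sse_f, sse_r = sse(X_full, y), sse(X_reduced, y)
    f = ((sse_r - sse_f) / q) / (sse_f / (n - p_full))
    return f, stats.f.sf(f, q, n - p_full)


# Simulated data: test whether the first three predictors can be dropped
rng = np.random.default_rng(4)
n = 35
Z = rng.normal(size=(n, 4))
y = 2 + 1.5 * Z[:, 3] + 0.4 * Z[:, 0] + rng.normal(size=n)
ones = np.ones(n)
X_full = np.column_stack([ones, Z])
X_reduced = np.column_stack([ones, Z[:, 3]])
f_stat, p_value = general_linear_test(X_full, X_reduced, y)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```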

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 4

 Categorical Predictor Variables in both Non-experimental
and Experimental Settings.

Part 1--Nonexperimental Data
For our class example of the NELS data (n=4804) [data nels.dat linked on 
class examples web-page], taken from the National Educational Longitudinal Study of 1988 (NELS:88) 
(see description in course files and examples), let's examine the following.
Obtain a prediction equation for students' 10th-grade scores on the science 
achievement test using the predictors: 8th grade science achievement and
socio-economic status (indicator variables define 3 levels of SES: lowest
quartile, middle SES, highest quartile).  Employ the indicator variables to
obtain a prediction equation for the outcome that allows for different
10th grade on 8th grade regression lines for the three SES groups. 
Now consider a student with a score of 20 on the 8th grade test. 
From the previous regression equation, estimate what the difference in his 
predicted outcome would be if he were high SES versus if he were middle SES.
Conduct a statistical test of the null hypothesis that the 10th grade on 8th 
grade regression slopes are identical for the three SES groups.
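
A sketch of one way to set up indicator variables so that each SES
group gets its own intercept and slope, assuming Python with NumPy.
The coding (middle SES as the reference group), the variable names,
and the simulated data are all assumptions for illustration; nels.dat
may code SES differently. With this coding, the high-versus-middle
difference in predicted outcome at a pretest score x0 is
b_high + b_high:x * x0.

```python
import numpy as np

# Simulated data (illustration only, NOT nels.dat)
rng = np.random.default_rng(0)
n = 300
x = rng.normal(20, 5, n)          # hypothetical 8th-grade score
ses = rng.integers(0, 3, n)       # hypothetical coding: 0=low, 1=middle, 2=high
low = (ses == 0).astype(float)
high = (ses == 2).astype(float)
y = (5 + 0.9 * x + 1.5 * high - 1.0 * low + 0.05 * high * x
     + rng.normal(0, 3, n))       # hypothetical 10th-grade score

# Columns: intercept, x, low, high, low*x, high*x (middle SES is reference)
X = np.column_stack([np.ones(n), x, low, high, low * x, high * x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predicted high-minus-middle difference for a student with x0 = 20
x0 = 20.0
diff_high_vs_middle = b[3] + b[5] * x0
print(round(diff_high_vs_middle, 3))
```

The hypothesis of identical slopes corresponds to both interaction
coefficients (low*x and high*x) being zero, which can be tested by
comparing this model with the one that omits those two columns.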

Extra credit: are there significant gender differences in the prediction of 
10th-grade science by eighth-grade science and SES?
---------------------------------------------------

Part 2. Experimental Data
 SLEEP Looking forward to getting more?   

A simple experiment compared the effectiveness of two sedatives
in promoting length of sleep, labelled here as Drug A and Drug B.
Two groups of size 10 were formed by random assignment (in c3,
group membership is coded A = 1, B = 2). The outcome measure in
c1 is the number of hours of sleep obtained by the subject after
taking the drug; the covariate in c2 is the number of hours of
sleep obtained by the subject normally with no medication.
Data reside in file  sleep.dat
------------

a. Construct a 90% confidence interval for difference of
group means on the outcome measure in c1 (i.e. do not use covariate c2 
information).

b. Now consider use of covariate information in c2. 
What are the sample within-group c1-on-c2 slopes? 
Carry out a preliminary test of the ancova assumption of equal c1-on-c2
slopes in each group with Type I error rate .10.  

c. Obtain a point and interval estimate for the 
analysis of covariance treatment effect.  Use confidence coefficient .90.
Compare the width of this confidence interval with part (a).  
Did use of the covariate help in the estimation? Comment.
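
A sketch of the analysis of covariance treatment effect and its
confidence interval, assuming Python with NumPy and SciPy and assuming
a common within-group slope. The function name, the 0/1 group coding,
and the simulated data are hypothetical (sleep.dat codes groups as 1
and 2). The estimate is the coefficient of the group indicator in a
regression of the outcome on the indicator and the covariate.

```python
import numpy as np
from scipy import stats


def ancova_effect(y, x, g, level):
    """Return (estimate, lower, upper, slope) for the adjusted group effect.

    g is a 0/1 group indicator; x is the covariate; common slope assumed.
    """
    X = np.column_stack([np.ones(len(y)), g, x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = len(y) - X.shape[1]
    mse = resid @ resid / df
    cov = mse * np.linalg.inv(X.T @ X)
    se = np.sqrt(cov[1, 1])          # standard error of the group coefficient
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df)
    return beta[1], beta[1] - t_crit * se, beta[1] + t_crit * se, beta[2]


# Simulated data: two groups of 10 (illustration only, NOT sleep.dat)
rng = np.random.default_rng(6)
g = np.repeat([0.0, 1.0], 10)
x = rng.normal(7, 1, 20)                       # hypothetical covariate
y = 1 + 0.8 * x + 1.0 * g + rng.normal(0, 0.5, 20)
est, lo, hi, slope = ancova_effect(y, x, g, level=0.90)
print(f"effect = {est:.3f}, 90% CI = ({lo:.3f}, {hi:.3f})")
```

The estimate equals the raw difference in outcome means minus the
pooled slope times the difference in covariate means, which is the
usual adjusted-means formula.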


=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 

=-=-=-=-=-=-=-=-=-=
END