Education 257A FINAL PROBLEMS, Winter 2005, March 10, 2005

Solutions for these problems are to be submitted in hard-copy form. Given that these problems are untimed, some care should be taken in presentation, clarity, and format. It is especially important to give full and clear answers to questions, not just to submit unannotated computer output, although relevant output should be included.

You may use any inanimate resources--no collaboration. This work is done under Stanford's Honor Code. Please read the questions carefully and answer the question that is asked.

Papers will be scored into 3 categories:
"Excellent" indicates successful completion of all parts of all questions (within perhaps one or two very trivial arithmetic errors);
"Satisfactory" indicates a good attempt was made at all parts of all problems, but there were some serious errors or omissions;
"Incomplete" indicates inadequate effort or performance.

Place completed hard copy in Rogosa's Cubberley or Sequoia Hall mailbox by 5PM Friday 3/18.

Data sets: I took the extra effort to link data directly from this assignment document. Data reside in the class HW directory; the URL is http://statistics.stanford.edu/~rag/ed257/hw/[file]

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 1

A few years back a large national study called High School and Beyond (HSB) was undertaken. We consider here a small subset of the data gathered.

THE HIGH SCHOOL AND BEYOND DATA SET
A nationally representative sample of 200 high school seniors, contained in file hsb.dat in the course HW directory.
hsb.dat contains 10 columns:
c1: SEX 1=MALE 2=FEMALE
c2: RACE 1=HISPANIC 2=ASIAN 3=BLACK 4=WHITE
c3: SES 1=LOWER 2=MIDDLE 3=UPPER (socio-economic status)
c4: SCHOOL TYPE 1=PUBLIC 2=PRIVATE
c5: HIGH SCHOOL PROGRAM 1=GENERAL 2=ACADEMIC 3=VOCATIONAL
c6: RDG READING T-SCORE
c7: WRTG WRITING T-SCORE
c8: MATH MATH T-SCORE
c9: SCI SCIENCE T-SCORE
c10: CIV CIVICS T-SCORE

The three parts below ask you to carry out an assortment of statistical tasks.

Part 1
a) Obtain the mean and variance of the science test score for each type of high school program. Carry out an anova for this one-way classification (3 levels). Test the omnibus null hypothesis of no differences between the group means using Type I error rate .01.
b) Obtain the power of the test in part a; assume the population group means are the rounded-to-the-nearest-integer values of the sample means and that the experimental error variance (sigma)^2 is the rounded-to-the-nearest-integer value of MSW.
--------------
Part 2
Use c7 (writing score) as the outcome measure. Consider a two-way cross-classification defined by gender in c1 crossed with private vs public school-type in c4 to create a 2x2 design.
a. Obtain cell sizes and cell means, and construct a profile plot for this two-factor design.
b. Carry out a two-way anova using glm. Test main effects and interaction. Give conclusions; keep overall Type I error rate <= .05.
c. Compare the results from the unweighted means (e.g. Miller text) procedure with the results from part b.
--------------
Part 3
a. Now add SES at two levels--middle (c3=2) and upper (c3=3) SES--as a third factor in this cross-classification to create a 2x2x2 design (gender X school-type X SES (middle/upper)). Note: we are setting aside the cases with lower SES (c3 = 1). Obtain cell sizes and cell means, and construct profile plots for this three-factor design. Carry out a three-way anova using glm. Test main effects and interactions. Give conclusions; keep overall Type I error rate <= .10.
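As a hedged illustration of the setup behind Part 1(b), not a solution: the power of the one-way anova F test is an upper-tail probability of a noncentral F distribution with degrees of freedom (k-1, N-k) and noncentrality parameter lambda = sum of n_i (mu_i - mu.)^2 / sigma^2, where mu. is the n_i-weighted mean of the group means. The Python sketch below computes only lambda and the degrees of freedom. The group sizes, means, and variance in it are made-up placeholders, not values from hsb.dat, and the function name is hypothetical.

```python
def anova_noncentrality(ns, mus, sigma2):
    """Noncentrality parameter for a one-way anova F test.

    ns     -- group sizes n_i (may be unequal)
    mus    -- assumed population group means mu_i
    sigma2 -- assumed experimental error variance sigma^2
    """
    total_n = sum(ns)
    # Weighted grand mean of the assumed group means.
    grand_mean = sum(n * m for n, m in zip(ns, mus)) / total_n
    # Sum of n_i * (mu_i - grand_mean)^2, scaled by the error variance.
    return sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, mus)) / sigma2


# Placeholder inputs for illustration only (NOT taken from hsb.dat).
ns = [10, 20, 10]
mus = [50, 55, 50]
sigma2 = 100

lam = anova_noncentrality(ns, mus, sigma2)
df1 = len(ns) - 1        # numerator degrees of freedom, k - 1
df2 = sum(ns) - len(ns)  # denominator degrees of freedom, N - k
print(lam, df1, df2)
```

With lambda, df1, and df2 in hand, the power at a given Type I error rate is the probability that a noncentral F variate exceeds the corresponding central F critical value; statistical software or power charts supply that probability.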
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 2.

Can pleasant aromas help a student learn better? Hirsch and Johnston, of the Smell & Taste Treatment and Research Foundation, believe that the presence of a floral scent can improve a person's learning ability in certain situations. In their experiment, 22 people worked through a set of two pencil and paper mazes six times, three times while wearing a floral-scented mask and three times wearing an unscented mask. Individuals were randomly assigned to wear the floral mask on either their first three tries or their last three tries. Participants put on their masks one minute before starting the first trial in each group to minimize any distracting effect. Subjects recorded whether they found the scent inherently positive, inherently negative, or if they were indifferent to it. Testers measured the length of time it took subjects to complete each of the six trials.

In file scent.dat
C1 ID:
C2 Sex: M=male, F=female
C3 Age: Age in years
C4 Smoker: Y if subject smoked, N if did not
C5 Opinion: "pos" if subject found the odor inherently positive, "indiff" if indifferent, "neg" if inherently negative
C6 Order: 1 if did unscented trials first, 2 if did scented trials first
C7 U-Trial 1: length of time required for first unscented trial
C8 U-Trial 2: length of time required for second unscented trial
C9 U-Trial 3: length of time required for third unscented trial
C10 S-Trial 1: length of time required for first scented trial
C11 S-Trial 2: length of time required for second scented trial
C12 S-Trial 3: length of time required for third scented trial

There are various structures of these data to investigate in understanding the possible effect of aroma. Give some thought and effort to find a good analysis for these (repeated measures) data.
One way to use the outcome measures is to look at improvement, which could be expressed as the percentage change in speed of completion from the first trial to the third trial for each maze. Are there better approaches?

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 3

Worth it at any price? We have data on educational expenditures for 35 Massachusetts towns in the early 1970's; data are a subset from M. S. Feldstein (American Economic Review, March 1975). File school.dat

school.dat has columns:
Educational Expenditure per public school Pupil (EEP) is in c5;
Median Family Income (MFI) is in c1;
percent of local tax base in Residential property (RES) is in c2;
Public school Students per Capita (PSC) is in c3;
Taxable property Value per public school Pupil (TVP) is in c4.

We are interested in using these data to describe factors that predict EEP.
a. Carry out a multiple regression fit predicting EEP from all 4 possible predictors (c1-c4). Is collinearity a concern for these data in fitting this prediction equation? Explain (cite evidence) briefly.
b. For a town with MFI = 9756, RES = 29, PSC = .181, TVP = 22569, construct a 99% confidence interval for expected EEP.
c. To get a closer look for diagnostics etc. in the multiple regression in part (a), construct the partial regression plot for the coefficient of TVP in the prediction of EEP. (Hint: In NWK notation this plot has on the vertical axis e(EEP | MFI, RES, PSC) and on the horizontal axis e(TVP | MFI, RES, PSC).) Give the equation of a straight-line fit, using least-squares, to the points in that plot.
d. Carry out a test with Type I error rate .01 of the null hypothesis that the coefficients of MFI, RES, and PSC are all zero in the regression model of part (a) against the alternative that not all are zero.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 4
Categorical Predictor Variables in both Non-experimental and Experimental Settings.
Part 1--Nonexperimental Data
For our class example of the NELS data (n=4804) [data nels.dat linked on class examples web-page], taken from the National Educational Longitudinal Study of 1988 (NELS:88) (see description in course files and examples), let's examine the following.

Obtain a prediction equation for students' 10th-grade scores on the science achievement test using the predictors: 8th-grade science achievement and socio-economic status. (Indicator variables define 3 levels of SES: lowest quartile, middle SES, highest quartile.) Employ the indicator variables to obtain a prediction equation for the outcome that allows for different 10th-grade on 8th-grade regression lines for the three SES groups.

Now consider a student with a score of 20 on the 8th-grade test. From the previous regression equation, estimate what the difference in his predicted outcome would be if he were high SES versus if he were middle SES.

Conduct a statistical test of the null hypothesis that the 10th-grade on 8th-grade regression slopes are identical for the three SES groups.

Extra credit: are there significant gender differences in the prediction of 10th-grade science by 8th-grade science and SES?
---------------------------------------------------
Part 2. Experimental Data
SLEEP Looking forward to getting more?
A simple experiment compared the effectiveness of two sedatives in promoting length of sleep, labelled here as Drug A and Drug B. Two groups of size 10 were formed by random assignment (in c3 group membership is coded A = 1, B = 2). The outcome measure in c1 is the number of hours of sleep obtained by the subject after taking the drug; the covariate in c2 is the number of hours of sleep obtained by the subject normally with no medication. Data reside in file sleep.dat
------------
a. Construct a 90% confidence interval for the difference of group means on the outcome measure in c1 (i.e. do not use covariate c2 information).
b. Now consider use of covariate information in c2. What are the sample within-group c1-on-c2 slopes? Carry out a preliminary test of the ancova assumption of equal c1-on-c2 slopes in each group with Type I error rate .10.
c. Obtain a point and interval estimate for the analysis of covariance treatment effect. Use confidence coefficient .90. Compare the width of this confidence interval with part (a). Did use of the covariate help in the estimation? Comment.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
=-=-=-=-=-=-=-=-=-= END