Education 257
FINAL PROBLEMS, Spring 2005
JUNE 1, 2005

Solutions for these problems are to be submitted in hard-copy form. Given that these problems are untimed, some care should be taken in presentation, clarity, and format. It is especially important to give full and clear answers to questions, not just to submit unannotated computer output, although relevant output should be included. You may use any inanimate resources; no collaboration is permitted. This work is done under Stanford's Honor Code.

Please read the questions carefully and answer the question that is asked.

Papers will be scored into 3 categories:
"Excellent" indicates successful completion of all parts of all questions (within perhaps one or two very trivial arithmetic errors);
"Satisfactory" indicates a good attempt was made at all parts of all problems, but there were some serious errors or omissions;
"Incomplete" indicates inadequate effort or performance.

Place completed hard copy in Rogosa's Cubberley or Sequoia Hall mailbox by 5PM Friday 6/10.

Data sets: I took the extra effort to link data directly from this assignment document. Data reside in the class HW directory; the URL is http://statistics.stanford.edu/~rag/ed257/hw/[file]

One loose end: Course Evaluations for students not in the School of Education. I will place a few blank forms in my Sequoia mailbox, and these can be returned to the Registrar's office if you have not already done one.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 1, Model Building, Variable Selection
Can anyone do math? Was it your parents' fault?

Relation of educational achievement of students to the home environment. Data on average mathematics proficiency (MATHPROF) and the home environment variables were obtained from the 1990 National Assessment of Educational Progress for 37 states, the District of Columbia, Guam, and the Virgin Islands. File mathnaep.dat in the course HW directory contains
the educational achievement of eighth-grade students in mathematics and the following five explanatory variables (all state-level variables):

C1 MATHPROF  average mathematics proficiency
C2 PARENTS   percentage of eighth-grade students with both parents living at home
C3 HOMELIB   percentage of eighth-grade students with three or more types of reading materials at home (books, encyclopedias, magazines, newspapers)
C4 READING   percentage of eighth-grade students who read more than 10 pages a day
C5 TVWATCH   percentage of eighth-grade students who watch TV for six hours or more per day
C6 ABSENCES  percentage of eighth-grade students absent three days or more last month

a. Start with basic data analysis due diligence. Examine scatterplots for anomalous observations and for curvature. Are any transformations needed? Obtain a correlation matrix of the predictor variables and outcome. What is the single best predictor of MATHPROF?

b. Use best-subsets regression methods to identify useful prediction models. What is your best candidate? Compare it with the second-best candidate. For your best model, comment on the observations with the largest standardized residuals.

c. Compare your results in part b with the use of Forward Stepwise regression to determine a prediction model.

d. For the full set of predictor variables, are there any logical candidates for data reduction (i.e., forming composites)? Will any improvements in the regression fits be obtained from using a composite?

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 2 "Simple" Contingency Tables
Cheating Father Time.

In the SF Chronicle, May 20, 2001, the feature "Cheating Father Time: Training, nutrition and medical advances prolonging careers" provides the following data on the increasing longevity of professional athletes.
1990
                                   Number of Players   Percent players
League                             35 and older        35 and older
Major League Baseball                    94                8.4%
National Football League                 12                1%
National Basketball Association          14                3.6%
National Hockey League                   14                2.4%

2000
                                   Number of Players   Percent players
League                             35 and older        35 and older
Major League Baseball                   162               11.7%
National Football League                 44                2.7%
National Basketball Association          41                9.3%
National Hockey League                   56                7.8%

a. For each of the four leagues construct a 2x2 table: player age (35 and older, under 35) and year (1990, 2000). For each table calculate the relative risk of playing (at or) past 35 in the two decades.

b. Consider the year 2000 data. For the 2x4 table of player age by sport, test the null hypothesis of independence. Explain what that null hypothesis actually is saying. Construct a display of actual counts, expected counts under independence, and adjusted residuals from the independence model for each cell in the 2x4 structure.

c. Calculate the following probability: Given that a professional athlete in one of these four leagues is still playing in the year 2000 at age 35 or over, what's the probability he's a baseball player? Do you have all the information you need to calculate this probability?

d. Let's do a meta-analysis. Consider the four leagues as four separate studies. Estimate the overall odds ratio for the 2x2 tables in part a. Give a point estimate of the overall odds ratio and carry out a test that the overall odds ratio is different from 1.0 (independence of year and playing past 35).

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 3 Modeling Multivariate Categorical Data
But would you want to matriculate?

We consider data on admissions for Fall 1973 graduate study at U.C. Berkeley in the six largest departments. These data, among others, were the subject of extensive litigation on gender discrimination a few years back.
The data on each applicant consist of the applicant's gender (G), whether admitted (A), and major department (D).

        Whether admitted, male    Whether admitted, female
Dept    Yes      No               Yes      No
a       512      313               89       19
b       353      207               17        8
c       120      205              202      391
d       138      279              131      244
e        53      138               94      299
f        22      351               24      317

a. To start, construct the marginal AG table (a 2x2 table of gender by admit status). Carry out a test for independence and obtain a point and interval estimate of the odds ratio for admittance for this marginal AG table. What might this result be taken to indicate about gender equity etc. in the admit process? Are you outraged yet?

b. Now use the breakdown by department. Obtain the odds ratio for admittance within each of the 6 departments. Does Simpson's paradox appear to be present in these data? Why or why not?

c. Use Cochran-Mantel-Haenszel procedures to: test whether conditional independence holds for AG; estimate a common odds ratio for the six departments; use the Breslow-Day statistic to test whether the AG odds ratio is the same for the 6 departments.

d. For the possible A G D log-linear models, which model terms would indicate gender discrimination?

e. Fit the set of A G D log-linear models using a procedure such as SAS Proc Genmod, and identify what you regard as the most appropriate model. Does this model confirm gender discrimination in admissions? Examine the log-likelihood chi-square and table the fits and adjusted residuals for this model. Are you satisfied with this model?

f. Set aside department a and rerun the log-linear model analysis. Interpret your preferred model in terms of gender discrimination in admissions. Also comment on the admissions preferences in dept a.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 4 -- Prediction of Binary Outcomes
If you're not crazy yet, you'll do ok.
A psychologist conducted a study to examine the nature of the relation, if any, between an employee's emotional stability (C2) and the employee's ability to perform in a task group (C1). Data on 27 employees are in file stable.dat in the course HW directory. Emotional stability was measured by a written test, and ability to perform in a task group (C1 = 1 if able, C1 = 0 if unable) was evaluated by the supervisor.

a. From an OLS fit for a straight-line relation for predicting C1 from C2, what level of emotional stability seems necessary for a probability of successful performance of .70?

b. Carry out a fit of a logistic response function to these data. What is the predicted probability of success for an employee with the median value of emotional stability? For the logistic fit, what level of emotional stability seems necessary for a probability of successful performance of .75?

c. For both the OLS regression and the logistic curve estimation, list the fitted values for probability of success using the emotional stability values in these data (C2). Comment on the similarity of these two fits.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
END 257 !