Education 257  FINAL PROBLEMS, Spring 2005  JUNE 1, 2005

Solutions for these problems are to be submitted in hard-copy
form. Given that these problems are untimed, some care should be
taken in presentation, clarity, format.  Especially important is
to give full and clear answers to questions, not just to submit
unannotated computer output, although relevant output should
be included.
You may use any inanimate resources--no collaboration.  This
work is done under Stanford's Honor Code.
Please read the questions carefully and answer the question that
is asked.  
Papers will be scored into 3 categories: "Excellent" indicates
successful completion of all parts of all questions (within
perhaps one or two very trivial arithmetic errors);
"Satisfactory" indicates a good attempt was made at all parts of
all problems, but there were some serious errors or omissions;
"Incomplete" indicates inadequate effort or performance. 

Place completed hard copy in Rogosa's Cubberley or Sequoia Hall mailbox 
by 5PM Friday 6/10

Data sets: I took the extra effort to link data directly
from this assignment document. The data reside in the class
HW directory.
URL is[file]

One loose end: Course Evaluations for students not in the School
of Education. I will place a few blank forms in my Sequoia mailbox
and these can be returned to the Registrar's office, if you have not
already done one.


Problem 1, Model Building, Variable Selection

Can anyone do math? Was it your parents' fault?

Relation of educational achievement of students to the home 
environment. Data on average mathematics proficiency (MATHPROF) and 
the home environment variables were obtained from the 1990 National 
Assessment of Educational Progress for 37 states, the District of 
Columbia, Guam, and the Virgin Islands.

File mathnaep.dat, in the course HW directory, contains
the educational achievement of eighth-grade 
students in mathematics and the following five explanatory variables 
(all state-level variables):

C1 MATHPROF average mathematics proficiency 

C2 PARENTS  percentage of eighth-grade students with both parents living at home

C3 HOMELIB  percentage of eighth-grade students with three or more types of
 reading materials at home (books, encyclopedias, magazines, newspapers)

C4 READING  percentage of eighth-grade students who read more than 10 
pages a day

C5 TVWATCH  percentage of eighth-grade students who watch TV for six 
hours or more per day

C6 ABSENCES  percentage of eighth-grade students absent three days or 
more last month

a. Start with basic data analysis due diligence. Examine scatterplots
   for anomalous observations and for curvature. Any transformations needed?
   Obtain a correlation matrix of the predictor variables and outcome.
   What is the single best predictor of mathprof?
b. Use best-subsets regression methods to identify useful prediction models.
   What is your best candidate? Compare with the second best candidate.
   For your best model, comment on the observations with the largest 
   standardized residuals.

c. Compare your results in part b with the use of Forward Stepwise regression
   to determine a prediction model.

d. For the full set of predictor variables, are there any logical candidates
   for data reduction (i.e., forming composites)? Will any improvements in the
   regression fits be obtained from using a composite?


Problem 2  "Simple" Contingency Tables

Cheating Father Time. In the SF Chronicle May 20, 2001 the feature
"Cheating Father Time: Training, nutrition and medical advances
prolonging careers" provides the following data on the increasing
longevity of professional athletes.

1990
                                Number of Players        Percent
                                  35 and older     players 35 and older
Major League Baseball                  94                 8.4%
National Football League               12                 1%
National Basketball Association        14                 3.6%
National Hockey League                 14                 2.4%

2000
                                Number of Players        Percent
                                  35 and older     players 35 and older
Major League Baseball                 162                11.7%
National Football League               44                 2.7%
National Basketball Association        41                 9.3%
National Hockey League                 56                 7.8%

a. For each of the four leagues construct a 2x2 table: player age
(35 and older, under 35) and year (1990, 2000).  For each table
calculate the relative risk of playing (at or) past 35 in the
two decades.

b. Consider the year 2000 data. For the 2x4 table of player
age by sport, test the null hypothesis of independence. Explain
what that null hypothesis actually is saying.  Construct a
display of actual counts, expected counts under independence,
and adjusted residuals from the independence model for each cell
in the 2x4 structure.

c. Calculate the following probability:
Given that a professional athlete in one of these four
leagues is still playing in the year 2000 at age 35 or over,
what's the probability he's a baseball player? Do you
have all the information you need to calculate this?

d. Let's do a meta-analysis. Consider the four leagues as four
separate studies. Estimate the overall odds ratio for the 2x2 tables
in part a. Give a point estimate of the overall odds ratio and carry 
out a test that the overall odds ratio is different from 1.0 
(independence of year and playing past 35).
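
The arithmetic behind parts a and d can be sketched in a few lines of Python. This is only an illustration, not a required method: it assumes the first table is the 1990 data and the second the 2000 data, and it backs out each league's size from the count and the rounded percentage, so the totals are approximate. The function names are made up for this sketch.

```python
# (players 35 and older, percent of league) copied from the two tables.
# ASSUMPTION: the first table is 1990 and the second is 2000.
DATA = {
    "MLB": {"1990": (94, 8.4), "2000": (162, 11.7)},
    "NFL": {"1990": (12, 1.0), "2000": (44, 2.7)},
    "NBA": {"1990": (14, 3.6), "2000": (41, 9.3)},
    "NHL": {"1990": (14, 2.4), "2000": (56, 7.8)},
}


def implied_total(count_35_plus, percent):
    """Back out league size from a count and a rounded percent (approximate)."""
    return round(count_35_plus / (percent / 100))


def two_by_two(league):
    """Return (a, b, c, d): 2000 row (35+, under 35), then 1990 row (35+, under 35)."""
    old_00, pct_00 = DATA[league]["2000"]
    old_90, pct_90 = DATA[league]["1990"]
    n_00 = implied_total(old_00, pct_00)
    n_90 = implied_total(old_90, pct_90)
    return old_00, n_00 - old_00, old_90, n_90 - old_90


def relative_risk(a, b, c, d):
    """Proportion 35+ in 2000 divided by proportion 35+ in 1990."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-product ratio of a single 2x2 table."""
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio across several 2x2 tables."""
    tables = list(tables)
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


tables = {league: two_by_two(league) for league in DATA}
for league, t in tables.items():
    print(f"{league}: table={t} RR={relative_risk(*t):.2f} OR={odds_ratio(*t):.2f}")
print(f"Mantel-Haenszel pooled OR = {mantel_haenszel_or(tables.values()):.2f}")
```

The sketch gives point estimates only; the test that the overall odds ratio differs from 1.0 still has to be carried out separately.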


Problem 3  Modeling Multivariate Categorical Data

But would you want to matriculate? 
   We consider data on admissions for Fall 1973 graduate study at 
U.C. Berkeley in the six largest departments.  These data among others
were the subject of extensive litigation on gender discrimination 
a few years back.

The data on each applicant consist of the applicant's gender (G), 
whether admitted (A) and major department (D).
        Whether admitted, male         Whether admitted, female

Dept       Yes         No                    Yes       No
a          512        313                    89        19 
b          353        207                    17         8
c          120        205                   202       391 
d          138        279                   131       244 
e           53        138                    94       299 
f           22        351                    24       317 

a. To start, construct the marginal AG table (a 2x2 table of gender by admit 
status).  Carry out a test for independence and obtain a point and
interval estimate of the odds ratio for admittance for this marginal AG table.
What might this result be taken to indicate about gender equity, etc., in the 
admit process? Are you outraged yet?

b. Now use the breakdown by department. Obtain the odds ratio for admittance
within each of the 6 departments. Does Simpson's paradox appear to be present 
in these data?  Why or why not?

c. Use Cochran-Mantel-Haenszel procedures to:
   test whether conditional independence holds for AG;
   estimate a common odds ratio for the six departments;
   use the Breslow-Day statistic to test whether the AG odds ratio
   is the same for the 6 departments.

d. For the possible A G D log-linear models, which model terms
would indicate gender discrimination?

e. Fit the set of A G D log-linear models using a procedure such as
SAS Proc Genmod, and identify what you regard as the most appropriate 
model. Does this model confirm gender discrimination in admissions?
Examine the log-likelihood chi-square and table the fits and adjusted 
residuals for this model. Are you satisfied with this model? 

f. Set aside department a and rerun the log-linear model analysis.
Interpret your preferred model in terms of gender discrimination
in admissions. Also comment on the admissions preferences in dept a.
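
As a rough check on the hand calculations in parts a and b, the marginal and within-department odds ratios can be computed directly from the table. The Python sketch below is only an illustration: the function names are made up for this example, and the interval is the usual large-sample (Wald) interval on the log scale, which is one choice among several.

```python
from math import exp, log, sqrt

# (admitted, not admitted) counts by department and gender, copied from the table.
ADMISSIONS = {
    "a": {"male": (512, 313), "female": (89, 19)},
    "b": {"male": (353, 207), "female": (17, 8)},
    "c": {"male": (120, 205), "female": (202, 391)},
    "d": {"male": (138, 279), "female": (131, 244)},
    "e": {"male": (53, 138), "female": (94, 299)},
    "f": {"male": (22, 351), "female": (24, 317)},
}


def odds_ratio(male, female):
    """Odds of admission for men divided by odds of admission for women."""
    (m_yes, m_no), (f_yes, f_no) = male, female
    return (m_yes * f_no) / (m_no * f_yes)


def wald_interval(male, female, z=1.96):
    """Approximate 95% interval for the odds ratio, built on the log scale."""
    se = sqrt(sum(1 / count for count in (*male, *female)))
    center = log(odds_ratio(male, female))
    return exp(center - z * se), exp(center + z * se)


def marginal_counts(data):
    """Collapse over department to get the marginal AG table."""
    totals = {"male": [0, 0], "female": [0, 0]}
    for dept in data.values():
        for gender, (yes, no) in dept.items():
            totals[gender][0] += yes
            totals[gender][1] += no
    return tuple(totals["male"]), tuple(totals["female"])


male_total, female_total = marginal_counts(ADMISSIONS)
marginal_or = odds_ratio(male_total, female_total)
low, high = wald_interval(male_total, female_total)
print(f"marginal OR = {marginal_or:.2f} ({low:.2f}, {high:.2f})")

dept_or = {d: odds_ratio(c["male"], c["female"]) for d, c in ADMISSIONS.items()}
for dept, value in dept_or.items():
    print(f"dept {dept}: OR = {value:.2f}")
```

Comparing the marginal odds ratio with the six within-department values is the comparison part b asks about; the Cochran-Mantel-Haenszel, Breslow-Day, and log-linear analyses are not covered by this sketch.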

Problem 4  Prediction of Binary Outcomes
If you're not crazy yet, you'll do ok.  

A psychologist conducted a study to examine the nature of the relation, 
if any, between an employee's emotional stability (C2) and the
employee's ability to perform in a task group (C1). Data on 27 employees
are in file stable.dat in the course HW directory.

Emotional stability was measured by a written test, and ability to perform 
in a task group (C1 = 1 if able, C1 = 0 if unable) was evaluated by the 

a.   From an OLS fit for a straight-line relation for predicting C1 from C2, 
     what level of emotional stability seems necessary for a probability of 
     successful performance of .70?
b.   Carry out a fit of a logistic response function to these data.
     What is the predicted probability of success for an employee with the 
     median value of emotional stability?
     For the logistic fit, what level of emotional stability seems necessary 
     for a probability of successful performance of .75?
c.   For both the OLS regression and the logistic curve estimation,
     list the fitted-values for probability of success using the emotional 
     stability values in these data (C2). Comment on the
     similarity of these two fits.
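
For parts a and b, the "what level of emotional stability is needed" questions amount to inverting each fitted function. The short derivation below is a sketch in assumed notation: a_0, a_1 stand for the OLS intercept and slope, b_0, b_1 for the fitted logistic coefficients, x for emotional stability (C2), and p for the target probability. None of these symbols come from the assignment itself.

```latex
% OLS straight line: \hat{p} = a_0 + a_1 x
x_{\text{OLS}}(p) = \frac{p - a_0}{a_1}

% Logistic response: \hat{p} = \frac{\exp(b_0 + b_1 x)}{1 + \exp(b_0 + b_1 x)}
\ln\!\left(\frac{p}{1 - p}\right) = b_0 + b_1 x
\quad\Longrightarrow\quad
x_{\text{logit}}(p) = \frac{\ln\!\left(\frac{p}{1 - p}\right) - b_0}{b_1}

% For p = .75 the log-odds term is \ln 3 \approx 1.0986
```

Plugging in the coefficients from your own fits gives the required stability levels for p = .70 (OLS) and p = .75 (logistic).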


END 257 !