Education 257 HW4 April 9, 2005 (Due April 18 2005)
Note: as mentioned in lecture 3/30, it is good to review HW3 problem #8 and its solution.
===================================
Variable Selection and Model Building
=====================================
The job proficiency data described on pp. 356-7 (ver 4) or p. 377 (ver 5)
of NWK are stored in jobprof.dat.
From problem description:
A personnel officer in a governmental agency administered four
newly developed aptitude tests to each of 25 applicants for
entry-level clerical positions in the agency. For purposes of the
study, all 25 applicants were accepted for positions irrespective
of their test scores. After a probationary period, each applicant
was rated for proficiency on the job. The scores on the four
tests X1-X4 are in columns c2-c5 and the job proficiency score Y
is in c1. Note: the data display in NWK has Y in the rightmost
column; that's not the way the data are stored.
======================================================================
Prob 1.
---------
For these data complete the following (parts of) problems:
8.11 (ver4) 9.10 (ver5) parts a-c
8.12 (ver 4) 9.11 (ver5) a, b (use BREG)
8.19 (ver4) 9.18 (ver5) a, b
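For readers working outside Minitab, the following is a minimal Python sketch of an all-subsets search in the spirit of BREG. It is not the Minitab procedure itself, and it uses synthetic stand-in data rather than jobprof.dat; the function names fit_sse and all_subsets are made up for illustration. With the real file one might load it via np.loadtxt("jobprof.dat"), assuming it is whitespace-delimited with Y in the first column.

```python
from itertools import combinations
import numpy as np

def fit_sse(y, X):
    """Error sum of squares from an OLS fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def all_subsets(y, X):
    """Fit every non-empty subset of columns of X; return (cols, R2, adjR2, Cp)."""
    n, k = X.shape
    sst = float(((y - y.mean()) ** 2).sum())
    mse_full = fit_sse(y, X) / (n - k - 1)   # MSE of the full model, used in Cp
    rows = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            sse = fit_sse(y, X[:, list(cols)])
            p = size + 1                      # parameters, including intercept
            r2 = 1 - sse / sst
            adj_r2 = 1 - (sse / (n - p)) / (sst / (n - 1))
            cp = sse / mse_full - (n - 2 * p)  # Mallows' Cp
            rows.append((cols, r2, adj_r2, cp))
    return rows

# Synthetic stand-in for the 25 applicants (NOT the NWK data).
rng = np.random.default_rng(0)
X = rng.normal(100, 15, size=(25, 4))
y = 20 + 0.3 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 5, size=25)

rows = all_subsets(y, X)
for cols, r2, adj_r2, cp in sorted(rows, key=lambda r: r[3]):
    label = ",".join(f"X{c + 1}" for c in cols)
    print(f"{label:12s} R2={r2:.3f} adjR2={adj_r2:.3f} Cp={cp:.2f}")
```

With 4 candidate predictors there are 15 non-empty subsets, so exhaustive search is cheap here. Note that Cp for the full model equals its parameter count by construction, so it carries no information about that model.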
-----------------------------------------------------------
Prob 2
---------
Instead of considering the aptitude tests in Prob 1 as separate
candidate predictors, let's see if some composite of the 4 tests
is useful. (This is probably not a realistic opportunity, as it's
unlikely all 4 tests would be given routinely.)
Construct 2 composite measures:
1. sum (or mean) of the four tests (reasonable since they are on the
same scale) and
2. standardized sum (i.e., standardize each test and add 'em up;
e.g., use Minitab CENTER command).
Which of the composite measures is the better predictor?
How does it compare with some of the prediction eq's in Prob 1?
Consider also forming a composite from the 1st principal
component of the aptitude tests and using that as a predictor.
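As a rough illustration of the three composites, here is a Python sketch on synthetic stand-in scores (not the jobprof.dat values). It assumes the principal component is taken from the correlation matrix of the tests; using the covariance matrix instead is an equally defensible choice when the tests share a scale. The helper name r_squared is hypothetical.

```python
import numpy as np

def r_squared(y, x):
    """R-squared from a simple regression of y on x (squared correlation)."""
    return float(np.corrcoef(y, x)[0, 1] ** 2)

# Synthetic stand-in: 25 applicants, 4 correlated tests (NOT the NWK data).
rng = np.random.default_rng(1)
common = rng.normal(0, 1, size=25)
tests = 100 + 10 * common[:, None] + rng.normal(0, 8, size=(25, 4))
y = 90 + 12 * common + rng.normal(0, 6, size=25)

# Composite 1: raw sum (the mean gives the same R-squared).
raw_sum = tests.sum(axis=1)

# Composite 2: standardize each test, then add.
z = (tests - tests.mean(axis=0)) / tests.std(axis=0, ddof=1)
std_sum = z.sum(axis=1)

# Composite 3: scores on the first principal component.
# eigh returns eigenvalues in ascending order, so the last column is PC1.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(tests, rowvar=False))
pc1 = z @ eigvecs[:, -1]

for name, comp in [("raw sum", raw_sum),
                   ("standardized sum", std_sum),
                   ("first PC", pc1)]:
    print(f"{name:17s} R2 = {r_squared(y, comp):.3f}")
```

The sign of an eigenvector is arbitrary, so pc1 may come out reversed; that does not affect R-squared.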
-----------------------------------------------------------------
Prob 3
-------
Refer to the Course Example pca257 (data in pcamarks on the web-page).
Extend the analysis in the course example by doing the following.
Use the principal components (e.g. obtained using MINITAB pca) for
the six graded homework assignments as potential predictors of
the final exam in 250B, along with the final exam for 250A and
the midterm in 250B. Carry out an appropriate variable selection
procedure to build a prediction equation. What is the most
attractive prediction eq? Why? How competitive is the next most
attractive equation?
==============
Prob 4
-------
In the file 'gpa.dat' are two sets of data each on 100 cases.
The first set is contained in the first three columns and the
second set in the next three columns. For each individual the three
observations are VerbalSAT, MathSAT, and GPA.
The problem asks you to construct a simple cross-validation
procedure. For the first 100 cases predict GPA using the two SAT
scores. This yields estimated regression parameters and a squared
multiple correlation. Now let's turn to the second sample of 100
cases. Use the regression coefficients from the first sample to
form a predicted outcome for each of the 100 individuals in the
second sample. Compute an imitation R-squared and compare with that
for an actual multiple regression for the second sample. Which is
larger? Why? You could of course reverse this process by starting
with the second sample instead of the first.
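The cross-validation steps can be sketched in Python as follows. The data are synthetic stand-ins for the two samples in gpa.dat, and two versions of the "imitation R-squared" are shown since the handout does not pin down a formula: the squared correlation between predicted and observed GPA, and 1 - SSE/SST. Both are assumptions about what is intended; the helper names are hypothetical.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients (intercept first) for y regressed on columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    return beta[0] + X @ beta[1:]

def make_sample(rng, n=100):
    """Synthetic (VerbalSAT, MathSAT) predictors and GPA; NOT the gpa.dat values."""
    verbal = rng.normal(500, 100, size=n)
    math_sat = rng.normal(500, 100, size=n)
    gpa = 1.0 + 0.002 * verbal + 0.002 * math_sat + rng.normal(0, 0.4, size=n)
    return np.column_stack([verbal, math_sat]), gpa

rng = np.random.default_rng(2)
X1, y1 = make_sample(rng)   # first sample of 100 cases
X2, y2 = make_sample(rng)   # second sample of 100 cases

beta1 = ols(y1, X1)                      # fit on sample 1
pred2 = predict(beta1, X2)               # apply sample-1 weights to sample 2

sst2 = float(((y2 - y2.mean()) ** 2).sum())
r2_cross_corr = float(np.corrcoef(pred2, y2)[0, 1] ** 2)
r2_cross_sse = 1 - float(((y2 - pred2) ** 2).sum()) / sst2

fit2 = predict(ols(y2, X2), X2)          # sample 2 fit to itself
r2_own = 1 - float(((y2 - fit2) ** 2).sum()) / sst2

print(f"imitation R2 (squared corr): {r2_cross_corr:.3f}")
print(f"imitation R2 (1 - SSE/SST):  {r2_cross_sse:.3f}")
print(f"R2 from sample 2's own fit:  {r2_own:.3f}")
```

Least squares chooses the weights that minimize SSE (equivalently, maximize the correlation between fitted and observed values) in the sample at hand, so weights imported from another sample cannot do better in that sample.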
---------------------------------------------------------
Advanced Topics (you can treat these as optional but interesting)
Prob 5. Path analysis
Consider the published path analysis depicted in
http://www.stanford.edu/class/ed260/allisonWebex1.jpg
Write out the indicated multiple regression equations
from this path analysis diagram.
From the 5x5 correlation matrix
Correlation Matrix
           class  famsize  ability  esteem  achieve
class       1.00
famsize     -.33    1.00
ability      .39    -.33     1.00
esteem       .14    -.14      .19    1.00
achieve      .43    -.28      .67     .22     1.00
obtain standardized path coefficients and propose a
substantive interpretation.
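Standardized coefficients can be computed directly from a correlation matrix by solving Rxx b = rxy for each equation. The Python sketch below shows the mechanics. The diagram is not reproduced here, so the equation used in the example (achieve regressed on all four other variables) is purely a hypothetical illustration; the actual set of predictors in each equation must be read off the path diagram. The function name path_coefficients is made up.

```python
import numpy as np

names = ["class", "famsize", "ability", "esteem", "achieve"]
lower = [
    [1.00],
    [-.33, 1.00],
    [.39, -.33, 1.00],
    [.14, -.14, .19, 1.00],
    [.43, -.28, .67, .22, 1.00],
]

# Fill in the full symmetric matrix from the lower triangle.
R = np.eye(5)
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        R[i, j] = R[j, i] = r

def path_coefficients(R, names, outcome, predictors):
    """Standardized regression weights and R-squared for one equation."""
    idx = [names.index(p) for p in predictors]
    o = names.index(outcome)
    Rxx = R[np.ix_(idx, idx)]        # correlations among the predictors
    rxy = R[idx, o]                  # correlations of predictors with outcome
    beta = np.linalg.solve(Rxx, rxy)
    r2 = float(rxy @ beta)
    return dict(zip(predictors, beta)), r2

# Hypothetical equation for illustration only (not taken from the diagram).
coefs, r2 = path_coefficients(R, names, "achieve",
                              ["class", "famsize", "ability", "esteem"])
for name, b in coefs.items():
    print(f"{name:8s} {b: .3f}")
print(f"R2 = {r2:.3f}")
```

Each endogenous variable in the diagram gets its own call, with the predictors being the variables that send an arrow into it.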
---------------
Prob 6. Multilevel data
(NELS data from Kreft text)
Data summaries for the 10 school example are given below.
Fit regressions of Math score on Homework.
From these data summaries obtain the three regression slopes
discussed in contextual analysis:
total,
between-school,
within-school pooled.
Verify the Duncan-Cuzort-Duncan relationship.
Table 1
Ten selected schools from NELS-88: within-school means
School   Size   Math mean   Homework mean
 1        23      45.8         1.39
 2        20      42.2         2.35
 3        24      53.2         1.83
 4        22      43.6         1.64
 5        22      49.7         0.86
 6        20      46.4         1.15
 7        67      62.8         3.30
 8        21      49.6         2.10
 9        21      46.3         1.33
10        20      47.8         1.60
Table 1 gives, for each school, the mean math score (number correct)
and the mean amount of homework (in hours per week).
Table 2
Ten selected schools from NELS-88:
within-school dispersions and correlations
School     Dispersion       Correlation
A         55.2    -4.24       -0.52
          -4.24    1.19
B         65.1    -4.65       -0.45
          -4.65    1.63
C        126.3     9.62        0.77
           9.62    1.22
D         94.1    11.9         0.84
          11.9     2.14
E         69.2    -2.71       -0.43
          -2.71    0.57
F         17.0    -1.56       -0.48
          -1.56    0.63
G         31.2     3.24        0.34
           3.24    2.92
H        101.1     7.94        0.71
           7.94    1.22
I         86.6     4.61        0.56
           4.61    0.79
J        120.9    12.3         0.80
          12.3     1.94
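The three slopes can be computed from the summaries alone. The Python sketch below makes several assumptions that should be checked against the Kreft text: schools A-J in Table 2 correspond to schools 1-10 in Table 1 in order; each dispersion matrix is a sample covariance matrix with divisor n - 1, with math first and homework second; and the relationship to verify is b_total = eta2 * b_between + (1 - eta2) * b_within, where eta2 is the between-school share of the homework sum of squares. Because the tabled means are rounded, results will only approximate those from the raw data.

```python
# (size, math mean, homework mean) from Table 1
schools = [
    (23, 45.8, 1.39), (20, 42.2, 2.35), (24, 53.2, 1.83), (22, 43.6, 1.64),
    (22, 49.7, 0.86), (20, 46.4, 1.15), (67, 62.8, 3.30), (21, 49.6, 2.10),
    (21, 46.3, 1.33), (20, 47.8, 1.60),
]
# (var math, cov math-homework, var homework) from Table 2,
# ASSUMING rows A..J line up with schools 1..10.
disp = [
    (55.2, -4.24, 1.19), (65.1, -4.65, 1.63), (126.3, 9.62, 1.22),
    (94.1, 11.9, 2.14), (69.2, -2.71, 0.57), (17.0, -1.56, 0.63),
    (31.2, 3.24, 2.92), (101.1, 7.94, 1.22), (86.6, 4.61, 0.79),
    (120.9, 12.3, 1.94),
]

N = sum(n for n, _, _ in schools)
y_bar = sum(n * m for n, m, _ in schools) / N   # grand mean, math
x_bar = sum(n * h for n, _, h in schools) / N   # grand mean, homework

# Between-school sums of squares and cross-products (x = homework, y = math).
ss_x_between = sum(n * (h - x_bar) ** 2 for n, _, h in schools)
sp_between = sum(n * (h - x_bar) * (m - y_bar) for n, m, h in schools)

# Pooled within-school sums, ASSUMING an n - 1 divisor in Table 2.
ss_x_within = sum((n - 1) * vx for (n, _, _), (_, _, vx) in zip(schools, disp))
sp_within = sum((n - 1) * cxy for (n, _, _), (_, cxy, _) in zip(schools, disp))

b_between = sp_between / ss_x_between
b_within = sp_within / ss_x_within
b_total = (sp_between + sp_within) / (ss_x_between + ss_x_within)

eta_sq = ss_x_between / (ss_x_between + ss_x_within)
dcd = eta_sq * b_between + (1 - eta_sq) * b_within

print(f"between = {b_between:.3f}, within = {b_within:.3f}, total = {b_total:.3f}")
print(f"eta^2 = {eta_sq:.3f}, weighted combination = {dcd:.3f}")
```

The identity holds algebraically for any consistent choice of divisor; the divisor only changes the numerical values of the within-school slope and eta2 slightly.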
========================================
end