Test Scaling and Value-Added Measurement

As currently practiced, value-added assessment relies on a strong assumption about the scales used to measure student achievement, namely that these are interval scales, with equal-sized gains at all points on the scale representing the same increment of student learning. Many of the metrics in which test results are expressed do not have this property (e.g., percentile ranks, normal curve equivalents). However, this property is claimed for the scale scores obtained when tests are scored according to Item Response Theory (IRT).
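
To fix ideas, consider the one-parameter logistic (Rasch) model, a common IRT specification; the notation here is illustrative rather than taken from the paper:

    P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}

Here X_{ij} indicates whether student i answers item j correctly, \theta_i is the student's ability, and b_j is the item's difficulty. The scale score \theta is the quantity claimed to be measured on an interval scale.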

This research investigates the basis for this claim using representational measurement theory (Krantz et al., 1971). Although there are other views of measurement, representational measurement theory is the only tool available for adjudicating such questions objectively. (Rival approaches claim that whether a scale is an interval scale or not depends on one’s “philosophy of measurement”, or else that the question is undecidable.)

This investigation establishes two things.

  • The claim that the IRT scale is an interval scale depends on an arbitrary assumption about the way item difficulty and individual ability interact in determining a student’s response to a question (sketched after this list).
  • The assumption is not innocuous. Its implications are strongly at variance with everyday notions of ability and achievement.
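
The first point can be seen from a standard identification argument, sketched here with illustrative notation. Let F denote the logistic response function above and let g be any strictly increasing transformation of ability. Defining a rescaled ability \theta^{*} = g(\theta) and replacing F(\theta - b) with F(g^{-1}(\theta^{*}) - b) reproduces exactly the same response probabilities:

    P(X_{ij} = 1 \mid \theta^{*}_i, b_j) = F\bigl(g^{-1}(\theta^{*}_i) - b_j\bigr) = F(\theta_i - b_j)

The response data therefore determine the ordering of abilities but not the spacing between them; the interval claim rests on the chosen form of F.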

The first of these results is widely appreciated but the second is not. If we replace the assumption in question with an alternative more consistent with ordinary usage of the term “ability”, we find that IRT scales are compressed at the upper end of the ability distribution: gains of high-performing students are understated relative to those of students who start at lower points on the scale.
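
A small numerical illustration of the mechanism (the exponential rescaling below is purely hypothetical, chosen only to make the compression visible; it is not the alternative assumption developed in the paper):

    import math

    # A hypothetical strictly increasing rescaling of the ability scale.
    # exp() stretches the upper end, so the original theta scale is
    # compressed at the top relative to the rescaled one.
    def rescale(theta):
        return math.exp(theta)

    # Two students make the same 0.5-point gain on the theta scale,
    # one starting low, one starting high.
    gain = 0.5
    low_gain = rescale(0.0 + gain) - rescale(0.0)    # ~0.65
    high_gain = rescale(2.0 + gain) - rescale(2.0)   # ~4.79

    print(f"low starter's rescaled gain:  {low_gain:.2f}")
    print(f"high starter's rescaled gain: {high_gain:.2f}")

Equal gains on the theta scale correspond to very different gains on the rescaled scale; if the rescaled metric is the one that tracks learning, the theta metric understates growth at the top.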

Because teachers are assigned classes with differing mixes of ability (and because even teachers with the same mix of students are not equally effective with all types), value-added measures of teacher effectiveness are sensitive to these scaling assumptions. The study demonstrates this sensitivity first with simulated data and then with a simple value-added model for teachers in a large Southern district. It concludes with an evaluation of ad hoc normalizations some researchers have adopted to address this problem (e.g., grouping students by prior achievement and measuring individual gains relative to group mean gains).
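
The flavor of such a demonstration can be conveyed in a short sketch. This is not the author’s simulation: the data-generating process, the exponential rescaling, and the decile normalization below are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    N_TEACHERS, CLASS_SIZE = 50, 25

    # Teachers differ in the ability mix of their classes...
    class_means = rng.uniform(-1.0, 1.0, N_TEACHERS)
    # ...and in true effectiveness, modeled as a uniform shift on theta.
    effects = rng.normal(0.0, 0.2, N_TEACHERS)

    pre = class_means[:, None] + rng.normal(0.0, 1.0, (N_TEACHERS, CLASS_SIZE))
    post = pre + 0.5 + effects[:, None] + rng.normal(0.0, 0.3, pre.shape)

    def value_added(pre, post):
        # Simple value-added measure: mean student gain per teacher.
        return (post - pre).mean(axis=1)

    # Teacher rankings on the original theta scale...
    rank_theta = np.argsort(np.argsort(-value_added(pre, post)))
    # ...versus rankings after a hypothetical monotone rescaling.
    rank_resc = np.argsort(np.argsort(-value_added(np.exp(pre), np.exp(post))))
    print("rank correlation across scalings:",
          round(np.corrcoef(rank_theta, rank_resc)[0, 1], 2))

    # One ad hoc normalization of the kind evaluated in the paper:
    # group students by prior-achievement decile and measure each
    # student's gain relative to the mean gain within that decile.
    cuts = np.quantile(pre, np.linspace(0.1, 0.9, 9))
    decile = np.digitize(pre, cuts)
    gains = post - pre
    decile_mean = np.array([gains[decile == d].mean() for d in range(10)])
    va_norm = (gains - decile_mean[decile]).mean(axis=1)
    print("corr(normalized VA, true effects):",
          round(np.corrcoef(va_norm, effects)[0, 1], 2))

Because classes differ in their ability mix, the monotone rescaling can reshuffle teacher rankings even though it preserves every student ordering. The decile normalization attempts to sidestep the issue by comparing students only to peers with similar starting points; the paper evaluates how well such fixes work.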

Publications

“Test Scaling and Value-Added Measurement” by Dale Ballou (July 2008).