Group vs. individual uses of data

Andrew Gelman notes that, on the subject of value-added assessments of teachers, "a skeptical consensus seems to have arisen..." How did we get here? Value-added assessments grew out of the push to measure educational success through standardized tests -- raw test scores alone won't do, because some teachers work in better schools or teach better-prepared students. The solution was to compare how much each teacher's students improve against how much other teachers' students improve. Wikipedia has a fairly good summary here.
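To give a sense of the mechanics, here is a minimal sketch of a regression-based value-added calculation in Python. It's an illustration only: the made-up data, the single prior-score predictor, and the simple least-squares fit are my assumptions, and real value-added models (including New York City's) are considerably more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3 hypothetical teachers, 20 students each.
# Each student has a prior-year score and a current-year score.
n_teachers, n_students = 3, 20
teacher_ids = np.repeat(np.arange(n_teachers), n_students)
prior = rng.normal(70, 10, size=n_teachers * n_students)

# True "teacher effects" (unknowable in practice) plus lots of student-level noise.
true_effect = np.array([-2.0, 0.0, 2.0])
current = 5 + 0.9 * prior + true_effect[teacher_ids] + rng.normal(0, 8, size=prior.size)

# Step 1: predict current scores from prior scores (simple least squares).
slope, intercept = np.polyfit(prior, current, 1)
predicted = intercept + slope * prior

# Step 2: a teacher's "value-added" is the average amount by which
# their students beat (or miss) the prediction.
residuals = current - predicted
value_added = [residuals[teacher_ids == t].mean() for t in range(n_teachers)]
print(value_added)
```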

Back in February, New York City released (over the opposition of teachers' unions) the value-added scores of some 18,000 teachers. Here's coverage from the Times on the release and reactions.

Gary Rubinstein, an education blogger, has done some analysis of the data contained in the reports and published five posts so far: part 1, part 2, part 3, part 4, and part 5. He writes:

For sure the 'reformers' have won a battle and have unfairly humiliated thousands of teachers who got inaccurate poor ratings. But I am optimistic that this will be looked at as one of the turning points in this fight. Up until now, independent researchers like me were unable to support all our claims about how crude a tool value-added metrics still are, though they have been around for nearly 20 years. But with the release of the data, I have been able to test many of my suspicions about value-added.

I suggest reading his analysis in full, or at least the first two parts.

For me, one early take-away from this -- building off comments from Gelman and others -- is that an assessment might be a useful tool for improving education quality overall while simultaneously being a very poor metric for individual performance. When you're looking at 18,000 teachers you may be able to learn what factors drive test score improvement on average, and use that information to improve policies for teacher education, recruitment, training, and retention. But that doesn't mean the same data can support high-stakes decisions about individual teachers.
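Here's a toy simulation of that distinction. Nothing in it comes from the NYC data -- the teacher count is the only real figure, and the noise levels are invented for illustration. The point is just that an average over thousands of noisy estimates can be very precise even when each individual estimate, and any ranking built on it, is not.

```python
import numpy as np

rng = np.random.default_rng(1)

n_teachers = 18_000
# Assume each teacher has a small, stable "true" effect on test-score growth,
# but any single year's estimate of it is dominated by noise (small classes,
# measurement error, student composition, and so on).
true_effect = rng.normal(0, 1, n_teachers)
year1 = true_effect + rng.normal(0, 3, n_teachers)  # noisy single-year estimate
year2 = true_effect + rng.normal(0, 3, n_teachers)

# Group-level use: the average across thousands of teachers is very precise.
print("mean of year-1 estimates:", year1.mean())                       # close to 0
print("std. error of that mean:", year1.std() / np.sqrt(n_teachers))   # tiny

# Individual-level use: the same numbers are unreliable per teacher.
print("year-to-year correlation:", np.corrcoef(year1, year2)[0, 1])    # roughly 0.1
top_year1 = np.argsort(year1)[-1800:]   # "top 10%" by year-1 score
top_year2 = np.argsort(year2)[-1800:]
overlap = len(np.intersect1d(top_year1, top_year2)) / 1800
print("fraction of year-1 top-10% teachers still top-10% in year 2:", overlap)
```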