The Bias Inherent in Principal Observation

In the ongoing dispute over teacher evaluation, most of the furor has centered on value-added evaluations, those based on student test scores. Researchers have identified many problems with these measures, such as teachers being evaluated on entire schools’ test scores or on the scores of students they have never taught. Because of this research, good work has begun to correct these errors.

But what about in-person observation by principals? Observations have been a cornerstone of teacher evaluation for years, and assessment by principals has gone largely unquestioned. A new report from the Brown Center on Education Policy at the Brookings Institution, “Evaluating Teachers with Classroom Observations: Lessons Learned in Four Districts,” however, has uncovered a bias in principal observations. Observers tend to give higher marks to teachers with higher-performing students and lower marks to teachers whose students are already lower performing.

“It’s very worrisome. It’s a huge bias,” said Grover J. “Russ” Whitehurst, the director of the Brown Center. “The criticism about value-added is certainly something we need to attend to, but a lot of work has helped reduce or eliminate that bias. None of that’s being done for observation scores.”

Key findings and resulting recommendations include:

Under current teacher evaluation systems, it is hard for a teacher who doesn’t have top students to get a top rating. Teachers with students with higher incoming achievement levels receive classroom observation scores that are higher on average than those received by teachers whose incoming students are at lower achievement levels, and districts do not have processes in place to address this bias. Adjusting teacher observation scores based on student demographics is a straightforward fix to this problem. Such an adjustment for the makeup of the class is already factored into teachers’ value-added scores; it should be factored into classroom observation scores as well.
The reliability of both value-added measures and demographic-adjusted teacher evaluation scores is dependent on sample size, such that these measures will be less reliable and valid when calculated in small districts than in large districts. Thus, states should provide prediction weights based on statewide data for individual districts to use when calculating teacher evaluation scores.
Observations conducted by outside observers are more valid than observations conducted by school administrators. At least one observation of a teacher each year should be conducted by a trained observer from outside the teacher’s school who does not have substantial prior knowledge of the teacher being observed.
The inclusion of a school value-added component in teachers’ evaluation scores negatively impacts good teachers in bad schools and positively impacts bad teachers in good schools. This measure should be eliminated or reduced to a low weight in teacher evaluation systems.
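
To make the first recommendation concrete, here is a minimal sketch of what adjusting observation scores for the makeup of a class could look like. It is an illustration only, not the method used in the Brookings report: it assumes a single covariate (the class's average incoming achievement, in standardized units) and a simple linear fit, whereas the report refers more broadly to student demographics. The function name `adjust_observation_scores` and all of the numbers are hypothetical.

```python
from statistics import mean


def adjust_observation_scores(obs_scores, prior_achievement):
    """Remove the linear association between observation scores and
    incoming student achievement (a simple residual-style adjustment).

    Illustrative assumption: one covariate and an ordinary least-squares line.
    """
    if len(obs_scores) != len(prior_achievement) or len(obs_scores) < 2:
        raise ValueError("need two equal-length lists with at least two teachers")

    x_bar = mean(prior_achievement)
    y_bar = mean(obs_scores)

    # Least-squares slope of observation score on incoming achievement.
    sxx = sum((x - x_bar) ** 2 for x in prior_achievement)
    if sxx == 0:
        # Every class has the same incoming achievement: nothing to adjust.
        return list(obs_scores)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(prior_achievement, obs_scores))
    slope = sxy / sxx

    # Subtract the part of each score predicted by incoming achievement.
    # Centering on x_bar keeps the average adjusted score equal to y_bar.
    return [y - slope * (x - x_bar)
            for x, y in zip(prior_achievement, obs_scores)]


# Hypothetical data for five teachers (not taken from the report).
prior = [-1.0, -0.5, 0.0, 0.5, 1.0]   # class mean incoming achievement
observed = [2.4, 2.6, 3.0, 3.1, 3.4]  # raw observation scores

adjusted = adjust_observation_scores(observed, prior)
print([round(score, 2) for score in adjusted])
# -> [2.9, 2.85, 3.0, 2.85, 2.9]
```

In this made-up example, the raw scores rise steadily with incoming achievement, while the adjusted scores are nearly flat, so a teacher with lower-achieving students is no longer penalized for the class assignment alone. Consistent with the report's point about sample size, a slope estimated from only a handful of teachers would be unreliable in practice.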

For more information, please visit:

http://www.edweek.org/ew/articles/2014/05/13/32observe.h33.html

and

http://www.brookings.edu/research/reports/2014/05/13-teacher-evaluation-whitehurst-chingos