August 30, 2004

An F for E-rater

Why do we want K-12 students to learn to write? Why do we test them more and more to find out if they can?

Perhaps it’s just a simple matter of minimal adult competency in a society where writing is a crucial part of adult life. We want people to be able to read and write memos, emails, business letters. We don’t care much about the content of what most people write; we just want the bare minimum mastery of grammar, spelling and basic sentence construction to ensure that all adults who have finished high school are at least at the ground floor when it comes to writing and reading. If that’s all we want, then E-rater, an expert AI system that can grade essays for standardized tests, is just fine.

Grading for grammar, basic construction and spelling, after all, is one of the most mind-numbing exercises a teacher can face. In one sense, it’s easy; in another, it’s painfully difficult. Anyone who catches my grammatical errors in these blog entries (there are usually about two per entry, mostly agreement errors resulting from hurried cut-and-paste edits) can see how little I enjoy doing it myself. If I were asked to evaluate a pile of 100 essays on those measures alone, I’d gladly turn to an automated system to do it for me and quickly double-check its work later.

Is that really all we want for our children? Is that really the major reason we hammer kids with five zillion standardized tests before they finish high school? If we believe that writing is the primary way an educated democratic citizenry expresses its views, if we think the real issue isn’t writing but writing persuasively, if we value the ability of adults to communicate effectively with each other through writing rather than merely meeting a minimal standard of literacy, then E-rater or anything like it is a complete disaster in the making.

In this respect, the silly mistakes that E-rater or anything like it makes, as described in the Inquirer article, are relatively trivial, though hardly insignificant. It’s been known for a while that automated essay correction (including MS Word’s grammar checker) runs into serious trouble when it comes up against unusual or innovative writing. There have already been cases of K-12 students getting low automated assessments on essays that any human reader would recognize as unusually skilled or innovative.

The flip side of that, as the article notes, is that the systems are pretty easy to spoof or fool if you know how they work. Since standardized testing is already largely a matter of testing how well a student knows how to take a test rather than testing for real and useful competencies, this will only aggravate the problem. If you don’t think students will figure out how to spoof automated essay graders, then you’ve never played a multiplayer computer game. If there’s an algorithm with weak knees, count on people smacking it across the digital kneecaps with a crowbar.
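To make that concrete, here is a deliberately crude sketch of the kind of surface-feature scoring that this gaming exploits. To be clear, nothing below is E-rater’s actual algorithm; the features and weights are invented for illustration. The point is only that once a scorer rewards things like length, sentence size and vocabulary spread, a student who knows that can feed it exactly those things and nothing else.

import re

def toy_score(essay):
    """Score an essay 0-6 from surface features only (invented weights)."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    if not words or not sentences:
        return 0.0
    word_count = len(words)
    avg_sentence_len = word_count / len(sentences)
    vocab_variety = len({w.lower() for w in words}) / word_count
    # Reward sheer length, "mature" sentence length, and varied vocabulary.
    score = (3.0 * min(word_count / 300.0, 1.0)
             + 1.5 * min(avg_sentence_len / 20.0, 1.0)
             + 1.5 * vocab_variety)
    return round(min(score, 6.0), 2)

# A plain, coherent short answer versus padded, big-word filler.
honest = ("Testing writing by machine misses the point. "
          "Persuasion depends on a reader, and readers change.")
padded = " ".join(
    "Furthermore, the multifaceted ramifications of the aforementioned paradigm "
    "necessitate comprehensive elucidation notwithstanding heterogeneous "
    "epistemological considerations." for _ in range(30))

print("honest:", toy_score(honest))  # short and plain: scores low
print("padded:", toy_score(padded))  # long, inflated, says nothing: scores higher

More features and subtler weights change the numbers, not the structural problem: a fixed, knowable scoring function invites exactly this kind of play.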

The real problem is that an automated system can never judge what is persuasive, at least not until such time as AIs are sentient. Talking optimistically about Deep Blue in this context is silly. Deep Blue still doesn’t understand the game of chess—it just so happens that chess, being what it is, is a game that can be mastered through brute force calculation. It is not a matter of persuasive writing simply being a much more complicated instance of the same thing. It is not and can never be: persuasion is an ethic, a philosophy, a choice, as well as a psychological affect that is by necessity a moving target—the rhetorical turns and argumentative structure that persuade a particular reading audience today may not be what persuades them tomorrow. You have to believe in the value, the meaning, the utility and the ethical centrality of reasoned persuasion and communicative action in order to value it in a student’s writing. There is no algorithm for that, not yet and maybe not ever.