A few weeks ago, Audrey Watters wrote a great piece on her concerns about robo-grading of essays. (I tend to take a lot of inspiration from the things that annoy Audrey, in part because they usually annoy me too.) Here’s the crux of her argument:
According to Steve Kolowich’s Inside Higher Ed story, [educational researcher Mark] Shermis “acknowledges that [Automated Essay Scoring] software has not yet been able to replicate human intuition when it comes to identifying creativity. But while fostering original, nuanced expression is a good goal for a creative writing instructor, many instructors might settle for an easier way to make sure their students know how to write direct, effective sentences and paragraphs. ‘If you go to a business school or an engineering school, they’re not looking for creative writers,’ Shermis says. ‘They’re looking for people who can communicate ideas. And that’s what the technology is best at’ evaluating.”
Why are nuance and originality just the purview of the creative writing department? Why are those things seen here as indirect or ineffective? Why do we think creativity is opposed to communication? Is writing then just regurgitation?
What sorts of essays gain high marks among the SAT graders – human now or robot in the future? Are these the sorts of essays that students will be expected to write in college? Is this the sort of writing that a citizen/worker/grown-up will be expected to produce? Or, for the sake of speed and cost effectiveness, in Vander Ark’s formulation, are we promoting one mode of writing for standardized assessments at the K–12 level, only to realize when students get to college and to the job market that, alas, they still don’t know how to write?
How can we get students to write more? How can we help them find their voice and hone their craft? How do we create authentic writing assignments and experiences – ones that appeal to real issues and real discourse communities, not just to robot graders? How do we encourage students to find something to say and to write that something well? Is that by telling them that their work will be assessed by an automaton?
How do we support the instructors who have to read student papers and offer them thinking and writing guidance? When we talk about saving time and money here, whose bottom line are we really looking out for? Who’s really interested in this robot grader technology? And why? [Emphasis added.]
This is a classic case of a market gone awry. Machine learning is sold as an “efficiency” tool because there is money in squeezing cost out of education. In and of itself, there’s nothing wrong with wanting education to be cost-effective. David Wiley’s formulation of “standard deviations per dollar” has both a numerator and a denominator, and you can improve the ratio by raising the numerator just as well as by shrinking the denominator. The problem with obsessing over the denominator is that you start forgetting that “cost-effective” has to be effective. If you want to know what the ongoing industrialization of education looks like in the post-industrial world, robo-grading is it. We are reducing evaluation to the lowest common denominator, where the denomination is in dollars.
But it doesn’t have to be that way. What if we looked at machine learning (the technology that makes robo-grading possible) from the perspective of trying to raise the numerator, i.e., effectiveness, while keeping cost the same? How could the technology be used as a force multiplier for good teachers, helping them to focus on what they do best in roughly the same way that flipping the classroom is supposed to do? If the goal is teaching better rather than just teaching cheaper, then what is machine learning good for?
One of the links in Audrey’s post was to an excellent three-part series on the topic by Justin Reich at Education Week. I found one passage in part 3 particularly thought-provoking. Reich suggests that he would like to train a robo-grader to give rubric-based feedback on students’ short-answer responses as a first draft, so that the second draft they submit for human review by their teacher would be further along and ready for more nuanced comments:
Before I evaluate the essays, I’m going to craft six messages that I anticipate having to use to give feedback. One might be “This paragraph starts with a fact. In short expository writing, it’s often more effective to start with your argument, and then support that argument with evidence.” Another might be, “You make a clear argument here, but you need to support your assertions with evidence from the Balfour Declaration and your knowledge of the period.” A third might be, “It is not clear what the argument of this paragraph is. Re-read the paragraph, and try to craft a single sentence that summarizes the key point you are trying to convey.”
I have students submit their essays to the Lightside add-in for Moodle (this doesn’t exist yet, but is technically very feasible). Lightside is an open-source, free, automated essay scoring tool. When I evaluate student essays, I give them their 1-6 grade, check any of the six relevant boxes for the pre-scripted feedback, and write any additional comments that I’d like to make. In year 1, this feedback is all that students get.
Fast forward to year 2. Students do the same assignment (my curriculum evolves from year to year, but good stuff is retained). They submit the assignment to the Lightside add-in for Moodle, but this year, something very different happens. Lightside uses my feedback from last year to provide immediate feedback to this year’s students. Upon receiving a student submission, Lightside instantly sends a message saying something like “Essays similar to this one earned a 4/6 on this assignment. Essays similar to this one also received the following feedback: ‘You make a clear argument here, but you need to support your assertions with evidence from the Balfour declaration and your knowledge of the period.’ Please review your submission, and see if this feedback helps you improve it.” Instead of waiting a minimum of 3 days for feedback from me, students instantly get advice they can use to sharpen their writing.
Not only do the students receive instant feedback on their submissions, but as an instructor, I receive a report that details the overall performance of the students. Perhaps my report indicates that 51 out of 80 students received the message: “This paragraph starts with a fact. In short expository writing, it’s often more effective to start with your argument, and then support that argument with evidence.”
For many students, they will have some sense of how they might respond to that feedback, but many won’t know what to do. So in class the following day, we do two things. First, I have a short mini-lesson on writing topic sentences for paragraphs, where I offer some general principles and workshop a couple examples on the board. I video record this mini-lesson, so that in year 3 when students get this feedback, the Lightside add-in will link to this mini-lesson when it gives the related feedback. Then, I give students 10 minutes to peer edit each other’s topic sentences. Finally, I tell them all to revise their paragraphs and resubmit them for homework.
Now, when those final submissions come in, I still grade all of them for 3 minutes each, 240 minutes total, but that time is spent totally differently. Students have used my algorithmically-generated feedback to improve their pieces beyond the achievements of the previous years’ students. Having used technology to coach students algorithmically, I now use my specialized experience as a teacher to continue pushing students to the next level. This is exactly how I have my students use spellcheck, grammar check, plagiarism check, and other algorithmic tools now. It’s worth noting that many educators decried the creation of spellcheck and grammar check as degrading student writing, but nearly every professional writer depends upon these two tools and most educators expect students to use these algorithmic writing coaches in their practice, since they allow students to focus on more cognitively demanding editorial routines.
Now that is interesting. But is it feasible? I called up Elijah Mayfield, the primary developer of LightSIDE at Carnegie Mellon University, to ask him about that as well as about some other possible applications.
As I suspected, the sample from a single teacher is too small; one teacher would likely have to hand-grade for a very long time to gather enough data to train the machine. But what if you could couple the tool with a social platform for scoring and norming? If you could identify a handful of teachers who have similar student populations and grading styles, then you could gather enough scoring data in an acceptably short period of time. You use crowd-sourcing to train the machine. You use machine learning to score basic, first-draft kinds of concerns. And you use the time freed up from human scoring of those first-draft concerns to focus on the questions of craft and nuance that require good teachers.
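To make that workflow concrete, here is a minimal sketch of how it might work under the hood, assuming a pool of essays that several teachers have already scored and tagged with pre-scripted feedback codes. It uses scikit-learn as a stand-in rather than LightSIDE’s actual tooling, and the feedback codes, sample essays, and function names are all hypothetical.

```python
# Minimal sketch: train a classifier on rubric-coded essays pooled from several
# teachers, then suggest pre-scripted feedback for a new first draft.
# scikit-learn stands in for LightSIDE here; the codes and data are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Pre-scripted feedback messages, keyed by short codes the teachers check off.
FEEDBACK = {
    "starts_with_fact": "This paragraph starts with a fact. Try starting with your argument.",
    "needs_evidence": "You make a clear argument, but support it with evidence from the source.",
    "no_clear_argument": "It is not clear what the argument of this paragraph is.",
}

# Pooled training data: (essay text, set of feedback codes assigned by a teacher).
# In practice this would be hundreds of graded essays from several classrooms.
training_data = [
    ("The Balfour Declaration was issued in 1917. It said ...", {"starts_with_fact"}),
    ("Britain's promise was contradictory because ...", {"needs_evidence"}),
    ("There were many events in this period and they mattered.", {"no_clear_argument"}),
    # ... many more labeled essays ...
]

texts, label_sets = zip(*training_data)

binarizer = MultiLabelBinarizer()
labels = binarizer.fit_transform(label_sets)

# TF-IDF features feeding one binary classifier per feedback code.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, labels)


def suggest_feedback(draft: str) -> list[str]:
    """Return the pre-scripted messages the model predicts apply to a draft."""
    predicted = model.predict([draft])[0]
    codes = binarizer.inverse_transform(predicted.reshape(1, -1))[0]
    return [FEEDBACK[code] for code in codes]


print(suggest_feedback("The declaration was signed in 1917 and then ..."))
```

The point of the sketch is the division of labor: the model only ever suggests feedback that a teacher has already written, and anything it cannot confidently match is left for the teacher’s own comments on the next draft.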
Better Class Discussions
Some of the most interesting research that Elijah and his colleagues are doing regarding machine learning and education goes well beyond robo-grading. For example, they have been studying whether machine-detectable conversation styles in student group work can impact learning outcomes. Starting from a theoretical framework called systemic functional linguistics, they looked for linguistic markers of students’ “authoritativeness” in group work, which might be thought of as expressions of confidence in one’s knowledge. As they put it, “[A]n authoritative presentation of knowledge is one that is presented without seeking external validation for the knowledge.” Note that they explicitly distinguish “authoritativeness,” which is a social stance you take in conversation, from “self-efficacy,” which is your actual internal level of confidence. A person can take an authoritative stance in a conversation even though they are not actually self-confident, or they can choose to lean back and defer to other group members even though they are confident. Interestingly, the researchers believe that they can detect and measure authoritativeness and self-efficacy separately.
Anyway, they went into their research with a couple of hypotheses. First, students who adopt authoritative conversation styles in group work will tend to learn more (as shown by differences between pre- and post-test scores) than students who don’t. Second, the conversation style of one participant can be influenced by the styles of the others. Without going into details (which you can read for yourself here), they found support for both hypotheses. In a later paper, they showed that they could train LightSIDE to detect authoritativeness in student discussion. This is initial research, but if it holds up across broader testing, then students could be coached both on how to conduct more effective class discussions and on how to elicit more productive conversational styles from their partners. I can imagine a gamification system in which students get points and badges both for taking more authoritative positions in discussion themselves and for eliciting more authoritative positions from their classmates. I can also imagine a system that provides machine scoring of certain characteristics of students’ class participation and links to examples for teachers to evaluate for themselves. Once again, the idea would be to have the machine evaluate and coach students on the things that machines are good at evaluating, freeing up the humans’ time for more nuanced work. It is a kind of classroom flipping.
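To give a flavor of what “machine-detectable” means here, the sketch below is a deliberately crude heuristic, not the researchers’ actual method: it counts a few invented surface markers of hedging versus assertion in discussion turns and labels each turn’s stance. The real work trains models over systemic-functional-linguistic features, but even a toy version shows the kind of signal a classroom tool could surface for a teacher.

```python
# Illustrative sketch only: a crude heuristic for flagging discussion turns that
# seek external validation versus those presented authoritatively. The marker
# lists and scoring rule are invented for illustration.

import re

# Phrases that often signal a bid for validation or a hedge (hypothetical list).
VALIDATION_SEEKING = [r"\bright\?", r"\bi guess\b", r"\bmaybe\b",
                      r"\bi'?m not sure\b", r"\bdo you think\b"]
# Phrases that often accompany an authoritative stance (hypothetical list).
AUTHORITATIVE = [r"\bthe evidence shows\b", r"\bthis means\b",
                 r"\btherefore\b", r"\bclearly\b"]


def stance_score(turn: str) -> int:
    """Positive = leans authoritative, negative = leans validation-seeking."""
    text = turn.lower()
    plus = sum(len(re.findall(p, text)) for p in AUTHORITATIVE)
    minus = sum(len(re.findall(p, text)) for p in VALIDATION_SEEKING)
    return plus - minus


transcript = [
    ("Ana", "The evidence shows the policy failed, therefore the vote shifted."),
    ("Ben", "Maybe it was the economy? I'm not sure, do you think that's right?"),
]

for speaker, turn in transcript:
    score = stance_score(turn)
    label = ("authoritative" if score > 0
             else "validation-seeking" if score < 0 else "neutral")
    print(f"{speaker}: {label} (score {score})")
```

A classroom tool built on a properly trained model could attach labels like these to a discussion log and hand the borderline cases to the teacher, in the same spirit as the essay-feedback example above.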
Hacking the Markets
So far, the educational markets have rewarded applications of the technology that enhance efficiency, which is one major reason why we haven’t seen much in the way of the kinds of applications I just described in commercial products. I think the market dynamics are shifting, though, for a couple of reasons. First, federal and state money is starting to be tied to demonstrations of effectiveness. If public colleges and universities (and K-12 schools) want to survive, they increasingly have to do more than just operate more cheaply within shrinking budgets; the size of their budgets will be determined by their ability to demonstrate educational effectiveness. Second, as both the LMS and textbook markets commoditize, the key way for vendors to fight off commoditization is to demonstrate that their solutions are actually more educationally effective. For example, an LMS that gives teachers a real reason to use its discussion boards, because it includes tools proven to help students have more educationally effective discussions, will have a differentiator that adding another blogging tool or Google Apps widget no longer provides. Schools and teachers can help drive this by emphasizing effectiveness in their educational technology and e-textbook selection processes.
But I also think that we need broader, more application-focused collaboration among educators, their institutions, and their vendors. One of the interesting aspects of LightSIDE is that it is designed for non-experts in machine learning to use. We should be building up a collective body of knowledge about how to use machine learning and other innovations to improve education. So far, we have let the markets shape the potential uses of the technology rather than letting the technology’s potential to improve education shape the markets.
Jeroen Fransen says
Michael, this is such an interesting piece that I almost fell off my chair when reading it. What you’re describing is almost exactly what we’re doing at Joyrite. We combine teachers’ strengths with those of algorithms and those of experts and crowds. I often – half-jokingly – refer to it as creating the bionic teacher.
Wish we could have a one-on-one exchange some day.