Last week, edX made a splashy spectacle of an announcement about automated essay grading, leaving educators fuming. Let’s rethink their claims.
“Give Professors a break,” the New York Times suggested in its story on this joint announcement from edX, Harvard, and MIT. The breathless story weaves a tale of robo-professors taking over the grading process, leaving professors free to put their feet up and take a nap, and inviting universities, ever focused on the bottom line, to fire all the professors. If I had set out to write an article intentionally provoking fear, uncertainty, and doubt in the minds of teachers and writers, I don’t think I could have done any better than this piece.
Anyone who has seen their own work covered in science journalism knows that the popular claims bear only the foggiest resemblance to the academic results. It’s unclear to me whether the misunderstanding is due to edX intentionally overselling their product for publicity, or whether something got lost in translation while the story was being written. Whatever the cause, the story was cocksure and forceful about auto-scoring’s role in shaping the future of education.
I was a participant in last year’s ASAP competition, which served as a benchmark for the industry. The primary result of this, aside from convincing me to found LightSIDE Labs, is that I get email – a lot of email. I’ve been told that automated essay grading is both the salvation of education and the downfall of modern society. Naturally, I have strong opinions about that, based both on my experience developing the technology and competing in the contest, and on the conversations I’ve had since then.
Before we resign ourselves to burning the AI researchers at the stake, let’s step back for a minute and think about what the technology actually does. Below, I’ve tried to correct the most common fallacies I’ve seen, both in articles like the edX piece and in the incendiary commentary they provoke.
Myth #6: Automated essay grading is reading essays
Nothing will ever puzzle me like the way journalists require machine learning to behave like a human. When we talk about machine learning “reading” essays, we’re already on the losing side of an argument. If science journalists continue to conjure images of robots in coffee shops poring over a stack of papers, it will seem laughable, and rightly so.
To read an essay well, we’ve learned for our entire lives, you need to appreciate all of the subtleties of language. A good teacher reading through an essay will hear the author’s voice, look for a cadence or rhythm in the writing, and appreciate the poetry in good responses to even the most banal of essay prompts.
LightSIDE doesn’t read essays – it describes them. A machine learning system does pore over every text it receives, but it is doing what machines do best – compiling lists and tabulating them. Robotically and mechanically, it is pulling out every feature of a text that it can find: every word, every syntactic structure, and every phrase.
If I were to ask whether a computer can grade an essay, many readers would compulsively respond that of course it can’t. If I asked whether that same computer could compile a list of every word, phrase, and element of syntax that shows up in a text, I think many people would nod along happily, and few would be signing petitions denouncing the practice as immoral and impossible.
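To make that distinction concrete, here is a minimal sketch of the kind of tabulation I mean. This is illustrative Python, not LightSIDE’s actual code, and the feature names are invented; real systems pull out far richer features than this.

```python
# A toy illustration of "describing, not reading": reduce a text to counts of
# surface features. Real systems extract far richer features (syntax, phrase
# structure), but the spirit is the same mechanical tabulation.
from collections import Counter

def describe(text):
    words = text.lower().split()
    features = Counter(words)                    # every word
    features.update(zip(words, words[1:]))       # every two-word phrase
    features["_length_in_words_"] = len(words)   # a simple structural stand-in
    return features

print(describe("The ducks swam and the ducks quacked"))
```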
Myth #5: Automated grading is “grading” essays at all
Take a more blatantly obvious task. If I give you two pictures, one of a house and one of a duck, and asked you to find the duck, would you be able to tell the two apart?
Let’s be even more realistic. I give you two stacks of photographs. One is a stack of 1,000 pictures of the same duck, and one is a stack of 1,000 pictures of the same house. However, they’re not all good pictures. Some are zoomed out and fuzzy; others are zoomed in so close that you only get a picture of a feather or a door handle. Occasionally, you’ll just get a picture of grass, which might be either a front lawn or the ground the duck is standing on. Do you think that you could tell me, after poring over each stack of photographs, which one was a pile of ducks? Would you believe the process could be put through an assembly line and automated?
Automated grading isn’t doing any more than this. Each of the photographs in those stacks is a feature. After poring over hundreds or thousands of features, we’re asking machine learning to put an essay in a pile. To a computer, whether this is a pile of ducks and a pile of houses, or a pile of A essays and a pile of C essays, makes no difference. The computer is going to comb through hundreds of features, some of them helpful and some of them useless, and it’s going to put a label on a text. If it quacks like a duck, it will rightly be labeled a duck.
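If it helps to see what “putting texts in piles” looks like outside of metaphor, here is a small sketch using the scikit-learn library. The texts and labels are invented stand-ins for those stacks of photographs; the point is only that the computer learns to sort feature lists into piles, not that it understands ducks.

```python
# A toy pile-sorter: learn from labeled examples, then drop a new one in a pile.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "webbed feet a bill feathers standing on grass",             # stand-ins for
    "a feather webbed feet water and more grass",                # stacks of photos
    "crown molding a staircase a door handle grass out front",
    "a front door wallpaper a staircase and a lawn of grass",
]
train_labels = ["duck", "duck", "house", "house"]

sorter = make_pipeline(CountVectorizer(), MultinomialNB())
sorter.fit(train_texts, train_labels)

print(sorter.predict(["a feather and webbed feet in the grass"]))  # -> ['duck']
```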
Myth #4: Automated grading punishes creativity (any more than people do)
This is the assumption everyone makes about automated grading. Computers can’t feel and express; they can only robotically process data. This must inevitably lead to stamping out any hint of the humanity that human graders would reward, right?
Well, no. Luckily, this isn’t a claim that the edX team is making. However, by not addressing it head-on, they left themselves (and, by proxy, me, and everyone else who cares about the topic) open to this criticism, and haven’t done much to assuage people’s concerns. I’ll do them a favor and address it on their behalf.
An Extended Metaphor
Go back to our ducks and houses. As obvious as this task might be to a human, we need to remember, once again, that machines aren’t humans. Presented with this task with no further explanation, a computer wouldn’t just do poorly at it; it wouldn’t be able to do it at all. What is a duck? What is a house?
Machine learning starts at nothing – it needs to be built from the ground up, and the only way to learn is by being shown examples. Let’s say we start with a single example duck and its associated pile of photographs. There will be some pictures of webbed feet, an eye, perhaps a photograph of some grass. Next, a single example house; its photographs will have crown molding and a staircase, but they’ll also include some pictures of grass, and some might be so zoomed in that you can’t tell whether you’re looking at a feather or just some wallpaper.
Now, let’s find one hundred more ducks and give them the same glamour treatment. The same for one hundred houses. The machine learning algorithm can now start making generalizations. Somewhere in every duck’s pile it sees a webbed foot, but it never sees a webbed foot in any of the pictures of houses. On the other hand, many of the ducks are standing in grass, and there’s a lot of grass in most houses’ front lawns. It learns from these examples – label a set of photographs as a duck if there’s a webbed foot, but don’t bother learning a rule about grass, because grass is a bad clue for this problem.
This problem gets to be easy rather quickly. Let’s make it harder and now say that we’re trying to label something as either a house or an apartment. Again, every time we get an example, the machine learning model is given a large stack of photographs, but this time, it has to learn more subtle nuances. All of a sudden, grass is a pretty good indicator. Maybe 90% of the houses have a front lawn photographed at one point or another, but since most of the apartments are in urban locations or large complexes, only one out of every five has a lawn. While it’s not a perfect indicator, that feature suddenly gets new weight in this more specific problem.
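Here is a minimal sketch of what “weight” means in this story, again using the scikit-learn library and the invented duck-versus-house data from a couple of paragraphs back. Swap in house-versus-apartment examples and the same code would hand grass a real, if modest, weight instead of ignoring it.

```python
# A sketch of a linear model learning which clues matter. Each example is a
# bag of "photographs" (features); the labels are what a human assigned.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

examples = [
    ({"webbed_foot": 1, "eye": 1,           "grass": 1}, "duck"),
    ({"webbed_foot": 1, "feather": 1,       "grass": 1}, "duck"),
    ({"crown_molding": 1, "staircase": 1,   "grass": 1}, "house"),
    ({"door_handle": 1, "wallpaper": 1,     "grass": 1}, "house"),
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform([features for features, _ in examples])
y = [label for _, label in examples]
model = LogisticRegression().fit(X, y)

# Positive weights pull toward 'house', negative toward 'duck'. Because grass
# appears in every pile, its weight lands near zero; webbed_foot does the work,
# and we can read every one of these weights back out and inspect them.
for name, weight in zip(vec.get_feature_names_out(), model.coef_[0]):
    print(f"{name:15s} {weight:+.2f}")
```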
What does this have to do with creativity? Let’s say that we’ve trained our house vs. apartment machine learning system. However, sometimes there are weird cases. My apartment in Pittsburgh is the first floor of a duplex house. How is the machine learning algorithm supposed to know about that one specific new case?
Well, it doesn’t have to have seen this exact case before. Every feature that it sees, whether it’s crown molding or a picket fence, will have a lot of evidence backing it up from those training examples. Machine learning isn’t a magic wand, where a one-word incantation magically produces a result. Instead, all of the evidence will be weighed and a decision will be made. Sometimes it’ll get the label wrong, and sometimes even when it’s the “right” decision, there’ll be room for disagreement. But unlike with most humans, with a machine learning system we can point to exactly the features it used and see why it made that decision. That’s more than can be said about a lot of subjective labeling done by humans.
Back to Essay Grading
All of the same things that apply to ducks, houses, and apartments apply to essays that deserve an A, a B, or a C. If a machine grading system is being asked to label essays with those categories, then machine learning will start out with no notion of what that means. However, after many hundreds or thousands of essays are exhaustively examined for features, it’ll know what features are common in the writing that teachers graded in the A pile, in the B pile, and in the C pile.
When a special case arrives – an essay that doesn’t fit neatly into the A pile or the B pile – we’d have no problem admitting that a teacher has to make a judgment call by weighing multiple sources of evidence from the text itself. Machine learning learns to mimic this behavior from teachers. For every feature of a text – conceptually no different from poring over a stack of photographs of ducks – the model checks whether it has observed similar features in essays human graders have already scored, and if so, what grade the teacher gave. All of that evidence is weighed and a final grade is given. What matters, though, might not be the final grade; what matters is the text itself, and the characteristics that made it look like it deserved an A, or a C, or an F. What matters is that every piece of evidence the model uses is tied back to human grading behavior it has been shown.
Myth #3: Automated grading disproportionately rewards a big vocabulary
Every time I talk to a curious fan of automated scoring, I’m asked, “What are the features of good writing? What evidence ought to be used?” The question flows naturally, but the easy answers are thoughtless ones, because the question is built on a bad premise. Yes, there are going to be some features that hold in almost all good writing – connective vocabulary and transition words at the start of paragraphs, for instance. These are like webbed feet in photos of ducks – we know they’ll always be a good sign. Almost always, though, the weight of any one feature depends on the question being asked.
When I work with educators, I don’t just recommend that they collect several hundred essays. I ask that they collect several hundred essays, graded thoroughly by trained and reliable humans, for every single essay question they intend to assign. Each of these prompt-specific sets allows the machine learning algorithm to learn not just what makes “good writing” but what human graders were rewarding when they labeled answers as an A essay or a C essay in that specific, very targeted domain.
This means that we don’t need to learn a list of the most impressive-sounding words and call it good writing; instead, we simply need to let the machine learning algorithm observe what humans did when grading those hundreds of answers to a single prompt.
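As a sketch of that workflow, and nothing more, here is roughly what “one model per prompt, trained only on human-graded answers to that prompt” can look like. The function name and the toy data are hypothetical, not LightSIDE’s API, and a real training set would be the several hundred graded essays described above, not four.

```python
# Hypothetical per-prompt training workflow: each prompt gets its own model,
# fit only on human-graded answers to that prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_prompt_model(graded_essays):
    """graded_essays: list of (essay_text, human_grade) pairs for ONE prompt."""
    texts = [text for text, _ in graded_essays]
    grades = [grade for _, grade in graded_essays]
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    return model.fit(texts, grades)

# Toy stand-in for hundreds of hand-graded answers per prompt.
graded_by_prompt = {
    "bird_wing_prompt": [
        ("Bird wings and human arms are homologous structures ...", "A"),
        ("Wings help birds fly because of feathers ...",            "C"),
        ("Homologous bones reveal a shared ancestry ...",           "A"),
        ("Birds have the same bones as people ...",                 "C"),
    ],
}
models = {prompt: train_prompt_model(essays)
          for prompt, essays in graded_by_prompt.items()}
```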
Take, as an example, the word “homologous.” Is an essay better if it uses this word instead of the word “same”? In the general case, no; I dare anyone to collect a random sampling of 1,000 essays and show me a statistical pattern in which human graders were more likely to give an essay a higher grade for making that swap. It’s simply not how human teachers behave, it won’t show up in the statistics, and machine learning won’t learn that behavior.
On the other hand, let’s say a prompt asks a specific, targeted question about the wing structure of birds, and the essay is being written for a freshman-level college biology course. In this domain, if we were to collect 1,000 essays that have been graded by professors, a pattern is likely to emerge: the word “homologous” will occur more often in A papers than in C papers, based on the professors’ own grades. Students who use the word “homologous” in place of the word “same” have not single-handedly demonstrated, with their mastery of vocabulary, that they understand the field; it’s one piece of evidence in a larger picture, and it should be weighted accordingly. The same goes for features of syntax and phrasing, all of which a machine learning algorithm will use. These features will only be given weight in the machine’s decision-making to the extent that they match the behavior of human graders. By this specialized process of learning from very targeted datasets, machine learning can emulate human grading behavior.
However, this leads into the biggest problem with the edX story.
Myth #2: Automated grading only requires 100 training examples.
Machine learning is hard. Getting it right takes a lot more help at the start than you might think. I don’t contact individual teachers about using machine learning in their courses, and when a teacher contacts me, I start my reply by telling them they’re about to be disappointed.
The only time it benefits you to grade hundreds of examples by hand to train an automated scoring system is when you’re going to have to grade many hundreds more. Machine learning makes no sense in a creative writing context. It makes no sense in a seminar-style course with a handful of students working directly with teachers. However, machine learning has the opportunity to make massive inroads in large-scale learning: in lecture-hall courses where the same assignment goes out to 500 students at a time, for digital media producers who will be giving the same homework to students across the country and around the world, and so on.
It’s dangerous and irresponsible for edX to be claiming that 100 hand-graded examples is all that’s needed for high-performance machine learning. It’s wrong to claim that a single teacher in a classroom might be able to automate their curriculum with no outside help. That’s not only untrue; it will also lead to poor performance, and a bad first impression is going to turn off a lot of people to the entire field.
Myth #1: Automated grading gives professors a break
Look at what I’ve just described. Machine learning gives us a computer program that can be given an essay and, with fairly high confidence, make a solid guess at labeling the essay on a predefined scale. That label is based on its observation of hundreds of training examples that were hand-graded by humans, and you can point to specific, concrete features that it used for its decision, like seeing webbed feet in a picture and calling it a duck.
Let’s also say that you can get that level of educated estimation instantly – less than a second – and the cost is the same to an institution whether the system grades your essay once or continues to give a student feedback through ten drafts. How many drafts can a teacher read to help in revision and editing? I assure you, fewer than a tireless and always-available machine learning system.
We shouldn’t be thinking about this technology as replacing teachers. Instead, we should be thinking of all the places where students can use this information before it gets to the point of a final grade. How many teachers only assign essays on tests? How many students get no chance to write in earlier homework, because of how much time it would take to grade; how many are therefore confronted with something they don’t know how to do and haven’t practiced when it comes time to take an exam that matters?
Machine learning is evidence-based assessment. It’s not just producing a label of A, B, or F on an essay; it’s making a refined statistical estimation for every single feature that it pulls out of those texts. If this technology is to be used, then it shouldn’t be treated as a monolithic source of all knowledge; it should be forced to defend its decisions by making its assessment process transparent and informative to students. This technology isn’t replacing teachers; it’s enabling them to give students help, practice, and experience with writing that the education field has never seen before and, without this technology, will never see.
Wrapping Up
“Can machine learning grade essays?” is a bad question. We know, statistically, that the algorithms we’ve trained work just as well as teachers at churning out a score on a 5-point scale. We know that occasionally they’ll make mistakes; more often than not, though, what the algorithms learn to do is reproduce the already questionable behavior of humans. If we’re relying on machine learning solely to automate the process of grading, to make it faster and cheaper and enable access, then sure. We can do that.
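For readers who want to know what “works just as well as teachers” means operationally: the usual check, and the one used in the ASAP benchmark, is quadratically weighted kappa between the machine’s scores and a human rater’s. A sketch, with invented scores:

```python
# Agreement between one human rater and the trained model on a 5-point scale,
# measured with quadratically weighted kappa (the ASAP benchmark's statistic).
# The scores below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

human_scores   = [3, 4, 2, 5, 3, 1, 4, 2, 5, 3]
machine_scores = [3, 4, 3, 5, 3, 2, 4, 2, 4, 3]

kappa = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(f"quadratic weighted kappa: {kappa:.2f}")
# The benchmark question is whether this number matches the agreement computed
# the same way between two independent human graders.
```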
But think about this. Machine learning can assess students’ work instantly. The output of the system isn’t just a grade; it’s a comprehensive, statistical judgment of every single word, phrase, and sentence in a text. This isn’t an opaque judgment from an overworked TA; this is the result of specific analysis at a fine-grained level of detail that teachers with a red pen on a piece of paper would never be able to give. What if, instead of thinking about how this technology makes education cheaper, we think about how it can make education better? What if we lived in a world where students could get scaffolded, detailed feedback to every sentence that they write, as they’re writing it, and it doesn’t require any additional time from a teacher or a TA?
That’s the world that automated assessment is unlocking. edX made some aggressive claims about expanding accessibility because edX is an aggressive organization focused on expanding accessibility. To think that’s the only thing that this technology is capable of is a mistake. To write the technology off for the audacity of those claims is a mistake.
In my next few blog posts, I’ll be walking through more of how machine learning works, what it can be used for, and what it might look like in a real application. If you think there are specific things that ought to be elaborated on, say so! I’ll happily adjust what I write about to match the curiosities of the people reading.
Jonathan Rees says
Elijah,
Thank you for explaining these things. The more that everybody knows about exactly how these programs work, the better our discussions about them will be. However, are you sure that the people who are currently pushing them (or eventually purchasing them, for that matter) really understand how they should best be used? In other words, where’s the political dimension of this discussion?
Laura Gibbs says
Thank you for this explanation; I think many people do not realize the role of statistics here, and how robograding can detect patterns that work well in the aggregate over massive bodies of text (the more texts, the better). The phrase “statistical judgment” is very useful, highlighting the computer’s analytical strength (I cannot see things statistically), but also its weakness – because students really are not looking for statistical judgments on their work. They are looking for meaningful feedback from readers in response to what they have written. Statistical judgments can be very useful to administrators and to teachers; for students, they seem to me to have limited value at best.
So the crucial problem as I see it is the inability of the system to provide accurate and useful feedback on the level of a single piece of writing – non-statistical judgment as it were. And without accurate feedback, what good is this system really providing for the students who are doing the writing as part of a learning process? My goal is to help students – to help them with the content of their writing, writing mechanics, the writing process as a whole. Statistics are not really of any help to me or to my students. Indeed, there does not even appear to be a reliable grammar checker out there that can give my students accurate feedback to help them improve their writing mechanics (aside from spellcheck), much less to improve the substantive content of their writing. So while robograders seem to me useful for standardized grading and various types of assessment in the aggregate (like their use in the ACL test), I just don’t see how these programs will provide any useful contribution for the students who are seeking useful feedback in order to improve their writing. Perhaps your future articles will provide information about that.
And if you happen to know of a good grammar checker, I would indeed be VERY curious to learn what you recommend. Every semester I give my students a writing mechanics assessment, and every semester I test one or two grammar checkers with the same assessment. Without fail, the grammar checkers – including paid ones, not just free services – perform worse than even my worst students. Not a lot of value there for the writing instructor that I can see…
Laura Gibbs says
Plus, one more question to ask since you are entertaining questions for future articles!
If you want to use machine-grading, then why not just use machine-graded tests? Those can be accurately designed, and they can also be the basis for responsive testing environments (and tutoring environments!!!) that take advantage of the ability to personalize the test for each student while the test is being taken.
It seems to me FAR more useful to use machines and machine-graded testing in that way, with objective content, than to subject the messiness of human writing to machine grading. What is the point really? If there’s no time to do essays right, then let’s not do them at all. Let’s come up with really good machine-based tutorials and testing instead of trying to fit a square peg in a round hole, treating human writing as if it were a machine-gradable test.
Gio Wiederhold says
The article is useful and many of the points are well taken. But I think Elijah, by using image recognition as an example, confuses rather than elucidates the issue.
Image matching is based on 2-D (sometimes 3-D) presentation, often with color. Essays start out being linear.
And quality of the images is rarely an issue in matching.
Some familiar text, like the Gettysburg address, could have led to a more relevant discussion.
Gio
Michael Feldstein says
Laura, the link of “probably related posts” at the bottom of each e-Literate blog post is a statistical judgment that is hopefully useful to individuals. I don’t want to steal Elijah’s thunder for future posts, but in general, there are specific suggestions this kind of software could make that, when couched with the appropriate level of uncertainty, can suggest to students aspects of their writing that they could productively consider for revision.
Jonathan, I don’t think one needs to assume dark motives here (although there is no doubt that belief in the hype of this software is convenient for certain kinds of agendas). In general, there is a lot of hype and optimism around “machine learning” in general, and it is often not accompanied by real understanding of what the technology actually can and cannot do. I get into arguments all the time with honest, well-intentioned people, including teachers, who want to believe that there is a technological silver bullet.
Laura Gibbs says
Michael, what is an appropriate level of uncertainty when giving feedback to students? To my way of thinking, that is in fact the crucial problem. As a teacher, I can deal with statistical uncertainty (and hey, I’m not the one being graded), and administrators can deal with statistical uncertainty – how can I expect a student to take seriously feedback that has to be described as a ‘guess’…?
That’s the same problem I have with the grammar checkers. Even though they provide some useful feedback, the false positives are the kiss of death – you cannot give feedback to students that contains false positives (i.e. things are flagged as incorrect when they are actually fine). It’s one thing for professional writers to get such feedback (we can quickly weed out the wrong feedback). As feedback for learners, though, it is a disaster.
Michael Feldstein says
Most grammar checkers are not designed to be pedagogical tools, which I think is why they are problematic. Suppose you could give students a tool that would say something like the following to them:
Would the imperfect accuracy be as big a problem in that case? Would it be worse than the imperfect accuracy students get from peer review? And to the degree that it is worse, is the problem the accuracy itself or the level of authority that students assume the machine has?
Elijah Mayfield says
To Jonathan – I agree that the people purchasing the software don’t know what they’re buying. Most people that I’ve spoken to imagine that they’re getting a glorified version of the green squiggles in Microsoft Word, which I would disagree with wholeheartedly and which might be the subject of its own blog post in the future. As for the people selling the product, I think the researchers obviously know what they’re doing, though it’s not clear to me how close the connection is between R&D and marketing.
The obvious goal on my end is to get awareness of the technology out there to as many people as possible, to help this situation. That’s the motivation behind writing this piece, my consulting with education companies, the conversations I’ve been having with universities, and so on. I don’t think it’s necessarily a top priority for other companies to spend their time explaining machine learning (though, like Michael said, I don’t ascribe this to malice; I think it’s just priorities). I’d like to be different in my emphasis on clarity and open explanation of what’s going on. Beyond that, I’m not a policy person and I can’t make immediate guesses as to what the landscape looks like after that awareness gets out there.
Elijah Mayfield says
To Laura – One big thing to note is that LightSIDE is not a grammar checker, and I would never claim that it can recognize things like verb conjugation, misplaced commas, and so on. That’s an entirely different type of natural language processing that mostly relies on phrase structure and dependency parsing, and it’s not something I’d ever claim LightSIDE is capable of automating on behalf of teachers. Others have invested time into it, but I haven’t.
Instead, LightSIDE is good at making holistic judgments – either of an overall grade, or for a larger-level concept. An example I often give comes from biology, where an automated grader is asked to judge whether a student understands the concept that “traits are heritable across generations.” This isn’t something that gets at writing mechanics – it gets at understanding, and it requires a fairly extensive representation of an answer in order to judge.
It comes as a surprise to most people when I say that LightSIDE is actually much better at the latter than the former. For syntax checking, you need to be perfect every time – otherwise you’re giving bad advice. For labeling a higher-level concept, though, you need to look at a thousand redundant features of an essay. If 800 of them point towards one label – mastery of the concept, for instance – and 200 point in the other direction, then the automated system can still predict with high confidence that the student understands the concept. Fully one fifth of the evidence was pointed in precisely the wrong direction, but on balance, it’s still able to make the right call.
The specific ways in which these machine learning features can be packaged up as feedback will be the subject of later, detailed blog posts on the topic.
Elijah Mayfield says
To Gio – I disagree that the difference between 2D, 3D, and linear input is as meaningful as you make it out to be. The input to a classifier is still going to be a feature vector, whether we’re extracting formants or term n-grams, and the algorithms are often linear in any case. While there are algorithms like multilayer neural nets that get used in image recognition but not text classification which take advantage of that nonlinearity, that’s far beyond the level of this metaphor.
If we’re looking for a classification task, it’s hard to say what exactly we’d be classifying with an example like the Gettysburg address. Author age? Gender? Date of writing? At that point, I don’t know why I wouldn’t just write about essays as the example classification task in the first place. By switching over to images as an example I tried to make it easier to conceptualize; hopefully it’s a start.
Tony says
great observations. (link to Lightside Labs is broken)
Laura Gibbs says
Oh sure, Michael – but can you make it work? It’s harder – WAY HARDER – than you think. Find me an example of software that is really accurate at identifying the subject of an English sentence and the verb and I’ll be impressed. The reason grammar checkers are so incredibly bad is that the task at hand is far harder than most people realize.
Plus, subject-verb agreement errors are NOT common among native speakers (more so among non-native speakers, it’s true) – in fact, they are downright uncommon, because they are not really part of the writing system; they are a natural part of speech itself. If someone is a native speaker of English, they almost always get subject-verb correct when they speak, and likewise when they write. The real problems people have with their writing are things that happen ONLY in writing – things like sentence fragments and run-on sentences, along with punctuation problems and spelling problems (homonyms and other word pairs the spellchecker cannot fix). Since sentence structure problems, punctuation problems, and spelling problems are very much context-based and meaning-driven, they are hardest of all for computers to detect. Those problems are pervasive in my students’ writing – while several weeks can go by in which I will not see a single subject-verb agreement error.
I’m curious what you think about my other question for Elijah – why are we messing around with writing when the results seem almost certain to be a disaster? If we don’t have the money for instructors and human assessment, then shouldn’t we just be satisfied with building really excellent machine-graded tests and machine-based tutorials…? Why waste our students’ time with writing at all – especially if the writing assignment is just being used as a test of content mastery anyway…?
Laura Gibbs says
Elijah, that’s what I expected – you are really just looking at content mastery, right? So why not just do machine-graded tests? Why bother with wasting the students’ time writing something that can be demonstrated even more clearly by means of a standardized test that is precisely designed to assess that mastery?
Elijah Mayfield says
The reason is that, like it or not, writing takes place in every aspect of a person’s career. When I work with research collaborators or am screening employees, the first and almost only thing that I look for is competence at expressing ideas and communicating fluidly. If students spend their entire academic lives taking machine-graded multiple-choice tests, they’re going to graduate totally unequipped in perhaps the single most important career skill.
On the other hand, if we require teachers to provide feedback and help by scaffolding writing on every assignment, we’re going to be woefully disappointed. Teachers don’t have the time or energy to do that for every student, especially not in underprivileged areas (this is doubly true for international locations, where even the teachers might not be ready to demonstrate writing mastery in English). To me, it makes no sense to have tools like LightSIDE, which put together the results of decades of research in language technologies, and reserve them for measuring sentiment of tweets and mining financial data from SEC reports (both of which are highly lucrative modern applications of natural language processing).
I see an opportunity to change the way writing is taught, using automated assessment as the backbone of the revision, editing, and feedback cycle that works best for getting students to learn to write. It’d be a travesty not to take advantage of that opportunity.
Laura Gibbs says
Well then, Elijah, you are NOT talking about content mastery – you are talking about the quality of the writing and actual writing instruction. I agree with you that writing is a very important job skill and that is why I teach writing; somebody needs to! I work with approximately 100 students per semester, and it’s my full-time job. Making a difference in the lives of those students is rewarding for me, and it’s not all that expensive for the students either – my salary costs run about $300/student/semester (although my university charges them $1000 for the course – you’d have to confer with the university administration about where the rest of that money goes).
Anyway, I guess I will have to wait to see what kind of feedback your system will provide that actually helps students to improve their writing. I’m especially surprised you think you can do that if, as you admit, your system is not able to attend to writing mechanics, which is what most people who need to work on their writing actually need help with. You say your software is going to help people with their writing, but with no syntax checking? I’ll believe it when I see it.
Mark says
I can see where a program can come to recognize patterns. But I just commented on some student drafts for a book review assignment. The things I am concerned with: Does the student correctly identify the author’s thesis? Does the student choose meaningful quotations that clearly illustrate the author’s thesis? Does the review itself have a good thesis? Is the review truly an analysis or just a summary of the book? Can any machine learn to evaluate any of those things?
Elijah Mayfield says
Well, syntax definitely comes into play – as I’ll detail in future posts, LightSIDE makes heavy use of syntactic features when evaluating essays. What I don’t believe it can do is assure syntactic “correctness” – single out particular errors and correct them with 100% accuracy. Instead, with syntax as with vocabulary, LightSIDE will make a balanced assessment based on the features it sees, and the feedback that it returns will be a weighted sum of the evidence, rather than picking out particular verb conjugations, for instance.
Research into peer review and collaborative writing shows that without scaffolding, students are likely to give each other bad, unhelpful advice – swapping out simple vocabulary words for harder alternatives and so on. This doesn’t lead to learning and it doesn’t lead to knowledge transfer, either. Our goal with LightSIDE’s feedback isn’t to suggest particular vocabulary swaps and it isn’t to make spot corrections of low-level mechanics; it’s to highlight the strongest and weakest portions of a text and help students improve their own ability.
I’ll try and detail the ways in which I envision that happening in a followup post, eventually.
(Hat tip to my colleague and friend Samantha [http://www.samanthafinkelstein.com/] for guiding me through the learning sciences. My background is in machine learning and I’m only mildly well-versed in that body of research)
Laura Gibbs says
I will be very curious indeed to see the results and I thank you for being willing to share that publicly here. As a writing teacher who is interested in linguistics, technology and pedagogy, I welcome the chance to learn more about your work – and you’ll pardon my initial skepticism, I hope. I will certainly read your future posts with great interest.
Michael Feldstein says
Laura, my wife taught ESL sections of English Comp for years, so subject/verb agreement is something we talked a lot about and is an example that leaps to mind quickly for me personally. But I think you’re missing the point. Peer review gives imperfect feedback, but we use it all the time in composition classes. Why not ask, “If we don’t have the money for instructors and [expert] assessment, then shouldn’t we just be satisfied with building really excellent machine-graded tests and machine-based tutorials…?” There are a couple of answers to this question. One answer, which holds true for peer review but does not hold true for machine grading, is that peer review is good pedagogically for the reviewer as well as the reviewee. But another, equally valid answer is that peer feedback is often better than the alternative of no feedback, which is the alternative that many composition teachers are faced with in reality. And peer review can sometimes help students correct basic problems in their drafts before they get to the teacher, so that the teacher can focus her comments on those areas that really require expert feedback. Such a redistribution of certain kinds of feedback doesn’t undermine a teacher’s authority. Rather, it maximizes a teacher’s effectiveness by enabling the teacher to focus on the problems that are hardest for the student to learn to correct.
You seem to think that Elijah is suggesting teachers can or should be replaced with machine grading. But I don’t think he is suggesting any such thing, and I am certainly not suggesting any such thing. Rather, the question being asked is, “Given a realistic understanding of what this software does, what is it good for, pedagogically speaking?” That is a worthy question, I think.
Your response to my grammar checker question was “Oh sure, Michael – but can you make it work?” But I wasn’t suggesting that the software needs to be any more accurate than grammar checkers are today. Rather, I was suggesting that the problem with grammar checkers for students is that they provide simplistic and falsely authoritative prescriptions that can be wrong. If you keep the “can be wrong” part – don’t change it at all – but remove the “simplistic and authoritative prescription” part, I contend that you would have a useful tool for students that does not replace a good teacher but does provide a useful supplement, in a similar way that peer review can be a useful supplement to teacher feedback.
Laura Gibbs says
Michael, I ask my students to give peer review (it’s an essential part of my class), and I make sure to model that process for them, while also being fully aware that they are only going to be able to provide feedback about content – not about writing mechanics (with occasional exceptions, of course). So I’m all in favor of well-constructed peer review systems – I could not manage to teach my classes without it. Feedback from peers is highly motivating for the students, and students also learn a great deal from reading each other’s writing. I do the feedback side by side with my students based on the flow of writing that comes in to me, which is really a result of the students’ schedules (hectic as they are) – my feedback is continuous, peer feedback is continuous, and that has proved to be a very efficient process for me, esp. teaching online.
Meanwhile, I am waiting for someone to show me a real example of this “useful supplement” provided by the computer that is responding to natural human language use – I understand what you want it to be, but I would contend that natural human language use is so complex (complex for a computer to apprehend) that trying to give writing mechanics feedback on spontaneously generated student writing will lead only to confusion for the students. I will be glad to be proven wrong about that. So, please, prove me wrong!
As for myself, I am going to invest my time and energy in helping to develop useful tutorials for students – really good interactive, machine-graded tutorials to assist students in improving their basic writing mechanics. That’s what I most need help with as a teacher – not assessment, but instead guided practice for students to complete on their own. I’m really impressed at what the guys are doing with Essential Grammar and I will be working hard for them all summer – their project strikes me as the single most worthwhile application of machine intelligence to the teaching of writing I have ever seen, while this robograding, roboassessment, whatever we want to call it, strikes me as a total chimera.
But I will be glad to be proven otherwise! And believe me, I am the kind of person very ready to admit I am wrong – especially if, as in this case, my being wrong will be very good news for the students!
Laura Gibbs says
Whoops – that’s Empirical Grammar. I don’t think we can edit comments here can we? Alas. 🙂
Michael Feldstein says
Fair enough. Let’s see what Elijah can show us (and teach us) in future posts.
On the editing of comments, I’m afraid that’s a trade-off. I don’t require registration on the site for commenting, which means that the barrier to commenting is as low as possible. But it also means that the system doesn’t know who is the author of—and therefore entitled to edit—any given comment. Sorry about that.
Jeroen Fransen (@jeroenjeremy) says
Excellent to see this discussion out in the open. It might be worth it to point out that essay evaluation, and the degree to which you can automate it, depends also on the educational system (=the market). For example, in the US a holistic method of evaluation is accepted whereas in the Netherlands it is not.
I would love to get in contact with Elijah to see how we can combine the LightSide train of thought concerning grading with the hybrid approach we take at Joyrite (teacher + technology = $6M man/woman).
Gio Wiederhold says
Jeroen
Can you explain what you mean by “in The Netherlands holistic grading is not accepted”?
I went to school in Holland in the 50ties, and the teachers I had at the Grotius Lyceum in The Hague evaluated our work very thoroughly for sure.
Gio
Steve says
I sympathize with you, Elijah. I participated in both phases of ASAP and did pretty well. I intentionally imposed some pretty stringent constraints on my algorithms to highlight that the ASAP competitions were not necessarily going to produce algorithms that appropriately graded essays. I wrote this after the first phase:
http://blog.lexile.com/2012/07/some-notable-flaws-in-auto-essay-scoring-engines/
Ada Penske says
Did we read the same piece? You accuse the NY Times of sensational journalism. Really?
The story reports claims made by some reputable organizations (e.g. MIT, Harvard). But the piece also notes that there are skeptics, including several also at MIT. It presents links to both sides of the argument should the reader be interested in following up.
Is the article informative? Yes. I learned something about what EdX and others are trying to do.
Is the article balanced? Yes. It identifies skeptics. I learned therefore that there is a debate.
Are there factual errors? You haven’t pointed to any. What is it that the story got wrong factually?
I also think you are unfair to EdX but that will be discussed in a separate post.
Ada Penske says
Let’s look at each of these “Myths” one at a time. Myth #6 is a Red Herring.
“Nothing will ever puzzle me like the way journalists require machine learning to behave like a human. When we talk about machine learning “reading” essays, we’re already on the losing side of an argument.”
Can you point us to where in the NY Times story the journalist takes this position?
Ada Penske says
Myth #5 is also a Red Herring. “Automated grading is “grading” essays at all”. First of all, you must mean “Automated grading is (NOT) “grading” essays at all. You go on to explain that robograding is an exercise in classification. But who claimed otherwise? Neither the journalist nor EdX claimed otherwise. Who is guilty of this myth then?
Laura Gibbs says
Steve, thank you for the link to that very informative blog post (http://blog.lexile.com/2012/07/some-notable-flaws-in-auto-essay-scoring-engines/). I was especially struck by the paragraph I’ve quoted below. The concerns you raise here are very much the same concerns that I have about the use of robograding; I can understand the use of statistical description to come up with grades (I’m not a big fan, but I’m willing to concede it is a necessary evil in our standardized-test-obsessed educational system), but given that constructive feedback needs to respond to the actual content of a piece of writing (i.e. the meaning that the author is conveying, or seeking to convey), I just don’t see how machines will be able to provide meaningful feedback. I am awaiting Elijah’s future posts to see what kind of meaningful feedback will be offered (and feedback was also promised by the edX folks in the NYTimes article).
quote: “The problem with my AES engine: you could just write a bunch of random characters and eventually you would write enough to get a perfect score. Or you could write down a bunch of words that were related to the prompt that you thought other students would use and place them in whatever nonsensical order you wanted. You needn’t worry about capitalization, punctuation, or anything like that and you certainly don’t need to worry about whether what you were writing was factually correct. Unfortunately, other well-established AES engines fall prey to being easily “gameable” too, despite the fact that they are using much more refined natural language processing (NLP) techniques.”
Debbie Morrison says
Thank you Elijah for this in-depth post. Questions I have: how do students perceive machine grading? And how much research has been done on the impact on learning performance and motivation?
I wonder what the implications are (or will be) for student motivation and the quality of their effort and work. Students spend time on writing essays, some more than others, yet knowing that a real person will not be reading their essay could impact many processes. My teenagers have been exposed to automated grading periodically at their high school and they both strongly dislike it (despise it is a more fitting term). They discount its value completely. I predict that teenagers and young college students will not be receptive to this type of grading. Why should they spend hours researching, writing and re-writing an essay when they know no one (a real person) will even read it? Even more so in a MOOC that is not for credit – why on earth would you write an essay for an automated grader?
For large-scale classes, as you discuss in your post, peer grading would be a far more valuable exercise and learning experience for students than machine grading. Two studies I have read show that there is 20 to 25% grade inflation with peer grading, but the learning for both sides, the peer reviewer and the student reviewed, is far more meaningful in my opinion.
I am all for technological advancements, yet at some point are we not going too far, and when will that be? (A rhetorical question).
However, I do look forward to reading further and learning more about this method.
Thank you for the thought provoking post. Debbie
Ada Penske says
Myth #4, or the way it’s stated, is plain weird. “Automatic grading punishes creativity.” “This is the assumption everyone makes about automated grading.” Like who? Everyone? Really?
But then we switch gears and it turns out that it is not everyone after all. Edx is not guilty of this myth. But wait. They are guilty after all. Guilty of what? Of not having taken “head-on” the topic of creativity. Elijah, therefore, will take it “head-on” on their behalf.
What do we get? A rather convoluted explanation once again of some basics of machine learning and classification. How is the explanation tied back to creativity? It isn’t. We are left to wonder.
I noted above in an earlier comment that the criticism of EdX is unfair. Here we have a glaring example.
Suppose I create a detector that can distinguish between red wine and white wine and also some varietals (e.g. cabernet, merlot). But my detector can’t distinguish between inexpensive wines and expensive wines. Wouldn’t it be unfair for someone to criticize me for not having solved “head-on” a problem I was not trying to solve? And then wouldn’t it be doubly unfair when the criticism provides no insight whatsoever into tackling this other problem of inexpensive vs. expensive wines?
I don’t know what EdX had in mind but it seems eminently reasonable for them to approach this from the perspective of machine learning to classify “A”s, “B”s, etc because it is likely to be a bounded problem. It is also reasonable to pass on classifying “creative” vs “un-creative” papers because the term is so slippery.
What is EdX guilty of then? And what insight has Elijah provided about the “creativity” problem. Beats me.
Ada Penske says
Next we come to Myth #3. Up to this point there is no evidence whatsoever that EdX subscribes to or perpetuates any of these myths. The same is true of Myth #3. Who is guilty of Myth #3? I am sure there are people who don’t know how machine learning works. But there are clueless people all around. Why take EdX to task for the human condition?
Michael Feldstein says
Ada, I am going to have to ask you to watch your tone. If you take issue with the substance of a post, that’s fine. But I need you to do so respectfully. Also, if you feel the need to write such an extended critique that it takes you multiple comment posts without response from the author, then maybe you should think about posting on a blog of your own and linking. The comments thread is for dialog, and neither your tone nor your posting length is inviting dialog right now.
Ada Penske says
Elijah has made a series of assertions. I am challenging those assertions. There was nothing disrespectful about my tone.
What is the purpose of your blog? An opinion column? Or, do you and your authors expect to provide evidence for your claims? The same thing happened when I questioned some of your assertions in an earlier post. Instead of offering a rebuttal you told me to go away. This time I will. This will be my last post on your blog.
When you wade into technical areas and make questionable claims how do you expect someone to respond if they either disagree with you or seek clarification? You are allowed to take as much space as you need but when someone wants to offer a rebuttal they need to do so in sound bites? I have tried to be as concise as possible to make the point. Of course, it’s your blog. And it’s your prerogative to shut down discourse.
I encourage your readers to go back and re-read the NY Times article and compare it to Elijah’s characterization. I believe on the whole the NY Times piece is balanced and not anywhere near how Elijah characterizes it. But let your readers be the judge.
But the more important point is the utter mischaracterization of edX. It is plain wrong to suggest that edX subscribes to or is encouraging any of the six myths.
The only “myth” that has some basis in the NY Times piece is Myth #2 “Automated grading only requires 100 training examples.” “It’s dangerous and irresponsible for edX to be claiming that 100 hand-graded examples is all that’s needed for high-performance machine learning.”
A critical reader might pause and reflect for a moment before leveling this type of charge. The people working on edX are not stupid. Anant Agarwal, who made the claim, is not stupid. He was previously Director of CSAIL at MIT. Pause and give the benefit of the doubt. What might he have meant, assuming that the report is accurate? Agarwal is not stupid enough to believe that one can create an accurate machine learning algorithm from scratch based on a training sample of 100. Elijah simply assumes this.
I think we can come up with an interpretation that makes sense. I am not an expert on machine learning. But it seems plausible that what he meant was that the 100 essays are used for calibration purposes for a specific instructor. They are not used to bootstrap the entire algorithm from scratch. Isn’t that the kind of thing that happens when a specific user starts using handwriting recognition software? The recognition algorithm is not generated from scratch. There is software already that has done quite a bit of the heavy lifting. The sample of 100 is probably for calibration.
But who knows? Maybe the people at edX and Agarwal, a world-class computer scientist, are in fact incredibly naive and stupid about machine learning and how it works.
Laura Gibbs says
Ada’s comments prompted me to go back and re-read the NYTimes article – what is of real interest to me here is the FEEDBACK issue (not grading; for humans, grading is easy while feedback is hard, and the same is even more true for computers). I’ve copied the relevant claims about feedback from the NYTimes article below, and Elijah has also made his own claims about feedback here on this page. I am very curious to see what evidence we will get to see in future posts of computers that provide useful and reliable feedback to students, feedback that will help students to improve their writing and their critical engagement with the course material.
Also, I think Debbie’s points about motivation are very well taken… I personally find the idea of writing for a robograder completely demoralizing, although I assume – perhaps wrongly – that this machine-generated feedback will be complemented by a peer feedback system also. Peer feedback as I experienced it in a Coursera course ranged from okay to marginal to appalling (details here: http://courserafantasy.blogspot.com/2012/08/peer-feedback-good-bad-and-ugly.html) – but I am doubtful whether computer feedback will even be able to rank as okay.
My inclination as a teacher is to have the machines do the grading (if there must be grades) and to have the peers do the feedback, perhaps even commenting on whether they agree or not with the computerized assessment. Now that might actually be quite useful for everyone concerned…
Anyway, here are the quotes from the New York Times article re: feedback claims:
Anant Agarwal, an electrical engineer who is president of EdX, predicted that the instant-grading software would be a useful pedagogical tool, enabling students to take tests and write essays over and over and improve the quality of their answers. He said the technology would offer distinct advantages over the traditional classroom system, where students often wait days or weeks for grades. “There is a huge value in learning with instant feedback,” Dr. Agarwal said. “Students are telling us they learn much better with instant feedback.” […] It will also provide general feedback, like telling a student whether an answer was on topic or not. […] “It allows students to get immediate feedback on their work, so that learning turns into a game, with students naturally gravitating toward resubmitting the work until they get it right,” said Daphne Koller, a computer scientist and a founder of Coursera. […] “One of our focuses is to help kids learn how to think critically,” said Victor Vuchic, a program officer at the Hewlett Foundation.
Jeroen Fransen (@jeroenjeremy) says
@Gio You’ll be happy to know that essay evaluation in Holland is still very thorough. However, unlike some countries where there is a top down standardization concerning the writing (e.g. 5 paragraph essays) and grading process, in the Netherlands basically each teacher defines how to grade the written work and provide feedback on it.
What I actually meant to say in the previous comment is that currently one group of teachers in the Netherlands grades essays by counting the number of errors in the categories of Language, Structure, and Content and tabulating them into a final result. The other group, mostly experienced teachers, reads the essay and gives a grade based on their trained impression. This last group tends to give less detailed feedback to students and applies what I would call a holistic method.
The actual grade of an essay is certainly important for high stakes situations but what students need all year long is feedback that they can actually learn from (as Laura Gibbs also pointed out in her comments).
Myself, I’m focused on providing an online solution to support teachers so they can provide grades and feedback with less work while adding transparency. In order for the result to be of high quality we compromised on the silver-bullet factor and created a hybrid of teacher plus language technology and AI.
Michael Feldstein says
Ada, I did not ask you (or anyone else) not to argue with Elijah. I asked you to keep your tone civil and to keep your answers short and focused enough so that there can be a conversation here. These are, I think, fairly standard expectations for conversation, whether online or offline.
You want to challenge Elijah on whether the claims he seems to be attributing to the NYT and edX are rightly attributed to them. That is a fair question. Laura has taken up your question and similarly challenged the post without the cutting rhetoric. Debbie and others have also raised objections. I am happy to have these questions raised. I object not to your raising concerns but to the way that you are raising them, both here and in the other thread to which you refer.
I don’t think it is productive to take this thread further astray to give you a line-by-line analysis of your comments or get into a debate about what is civil and conducive to conversation and what is not. I am asking you to reflect on what you have written. If you think my request is unreasonable and decide not to comment here anymore, you are entitled to make that choice.
Michael Feldstein says
So far, I’m hearing a few strands of questions raised by Laura, Ada, and Debbie (and, indirectly, by Steve). I understand them to be roughly as follows:
Is this a reasonable summary?
Elijah Mayfield says
Steve – thanks for the link!
Debbie – There are mixed results in the literature, but most of all they point to a negative impression from students if they’re working purely alone, even if writing skill does go up. However, if automated technology is being used in a collaborative setting, scaffolding the interaction, we see almost the opposite effect – compared to a control it increases student satisfaction with the learning experience, and their own self-efficacy, even if the learning gains on top of that collaborative process are modest.
I don’t think there’s a silver bullet right now about the “right” way to use technology to support learning. I think there are a lot of potential avenues that people are going down, but one of the bigger bottlenecks is that the people designing the educational experience, who understand students and pedagogy, aren’t the same people as those who understand the technology, what it’s capable of, and what its limits are. My hope with LightSIDE is to bridge that gap and get the technology into the hands of people who could use it, get them to understand the tool that’s available to them, and improve education as a result.
I’m not sure that I have the background to write a long-form blog post like this one if the topic was specifically on student reactions in the classroom. While I’m able to talk about that, by far my stronger field of expertise is in explaining the technology and getting it in the hands of educators. Still, I’ll see what I can do.
Elijah Mayfield says
Ada – I’m not going to reply to your whole series of posts because it’s too many questions for the format of a comment thread on a blog post. However, your observations about framing are useful and I’ll certainly keep them in mind for future posts on these topics.
Overall, what you’re forgetting is the target audience of these posts, both the NYT piece and my own. The edX article (by its nature as a general overview of the field to an audience that isn’t focused on education) makes several sweeping generalizations about the role of automated scoring in the curriculum. This gives a broad-strokes overview of what can be done, which is good, but if teachers are reading it and trying to picture how it might facilitate actual teaching, the piece was tone-deaf. The impression that it gives off is that teachers are no longer necessary, as embodied by the particularly brash headline about giving professors a break.
My response isn’t aimed at MIT’s machine learning researchers – if they needed an explanation of machine scoring, they certainly shouldn’t be getting it from a blog post. If you want writing aimed at technical state-of-the-art researchers, look at the conference papers I’ve written, and you’ll see that they’re both incredibly informative (if you know what you’re reading) and totally useless (if you aren’t already an expert in the field). Similarly, my post isn’t aimed at Dr. Agarwal – it might get to him at some point, but I’m sure edX has a vision and is likely to stick to it.
Three days in, this post has been read over 1,000 times. There’s a bigger audience that ought to understand what’s going on. If all you read are general-press writeups of the technology, you’re going to walk away feeling rightly marginalized by futurist rhetoric, and teachers aren’t going to want to play any part in it. What I want to do is make sure that as many people as possible understand the technology as well as possible, so that rather than being ignored from the start, there’s a thoughtful discourse about how to use it in schools.
Clearly, you have some notion of how machine learning works. To you, it’s not only obvious, but insultingly obvious, that this is what’s being done under the hood. Do you really think that’s true of everyone in the education industry? Your rhetoric comes off as aloof and contentious, and I don’t think it’s helpful for raising awareness.
Laura Gibbs says
Michael, I think that is a great summary. If these topics are covered in detail in upcoming posts, I think that will be a really valuable conversation that will address questions of wide concern.
Elijah, I share your concerns about the gap between teachers and technologists. I’m not really clear on how you are bridging the gap; just from your comments here and the way you communicate with teachers at the LightSIDE website, it sounds pretty one-way and top-down, where you are creating a product driven by the technology and its potential, rather than something done in a collaborative way with teachers – and I can understand why that would happen even more naturally with your product than with other products, since your product cannot really be used in a normal class, right, but only on a massive scale? Yet even if your product cannot be used in a normal classroom, it seems to me you would benefit from working closely with teachers and the actual human feedback systems they use in order to have a close working knowledge of how teachers create their feedback. Ideally, you would be working with students also – and hopefully international students, since the student populations of the MOOCs are amazingly international (for all that I have my doubts about any massive class, the international community aspect is really fabulous!), and that adds an additional dimension to the feedback challenge. So if, as you say, you don’t feel confident writing the blog post about student reactions, then maybe you have a teacher collaborator there at LightSIDE who could provide a post like that?
Debbie Morrison says
Michael, yes – you hit all the key points (thanks). What I see as a critical component of effective use of machine grading is the perspective of the users – both students and teachers in this instance. What the machine can do is one-dimensional: it grades essays on certain components of students’ writing, with a high degree of accuracy. However, it gets complicated when considering the implications you have highlighted in your bullets, and a couple more with regard to the implementation of machine grading.
1) Knowing that teachers will be resistant, how can they be prompted to view and use machine grading software as a tool (to augment) and not as a replacement for grading student work?
2) How can students be prompted to see the feedback as a tool to improve certain aspects of their writing, and to value that feedback?
I don’t see machine grading as a replacement for the teacher but, as mentioned, as a tool in the grading process. Yet the perception is that the machine grader will replace real-person feedback, which perhaps is how it will be used, and in that case it will not be at all effective in the long term.
Laura Gibbs says
Debbie, I’m not sure that the kind of software Elijah is describing will ever be used by teachers in the way you describe – this software will only work when there is a massive corpus of student writing for the machine to learn from; so, unless you are talking about a group of teachers all using the same writing prompt and expecting to get the same answers from their students, this software does not really apply (see the comments addressed to individual teachers at the LightSIDE site).
This type of machine-learning software was originally developed for use in standardized testing, where the use of human graders is expensive and time-consuming and, according to some, undesirable (humans are not standardized in the way a computer can be). Since previously we did not have massive classes comparable to massive standardized testing situations, the applicability of such software to actual classes was kind of a moot point. No longer: now we have massive classes without actual instructors to provide feedback, hence a potential new use for this software, for massive classes where you have students + software but no teachers. I am not optimistic.
The kind of software you envision is more like WriteCheck and similar stuff being promoted by TurnItIn. My testing of WriteCheck showed it to be so full of errors that it was worse than useless. Even if it were free (and it is not), I could not recommend it to my students; details here: https://plus.google.com/111474406259561102151/posts/gVHAbR3KCb1
Debbie Morrison says
Hi Laura, Thanks for this info and the link to your post on Google +. This helps me get a handle on it better :).
I got off track when I was trying to flesh out the details of Elijah’s article. I read the post he linked to, “Better Tests, More Writing, Deeper Learning” (gettingsmart.com/cms/blog/2012/04/better-tests-more-writing-deeper-learning/), where I got the impression that automated grading software would be used as a tool (as per the quotes below). The post describes the grading software Elijah referred to as part of the Hewlett Foundation-funded Automated Student Assessment Prize (ASAP) project he was involved in.
“Automated essay scoring will allow teachers to assign 1500 words a week instead of 1500 words a semester. With the help of an essay grader, students get frequent standards-based feedback on key writing traits while teachers are able to focus on voice, narrative, character development, or the logic of an argument.”
and this one:
“Dr. Shermis agreed that automated scoring is designed to be a “teachers helper” as Laura suggested and would allow a teacher to make more writing assignments.”
But as your comment states, it appears that with the MOOC phenomenon the game has changed, so the software is not to be used as a tool but as a replacement for teacher or even peer grading.
Laura Gibbs says
Right, Debbie – in fact, I think we need some help with vocabulary to distinguish between these two very different things. I’m not sure what the technical term is that distinguishes these two different types of software!
1) On the one hand, there are grammar checkers and other content-neutral tools that help teachers grade student writing by attending (or, I would say, claiming to attend!) to some of the writing basics. Turnitin’s WriteCheck falls into that category, and there are other such software programs out there. (Personally, I find this software to be completely unreliable; I am much more interested in tutorial software to help students diagnose deficits and practice their skills – not in purported autograding.) These types of software programs do not necessarily learn as they go – instead, they are built with a certain set of fixed algorithms that process text, looking for predefined features that match the detection set. These are ready to use “off the shelf” as it were; they do not have to be calibrated to each assignment with a set of human-graded samples. Turnitin has packaged WriteCheck with their plagiarism detection software and is selling that as a package deal to schools so that teachers (not students) can use the data returned by WriteCheck as they grade and/or respond to student writing (and heaven help them, I say!). Students who want to use Turnitin’s plagiarism detection and grammar checker have to pay separately from their schools via the Turnitin website – yes, Turnitin is like an arms dealer that is supplying both sides in a civil war…
2) Then there are other kinds of software, much more innovative and admittedly very cool from a technical point of view, which are focused on having the machine actually LEARN to respond to the content of the essays. For such software to work, there has to be a massive corpus, with a fairly sizable number of essays scored by humans – the machines then learn to recognize the patterns in the essays scored by humans and to match those patterns in the massive corpus of essays written in response to the exact same prompt. That is the kind of software Elijah describes in his post, and here is what the LightSIDE website says if you are “a teacher interested in new tools for the classroom” – “What you need to do: It’s likely that you won’t be able to compile enough training data from your own classes to train LightSIDE. However, if the same course is being taught across sections, at different locations, or throughout a district or campus, then you might just be able to make it. Contact organizers at a higher level of your institution to try and collect sufficient training examples for the questions you’re interested in assigning.”
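To make that “training the machine” step concrete, here is a minimal sketch of how a type-2 system might be fit to a corpus of human-scored essays, assuming a scikit-learn-style pipeline; it is purely illustrative and is not LightSIDE’s actual code.

# Illustrative sketch only (not LightSIDE's actual pipeline): fit a model that
# maps surface features of human-scored essays to the scores humans gave them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical training data: essays written to ONE shared prompt,
# each already scored by a human grader on some rubric.
training_essays = [
    "The author argues that the evidence supports ...",
    "In this essay I will show that ...",
    # ... ideally hundreds more responses to the same prompt
]
human_scores = [4, 2]  # one human-assigned score per essay above

# Turn each essay into word/phrase counts, then fit a regression
# from those counts to the human scores.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge())
model.fit(training_essays, human_scores)

# A new essay on the SAME prompt gets a predicted score.
print(model.predict(["The author claims that ..."]))

Even in this toy version, everything the model “knows” comes from the pre-scored corpus for that one prompt, which is exactly why individual teachers are unlikely to have enough training data on their own.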
Michael Feldstein says
Yes, one of the areas that interests me is the potential for crowd-sourcing the training of a writing prompt. It wouldn’t have to be all at one department or institution, and it wouldn’t have to be part of a MOOC. For example, imagine a library of OER writing prompts around Huckleberry Finn. Teachers could use the ones they like individually. They could also participate in a group effort to train the machine feedback tool by agreeing to manually score some student essays using an agreed-upon rubric. After a time, you could have the machine ready to go.
Now, there are a bunch of challenges with this approach, before you even get to the machine doing its part. Here are a few:
But I think the payoff could be giving students more writing practice with immediate feedback. One of the pedagogical principles that the early MOOC research has affirmed is that the immediacy of the feedback matters. You want to bring the application of a skill as close to the teaching of the skill and the feedback on the application as close to the student’s work as possible. So if the goal is to give students lots of writing practice with feedback and set aside assessment as a goal altogether for these assignments, then this could be an interesting approach—provided, of course, that the feedback is accurate and presented to the students in a way that they find to be helpful.
Debbie Morrison says
Hi Laura! Ahh you are golden – thanks for making it clear!
I agree with you; it seems clarification of the teacher’s role in using the software would be helpful, since articles and posts like the one I mentioned allude to machine grading (type 2) as a tool to support teacher grading. Even in Elijah’s post, when discussing myth #1, he suggests that the teacher is still involved in the process with this statement:
“This technology isn’t replacing teachers; it’s enabling them to get students help, practice, and experience with writing that the education field has never seen before, and without machine learning technology, will never see.”
I look forward to reading more of the posts from Elijah and learning more.
Thanks Laura! 🙂
Laura Gibbs says
Michael, exactly – and that is a HUGE proviso (“provided, of course, that the feedback is accurate and presented to the students in a way that they find to be helpful”). I am completely assured that grades can be assigned quickly – but teachers can assign grades very very very quickly too. If all I have to do is write A-B-C-D-F or the equivalent on a paper, that takes me a couple of minutes. The bottleneck for the teachers is providing quality feedback, not assigning a grade. Until we see that there is valuable assistance that the computers can provide in the way of feedback, speculating about crowdsourcing to standardize our writing prompts and then to teach the machine is counting our proverbial chickens long before they have hatched.
Edward M. White says
Thank you for rebuking Ada’s aggressive and contentious tone, so common and destructive in academia. We do not need that here. I find the article and the discussion most helpful in clarifying what computers can and cannot do for the teaching of writing. From the start, the gulf between what the promoters and salespeople have been saying and what the serious researchers modestly claim has created much confusion. I appreciate the clarity and professional tone here.
Chris M. Anson says
Elijah: Thanks for your post. Analogies are good, and I appreciate the duck/house one because of the way it helps to explain the underlying processes that computers can be programmed to use. But what’s missing from much of the analysis here is the relationship between writing and underlying meaning. Computers may be able to “learn” external features of writing that are associated with human-provided scores. For example, it’s easy to see that, if human readers give higher scores to essays of students who use multisyllabic, Latinate words, a computer could add this tendency to a predictive algorithm for new, unscored essays. The computer is “looking” for surface-level linguistic features, but it is entirely unable to make judgments on the underlying meaning those surface features yield.
So consider this further example. Imagine that we feed 100 already scored essays into a computer. The essays are all written in response to a prompt imagining what it would be like to go on a vacation to the Bahamas. The computer creates a predictive matrix based on various (mostly surface) features of the sample. Some essays have punctuation and spelling errors. Some use very simple words indicating a limited vocabulary. Some are very short and lack detail (so that if such short essays are typically scored lower, the computer will learn to look for this feature in its matrix). Some use more sophisticated vocabulary and are scored higher. Let’s assume that all the essays make sense in terms of their meaning—that is, there’s nothing particularly illogical or wrong about the assertions. So far so good.
Now we feed 100 new, unscored essays into the computer, all identical except for one line. For purposes of this illustration, let’s imagine that half the essays contain the same line: “I couldn’t wait to board the plane to go to the Bahamas.” Twenty more essays have the line “I couldn’t wait to board the boat to go to the Bahamas.” Ten more essays have the line “I couldn’t wait to board the train to go to the Bahamas,” and ten more have “whale” instead of “train.” The final ten have the line, “I couldn’t wait to embark upon the cetacean to sojourn to the Bahamas.” Here’s what will happen: first, if the computer has created some predictive features, based on scored essays, that associate vocabulary such as “embark,” “cetacean,” and “sojourn” with stronger performances, it will score all of the final ten papers (all other things being equal) as higher than the other 90. However, there’s nothing in the predictive feature matrix to tell the computer that taking a whale or a train to the Bahamas is illogical. So on this feature, it will score all the transportation references the same way. Of the 30 “faulty” essays (train, whale low vocabulary, whale high vocabulary), ten (whale high vocabulary) will be scored higher than the rest, but the remaining 90 will be scored the same even though 20 of them (train, whale low vocabulary) are illogical.
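To see this blindness in feature terms, here is a toy sketch assuming a plain bag-of-words featurizer and a crude “longer words earn more credit” weight; no real scoring engine is this simple, and the code is purely illustrative.

# Purely illustrative: surface features see WHICH words appear,
# not whether the sentence makes sense.
from sklearn.feature_extraction.text import CountVectorizer

variants = [
    "I couldn't wait to board the plane to go to the Bahamas.",
    "I couldn't wait to board the boat to go to the Bahamas.",
    "I couldn't wait to board the train to go to the Bahamas.",
    "I couldn't wait to board the whale to go to the Bahamas.",
    "I couldn't wait to embark upon the cetacean to sojourn to the Bahamas.",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(variants).toarray()
vocab = vectorizer.get_feature_names_out()

# Toy proxy for "sophisticated vocabulary": longer words earn more credit.
weights = [len(word) / 10 for word in vocab]
for sentence, row in zip(variants, counts):
    score = sum(c * w for c, w in zip(row, weights))
    print(f"{score:.1f}  {sentence}")

# The plane, train, and whale sentences come out with identical scores --
# nothing flags the whale or the train as illogical -- while the
# "embark/cetacean/sojourn" sentence floats to the top.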
Now let’s extend the features a little beyond the one line in question. Imagine that five of the whale essays have some other nuances that suggest the writers are being ironic or sarcastic or creative. The computer can’t tell the difference between an intentional and unintentional error in meaning even if it were programmed to look specifically for cases in which something other than “plane” or “boat” were used as a means of transportation. In that case, it would downgrade the most creative essays under the assumption that they were wrong. Or imagine that some essays use the phrase “I couldn’t wait” to refer to their eager anticipation while others use it literally, to mean that they were unable to wait, let’s say, to board the plane because they were having gastrointestinal distress, went to the bathroom, and missed the flight. The computer—so smart yet so, so stupid—doesn’t know the difference. Imagine that an essay contains the line “My uncle kicked the bucket the week before and I was so looking forward to getting on that plane to the Bahamas.” If the idiomatic expression “kick the bucket” hasn’t come up in the human-scored essays, the computer won’t know that it may be considered too slangy and will overlook it. It also won’t be able to make an inference that the writer’s eagerness for a vacation was associated with her or his emotional distress (as opposed to, “My parents kicked me out and I was so looking forward . . .” or “I was kicking around the house when I learned I would be going on a vacation,” or “I had such a kick learning that I’d be going on a trip”). And on, and on, and on.
Now do a thought experiment: expand this highly limited domain of meaning (taking a means of transportation to the Bahamas) infinitely, into every possible nook and cranny of the vastness of human knowledge. Computers are simply unable to work with meaning beyond what they’ve been programmed to “know,” which is very little. This is what early researchers found at the Yale Artificial Intelligence labs in the 1970s and 1980s—here I urge readers to look at work by Schank and Abelson, particularly their groundbreaking book, Scripts, Plans, Goals, and Understanding. Unlike many machine-scoring aficionados, these scholars fully understood the nature of human language and its relationship to underlying meaning. Little progress has been made to rectify the serious problems they encountered trying to make machines work with language “naturally,” in the way that humans work with it. You can program a machine to look for cases in which students have used a specific word in biology (as opposed to a more general word) as an index of understanding, but it’s not predictive of real understanding. It won’t—it can’t—discern whether a student really understands a concept or is just using words that are associated with it. One of my friends at MIT claims that the AI researchers he knows—these are among the most brilliant AI people in the world—just laugh at the prospect that computers can do anything really meaningful with actual meaning conveyed through writing.
I made these points more extensively, with much more about the concepts of schemas, in Anson, Chris M. “Can’t Touch This: Reflections on the Servitude of Computers as Readers.” Machine Scoring of Student Essays: Truth and Consequences. Ed. Patricia Freitag Ericsson and Richard Haswell. Logan: Utah State University Press, 2006. 38-56.
I applaud those who have called attention to this problem. Too many people eager to have machines score even large samples of writing are thinking of “writing” as if it’s just a shell that can be judged regardless of what fills it. That’s a serious mistake. It’s also one of the major perils for anyone who teaches and cares about good writing: computers do best when 1) the domain of the prompt is highly constrained; 2) the texts are very short; and 3) meaning is mostly ignored in favor of looking for simplistic linguistic structures. Is this what we really want to do with writing in education?
I assume we trust researchers when it comes to other domains of knowledge like climate change. In this case, we need to trust scholars of language, writing, and linguistics and not just take the word of computer programmers—or, better still, let’s put them into a more productive dialogue with each other.
Chris M. Anson
University Distinguished Professor
North Carolina State University
Laura Gibbs says
Thanks for your great comments, Chris – the examples are very well chosen. Being oblivious to such howlers is robograding’s dirty little secret, even though it is already being used extensively to score writing (as in the CLA; they have some research papers and such online here: http://www.collegiatelearningassessment.org/research). As long as the grading is only being reported as a statistical aggregate for a large cohort, the random outliers (which is what the howlers would be, such as taking a train to the Bahamas) really don’t matter as far as the aggregate goes. But when you start talking about giving individual feedback to students about specific pieces of writing, then it DOES matter. The accuracy of the scoring overall in the aggregate is one thing; providing accurate feedback to individual students about individual pieces of writing is something entirely different. I am surprised at the claims that people seem to be making about the ability of the software to provide useful feedback to individual students. So, yes, I can see the advantages for large-scale scoring, and it is also possible to find statistical comparability between the numerical scores assigned by machines and the scores assigned by humans… but actual feedback to individual students? I just don’t see how we could ever be confident in the quality of that individual feedback for every single student, including those who think they are going to take a train to the Bahamas.
Steve Fox, Director of Writing, Indiana University Purdue University Indianapolis says
Elijah’s post is interesting and thoughtful, and I’m sure that if teaching experts in various fields work with people like Elijah, they may develop some useful tools for learning. Meanwhile, however, policy makers and education “reformers” (beware such people), impatient with the complexity of human behavior, will continue down the current road that exalts standardized curriculum, testing, and assessment above professional learning communities, small schools, teacher-student and student-student interaction, and creative problem solving in idiosyncratic situations. Look at the Common Core State Standards. They aren’t terrible as standards go. When thoughtful, smart teachers work together to develop curriculum based on these standards, they do some good work–good, local work. But PARCC is developing standardized tests that will be used nationally to evaluate millions of students, their teachers, and their schools. And Pearson, at least, one of the vendors to have won a contract from PARCC, plans to use automated essay scoring, rushing full speed ahead where more thoughtful people like Elijah (and Chris Anson and Ed White and Les Perelman of MIT) fear to tread. Instead of raising the standards of the teaching profession and providing teachers (and students) with supportive environments within which to work and grow, we will judge millions of students by a few hours of standardized performance on standardized tests, and then by extension judge their teachers as effective or not. Who will want to teach in this environment? And as Debbie points out, who will want to learn in such ways? When I think of the good teachers I had through the years, and the good teachers I work with now, I would rather put money into supporting them and bringing more such people into the profession than into making large corporations like Pearson and ETS richer and richer, meanwhile demoralizing teachers. Read the lengthy contracts and reports on the PARCC website and tell me it doesn’t make you want to turn away weeping for what we are letting policymakers do. PARCC and Arne Duncan and Bill Gates and many state governors and legislators don’t want to hear the complex arguments we see on this site. They do want silver bullets, and they are paying good money to buy such bullets.
Michael Feldstein says
Chris, I think a lot of the problems that you point out arise from the problematic application of these tools to “grade” or “score” a paper (which also potentially has more far-reaching consequences for our entire system of education, as Steve points out). That’s not a problem of technology. It’s a problem of the application of technology. I find a grammar checker useful as a writer even though I don’t believe it is an appropriate technology to grade my grammar competence. Nor would I expect a grammar checker to provide deep guidance on my sentence structures. It’s just a tool to help me be more aware of certain aspects of my writing, where that awareness helps me to become a better writer. As Laura and I discussed earlier in this thread, even a tool like a grammar checker needs to be positioned carefully and designed with particular pedagogical intent if it is going to be useful to students who aren’t good at knowing when to ignore its advice. But I believe that is a problem, once again, of how the tool is designed and positioned rather than an inherent disqualifier of the technology because it doesn’t give perfect feedback. We don’t tell students to stop using the internet just because they might find wrong information on it. We teach them how to use it appropriately and apply a healthy level of skepticism.
Laura, I guess I still don’t fully get your objections. Or else I don’t agree with them; I can’t tell which yet. To the degree that it’s about skepticism that the technology can provide accurate enough feedback to be more helpful than harmful to students, I get that much. But that’s an empirical question. But you seem at times to suggest that any significant level of error in feedback from the software is intolerable. I just don’t believe that to be the case. Students get imperfect feedback and imperfect information all the time. What is the outer bound for you? Does the software have to provide feedback that is as accurate as what students would get from peer review? More accurate? As accurate as one would get from the teacher?
Debbie Morrison says
Steve references in his comment what I see as the primary barrier to this grading software tool providing any benefit to students (I stress ‘tool’ here, as this is what policy makers and administrators seem to overlook: it should be viewed as a tool to support learning, not a solution).
Kids in US public schools are tested, and tested more. And it’s not for their direct benefit. The majority of kids despise it – they don’t see the value or purpose. Now we introduce another type of assessment, machine grading. Who is the primary beneficiary of machine grading? The teacher, by not having to grade the student essays, or the student?
Machine grading, if used as a grading function to determine a student’s progress is similar to standardized testing. It removes the emotional and personal connection that our kids need to learn.
Laura Gibbs says
Michael, a machine cannot EVER give the same kind of feedback for writing as a teacher. So it’s not about accurate or inaccurate in a way that can be compared one against the other. You can compare the SCORING of humans and machines (because scoring is a totally reductive type of analysis), but the feedback humans and computers give for a piece of writing cannot be compared – they are proverbial apples and oranges. Or apples and zebras.
Teachers make errors in feedback all the time, from haste, sloppiness, ignorance, inattentiveness, mean-spiritedness, and so on – that’s where the errors come from. But teachers DO COMPREHEND what they are reading. Computers, on the other hand, do not – so the source of errors in computer feedback is completely different from the source of errors in human feedback. Feedback from humans, flawed as it may be, is based on a level of comprehension that a machine can never, ever achieve, which was the point of Elijah’s original post.
Machines, as Chris pointed out, simply have no comprehension of what they are reading. Instead, they are able to provide a statistical description, a superhumanly accurate statistical description, of the data they are given. I suspect that turning that statistical description (average word length, average sentence length, vocabulary targets) into something that constitutes useful feedback for a student is going to be impossible.
Of course, it’s all speculation at this point – no feedback of any kind has been offered up as a sample for our inspection here. We do know, however, that the kind of feedback that Elijah envisions is quite different than the right/wrong feedback of a grammar checker or spelling checker which is based on a fixed set of algorithms that disregard actual content (i.e. the writer’s attempt to make meaning), as opposed to a machine learning system which does learn to respond to content. Elijah’s system aspires to provide content-related feedback, but since the machine does not really understand the content (not the way humans do), I am not confident that any useful feedback for individual students will result – and the inaccuracies will be of a wildly bizarre nature, utterly different from anything a human being might provide. If you’ve played around with Google Translate or a Turing Test Machine, you know what I mean – it is bizarre as in alien-from-another-planet bizarre.
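For concreteness, here is what such a “statistical description” might literally look like, sketched in plain Python; it is an illustration of the general idea, not a sample of any product’s output.

# Illustrative only: the kind of shallow measurements a machine can take
# of a text without comprehending any of it.
import re

def describe(text):
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "distinct_vocabulary": len({w.lower() for w in words}),
    }

print(describe("I couldn't wait to board the whale to go to the Bahamas. It was huge."))
# Every number returned is accurate; none of it says whether the text makes sense.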
Elijah Mayfield says
Chris, I agree with most of what you’re saying, and the foundations of your claims are correct. I have minor quibbles with your examples – I think surface features are better at distinguishing between those cases than you give them credit for – but your general thrust is correct. I’m puzzled, though, at a jump that you’re making. First, you claim that computers cannot understand meaning, and I agree with that completely – your MIT friends are correct. You then say that computers cannot do anything meaningful with meaning. I’m mostly going to agree with that statement, too. However, you’re assuming that in order for computers to have a meaningful impact, they have to have the deep semantic understanding of humans.
That’s where I’d disagree with you. I think there are many potential ways in which features that are correlative but not necessarily causal can still lead to an automated interpretation of a text which, if not “insightful” in the same way as a human’s, is at least specific and detailed about precisely the reasons why a system makes the judgment that it does. To claim that there’s no way to get from “a detailed rundown of precisely the features of your writing that would be judged a certain way by your teacher” to useful feedback for a student seems like an extraordinarily pessimistic view of technology. Is the field there yet? No, I don’t think so. Is the field aiming in that direction? For most of the companies in this industry, not really; I think most of the people here are right in saying that there’s a push towards standardization, rather than teaching, which gets the lion’s share of the attention.
That doesn’t mean, however, that the technology can’t be used in a productive way within the classroom. I think it’s useful for teachers, for students individually, and though I didn’t emphasize it in the initial post, there’s also potential for impact within collaborative learning, as I briefly alluded to in a response to Debbie. These things aren’t set in stone but they’re far from the hopeless case that you, Laura, and others make them out to be.
Elijah Mayfield says
Steve, I think your points of fact are correct, but you’re taking a pessimistic, defeatist tone to it that disappoints me as a researcher in this technology. While edX and other organizations are overly broad in their claims, it’s a far less damaging attitude than Les Perelman’s reactionary, dismissive, and harmful tone on the matter of automated scoring and assessment. This technology does have potential, and yes, that potential is most easily realized through standardized testing. It also has formative potential, as Mark Shermis and others in the field would happily agree – but that’s rightly a tougher nut to crack, because getting input to students right is a hard thing to do, something that most teachers struggle to do, and that schools or districts as a whole *really* struggle to do.
The incentives are there for summative assessment and group-level statistics. Organizations with massive staffs and costs, like Pearson or ETS, have to focus on those things by nature in order for the endeavor to be self-sustaining. A company like LightSIDE Labs, on the other hand, is tiny, and our costs are concordantly tiny. If we can get the opportunity to work with a handful of ambitious and thoughtful partner organizations early – and that’s exactly what we’re doing, and are happy to do with other universities or companies that approach us – then we can explore a space of uses for this technology that doesn’t have the same massive cost-reduction potential but still saves enough money and time that it’s worth it to work with us and pay us, while actually making a difference in the quality of teaching.
(an aside: that’s the closest I’ve come yet to a direct pitch for my company – consider it an infrequent sojourn into the realm of something resembling “marketing.” My purpose on this blog is to inform on general issues in the field, rather than sell the work I’m doing in particular, but sometimes the two overlap)
Elijah Mayfield says
As a last aside, I don’t want my previous comment to sound like people at those larger organizations aren’t looking into this. They are, and there are very smart people there who are thinking deeply about these issues. However, it’s fair to claim, I think, that those products are harder to “get right” and will take a while to come to market, especially from larger and slower-moving companies. Automated grading in an SBAC/PARCC fashion is somewhere we can see immediate benefit from this technology, so of course that lowest-hanging fruit is where it’s first going to make inroads.
Laura Gibbs says
Elijah, I am trying to keep an open mind here – but your vacuous dismissal of Perelman is not helpful, and if you are going to consider people with valid reservations about this approach as just “reactionaries,” the dialog is not going to move forward as it needs to. Sooner or later, the proponents of machine-learning driven assessment and feedback systems do have to deal with the well-documented concerns that many educators, Perelman among them, have about this entire project; for people who have not been following that larger discussion, there is a good list of research findings here at the humanreaders.org site:
http://humanreaders.org/petition/research_findings.htm
These concerns cannot simply be dismissed as reactionary. These researchers raise valid and extremely important concerns about educational quality which you also say is your motivating concern. If you want to refute their claims, you are going to have to do so with substance; an ad hominem dismissal like the one you make here is not helpful.
Chris M. Anson says
Thanks, Elijah. Let me start by saying that this blog sets a standard for reasoned, interesting, diplomatic response; I wish interactions about important issues, especially politically oriented ones, could be so civil. Hats off to the folks who have been commenting here and, as Ed White said, to you for the tone you’ve established.
On your response–I’d rambled on in my last post quite a bit and actually eliminated some further remarks about technology and also about the process of inferencing, which is where computers really suffer. So let me add back in the part about my own position. I’m no Luddite; in fact, I love technology. I love what it enables. I love how it assists in so many areas of intellect and knowledge sharing. I love its capacity for democratizing access to information, and I love how it encourages writing (the current generation of learners writes more than any generation in history because of it). I’m almost hard-wired to it. I teach in a cutting-edge Ph.D. program in Communication, Rhetoric, and Digital Media, which is *all* about applying technology to teaching and learning. I’m actually strongly in support of using computers to analyze text–it’s part of my field (writing studies) and has yielded wonderful insights into style; differences in written communication based on demographic factors such as gender; the use of lexis; correlations of various language patterns; and so on. I’ve used it for that in an NSF grant, teaming up with SAS, which has its international headquarters here, to analyze 27,000 samples of student writing using their sophisticated software, which could make millions of passes through the data in a matter of seconds. I also love how technology can support and strengthen teaching (I’ve been doing studies on the use of voice-accompanied screen-capture technology in teacher response to student writing). So I’m in complete agreement that we need to keep pushing ahead at the interface of language and computers.
You’re pointing to a distinction between summative uses of technology (to score, say, test essays) and formative uses (to help people learn about their writing and potentially to write more effectively). For all the reasons many have pointed out, especially Steve Fox, I worry a lot about the use of machine scoring for large-scale testing, partly because of its effects on teachers (who try teaching “to” the reductive, limited kinds of writing that such testing favors–an issue as much about the nature of the tests as who or what scores them). I agree with you, however, that on some limited basis, computers might help formatively, even if they can’t provide much or any insight about the meaning students are trying to convey. But the implications also concern me. Imagine schoolteachers, persuaded by powerful for-profit purveyors of programs, having students use computer feedback instead of giving it to them authentically or building in opportunities for peer response: real, meaningful interactions that help the reader/responder learn as much as the writer. Since the computers can’t work at the level of meaning, young kids will develop a sense of themselves as writers–and of writing as a communicative medium–as something that can be fixed from mostly surface-level feedback, and without interaction with the computer “reader.” In universities, we struggle to help students see the complex relationship between ideas and their expression. Sure, expression matters: sentence balance and rhythm, correctness, effective use of words, paragraph length, and so on. But it’s utterly intertwined–imbricated–with meaning. You can’t really pull them apart. I’ve tried several programs designed to offer formative feedback, even in pretty restricted domains such as summarizing a reading, and I’ve been really disappointed because 1) they obviously don’t have a clue about what I’m really trying to say, and 2) they often get even the surface stuff wrong. (Simple example: identification of passive constructions. In a scientific report of medical research, no one writes “the research assistant spun down the sample in a centrifuge” because who did it is beside the point. But some programs, even Word, see the passive as something to be corrected.)
I’m persuadable. I’d love to have access to some good formative computer feedback programs, but Pearson and others don’t allow that any more (I experimented a lot when they did). Why? Because they know that once we start using them, we quickly discover my two disappointments above. Why should school districts and universities spend scarce resources on programs that pale in comparison to what teachers can do? Isn’t it better to spend our money helping those teachers do their jobs effectively and with appropriate support? Let’s save the technology for its most productive uses: research that doesn’t negatively affect the lives of teachers and students.
I have no doubt that, over (much) time, we’ll program computers with astonishingly robust funds of linguistic and world knowledge so that they might be able to interpret meaning the way HAL did in the film 2001. And that generation (because it won’t happen in my lifetime) will face the challenge of what role teachers will play, if any, in students’ learning.
Joshua Probert says
I’d add that machine learning algorithms can also give you a confidence score for their classification of a paper. If you have a bunch of typically “A” and typically “C” papers but also one that doesn’t easily fit into either category (e.g. the “creative” paper in Myth #4), it’s reasonably likely it could get flagged for manual grading.
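A minimal sketch of that flagging idea, assuming a probabilistic classifier in the scikit-learn style; the example essays, names, and threshold are made up for illustration.

# Illustrative only: if the model isn't confident an essay belongs to the
# "A" pile or the "C" pile, route it to a human grader.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

graded_essays = [
    "a clear thesis with well supported arguments ...",
    "another strong, well organized essay ...",
    "vague claims with little support ...",
    "another weak, disorganized essay ...",
]
grades = ["A", "A", "C", "C"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(graded_essays, grades)

def route(essay, threshold=0.8):
    probs = clf.predict_proba([essay])[0]
    if probs.max() < threshold:
        return "flag for manual grading"  # the model isn't sure which pile this is
    return "auto-score as " + clf.classes_[probs.argmax()]

print(route("an unusual, creative piece that fits neither pile"))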
Phil Hill says
I hope this doesn’t pull us off track, but I’ve noticed that the primary discussion alludes to human feedback with descriptions coming from Laura, Debbie, Jonathan, Chris and Steve from the faculty side. But what about the case of courses where faculty provide almost no formative feedback, or when that feedback is so untimely as to lose its value?
My daughter (8th grade, but advanced) wrote and turned in an essay 2 months ago that she put a lot of work into. I even got some feedback from Laura and Meg Tufano on her paper via G+. Well, she got a 90 (which she only found out about by logging into the online grading system), and she still has not received a marked-up paper or any form of feedback beyond a summative grade on the draft. 2 months! She has gotten more feedback from Laura and Meg than from her own teacher.
What is the role of automated essay grading / feedback in this case? Could it be used to provide *some* feedback, separate from a grade, in a timely manner? Is it possible that such tools, properly applied, might even help the teacher’s time management so that she could provide her feedback that includes understanding of the meaning?
This has been a fascinating discussion (one of the most informative I’ve read in the past year), but I sense that we’re missing a key issue.
Laura Gibbs says
Phil, she must be so frustrated, argh! I worry that rather than grappling with real problems of curriculum and teaching, people might want to have machines solve the problem. Not only do I think that is quixotic, as I’ve explained above, but it also distracts us from the really serious problems going on re: curriculum and teaching. Yet this dilemma is a great opportunity: your daughter wrote what she did for a human audience of course – would she be satisfied (or somewhat satisfied anyway) with a computer-generated response of some kind…? Ask her! I would be so curious to know.
You guys could take a look at the little video there at the LightSIDE website to get a very rough sense of what you get as a response with their system: in the example they provide, there are three colored bars that are illuminated as holistic feedback, and then you can see your own text highlighted with the “good” parts in one color and the “not good” parts in another color (you’re then supposed to maximize your good tendencies and minimize your bad tendencies), but there’s no sense of a dialogue, no sense of someone interested in what you are actually saying. So, the feedback is not really even exactly what we would call feedback in the usual sense. But I would be really curious what your daughter would think about this… and yes, I understand her frustration very well. In many college courses you get no substantive feedback on your writing at all – you turn in a final paper at or near the end of the semester, you get a grade on it, maybe a few comments. It’s a hugely common problem – but it’s a problem we have brought on ourselves, and I don’t think the machines can save us from what we’ve done (or failed to do). College students deserve better and I think your daughter does too! I would be really curious to know what she would think of computer feedback as an alternative.
David says
As someone who teaches philosophy, I see this technology as being highly promising. I’m also relieved to see the emphasis on using this automatic grading tool as a learning supplement rather than a grade giver. We are still in early days and future editions will surely improve. (For example, the people who are working on context-aware grammar checkers can eventually merge the feedback of their tools with the output from the Bayesian program described here.) But already, I would be prepared to use the present version in my courses. It wouldn’t do any grading – that’s my job – but it would be a filter on student submissions. I would make it that no student is able to submit to me a paper that got less than a B from the software. If they can’t do that on the first try, they had better keep revising and resubmitting, maybe even get off their asses and run to a tutor before the paper’s due date. If they can’t, then yes, they will learn a valuable lesson from their F. (I can see making exceptions for papers of non-native English speakers.) Too many of the most frustrating papers to grade are the ones where the students simply did not give a fuck about what they wrote, but some of it is right, and you have to make difficult judgment calls about how much they understood based on their careless writing. The software would surely recognize that carelessness and act as my bouncer. This benefits both me and the student. Also, I like the idea that the student would have two audiences they need to satisfy: Me, who is very sensitive to the rightness of the details, and the machine, which picks up more holistic and formal features.
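That “bouncer” workflow, in which the software never assigns the course grade but only holds back drafts below a bar, can be sketched in a few lines; the grade scale, threshold, and names below are hypothetical, and the predicted grade would come from a trained model like the ones sketched earlier in the thread.

# Illustrative only: the software never assigns the course grade; it just
# refuses to pass a draft to the professor until its predicted grade clears a bar.
GRADE_ORDER = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}

def accept_submission(predicted_grade, minimum="B"):
    """True if the draft may go to the instructor; False means revise and resubmit."""
    return GRADE_ORDER[predicted_grade] >= GRADE_ORDER[minimum]

# predicted_grade would come from something like clf.predict([draft])[0]
print(accept_submission("C"))  # False: back to the student for revision
print(accept_submission("B"))  # True: goes to the professor, who does the real grading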
Phil Hill says
Laura, unfortunately it’s spring break and I just dropped off my daughter for a week-long trip to Mexico (building houses, not upside-down margaritas) – so I can’t ask her for a while. But you ask a good question (would she like the type of feedback that is possible today).
What I’ve seen is that she was chomping at the bit to revise and finish the assignment, but she’s losing interest and might ‘check out’ of the assignment or course soon. I sense that she would appreciate *any* feedback and the opportunity to make the paper better, but of course that is speculation.
By the way, I am not suggesting that technology is the solution to curriculum / teaching problems (I really liked Debbie’s description of tool vs. solution). The technology might not be ready yet – and if misapplied could do more harm than good – but I do see potential when put in the right hands.
Laura Gibbs says
Phil, I hope your daughter has a great trip! And I can definitely understand her frustration – one of the things I think is most important in assigning written work to students is to make it clear just why it is being written and what kind of feedback can be expected. In my classes, students do some writing which is a kind of “learn by writing” type of writing (writing responses to the weekly readings), and they get peer feedback from one another about that but not from me – while each week there is also a piece of formal writing that they turn in (part of a semester-long-project) which I read and comment on in detail, in addition to peer feedback. I keep what I call the “the stack” visible as a Google Document so that people can see I got their assignment, and I announce daily in the announcements (http://ouclassannouncements.blogspot.com/) where I am in the stack so that they know how soon they can expect feedback. It should never be more than 4 days maximum; usually it is quicker – and it’s those who turn in their assignments earlier who get feedback sooner.
My strong feeling is that it’s better to do less writing but with more feedback and revision, as opposed to doing extensive writing without feedback and without revision (the typical college writing experience). My students turn in assignments each week that are 700-1200 words, and they revise each assignment at least twice, and more often as needed. They get sentence by sentence commentary from me; for a fresh piece of writing that takes 15-30 minutes; for a revision, that takes 5-15 minutes. I spend appx. 30 hours per week working on my students’ writing… but then, that’s my job: I am a full-time online instructor, and I teach writing intensive classes. I love my job and I happen to think there should be more such jobs at the university level – it’s a great way to improve student writing and to get the students excited about writing. Having sustained, supportive, timely feedback from a skilled and experienced instructor is successful for students at all levels (and I see a huge range of writing skills among my students, from students in need of very remedial help to students who are ready to begin their careers as professional writers). As I am just an adjunct, it is not expensive; the university is, in fact, making quite a tidy profit off of each of my students (instructional costs per student per course: $300 … students pay $1000 per course).
I bring all this up because when it comes to the college writing experience, such as that described by David above, I think there are some excellent solutions, proven solutions – but I do not think a computerized gatekeeper is really going to improve the writing experience for David’s students or David’s experience in reading those papers. If students don’t give a flip about what they are writing, they are indeed likely to turn in poor quality writing or, even worse, plagiarize. A computerized assessment is not going to help them become more engaged with what they are writing, and I doubt it would alter their writing in anything but the most superficial way. What is required, in my opinion, is to completely redesign the writing assignments and the writing process for the class. In my first year of teaching college writing, I experienced total frustration when I relied on the typical sort of writing assignment – I was bored by what the students wrote and they were certainly far more bored than I was. Boredom is one of the biggest problems in college writing – student boredom, and teacher boredom as well.
The use of computers, and asking students to write for a computer as their audience, seems to me likely to increase that boredom rather than to alleviate it. For these machine learning scenarios to work, the writing prompts must be profoundly constrained in an attempt to make the students write as much like one another as possible. My goal is actually just the opposite – I try very hard to prompt students to write unprecedented work that surprises and delights them as much as it surprises and delights me. Computers cannot experience this surprise and delight; quite the contrary: such surprises can cause serious problems for their algorithms.
David says
Laura, I think that terrible papers come from two sorts of careless students: the sort that realize they’re handing in terrible work, and the ones that don’t. From my experience, that split is about 25/75. The more numerous group are some strange creatures, and I’ve invested some thought into figuring out what makes them tick. First, they’re not very invested in their educational experience. No surprise there. But more interestingly, they seem to have a fundamental skepticism about the reliability of grading in the humanities. Some honestly think that if their paper is long enough and they did the reading, they are entitled to a C, no matter what they actually said. I think that honestly, they would trust a computer grader more than they trust me, because computer results would be seen by them as “objective” – whereas my “F” is interpreted as me simply not liking what they had to say.
They are also the kind of people who never practiced going through revisions, or ever polished a paper. They didn’t have a Laura Gibbs as a writing instructor. This is understandable, since at this point, that is the most labor-intensive thing you can do in education. I’m certainly not in a position to workshop writing with my students; I have a life to live. My job is to teach philosophy and assess how good they’ve become at it, through the essays they hand in. Of course I would also like to make their writing a focus of attention, to go through drafts and offer feedback, etc. But I only have time to do this with students who choose to do it with me, and they are not the ones who are failing my classes.
I’m pretty sure that a poor paper which is blocked by a gatekeeper from being submitted would get revised. Very few students are so jaded as to not try again (and such students do themselves no favors by staying in college). Sure, they would be thinking about how to satisfy the computer, but at least they would be thinking – and revising. Even a moderately accurate gatekeeper would do a lot of good. But you’re wrong that the good and average students would write in order to please the computer, because it’s still me who does the grading, without any input from the computer. I would just be grading fewer first drafts, and everyone would win, for the trivial cost of running an algorithm.
And since you brought up plagiarism: For this, we have a very easy algorithmic solution, and I guess I was assuming that a submission for digital grading would also run an analysis against some sort of plagiarism database. Let’s face it: we humans have already been bested by computers in the area of plagiarism detection. It’s a new world!
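To give a concrete sense of what that “very easy algorithmic solution” amounts to, here is a minimal sketch of n-gram overlap screening in Python – purely illustrative, and not Turnitin’s or any other vendor’s actual method; the corpus and threshold below are hypothetical.

```python
# Minimal sketch of n-gram overlap plagiarism screening (illustrative only;
# real services use far larger databases and more sophisticated matching).
import re

def shingles(text, n=5):
    """Return the set of overlapping n-word sequences ("shingles") in a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(submission, source, n=5):
    """Fraction of the submission's n-grams that also appear in the source."""
    sub, src = shingles(submission, n), shingles(source, n)
    return len(sub & src) / len(sub) if sub else 0.0

# Hypothetical usage: flag any submission sharing too many 5-grams with a source.
corpus = {"essay_042.txt": "It was the best of times, it was the worst of times ..."}
submission = "As Dickens wrote, it was the best of times, it was the worst of times."
for name, source in corpus.items():
    score = overlap_score(submission, source)
    if score > 0.3:
        print(f"Possible overlap with {name}: {score:.0%} of 5-grams shared")
```

The real work, of course, is in the size and coverage of the reference database, not in the matching itself.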
If I have a worry about these software gatekeepers, it’s that they are created by profit-seeking companies. Services like Turnitin are far too closed and extortionate, in my opinion. That aspect offends my academic sensibilities. A consortium of universities should create an open source alternative, where the act of donating data is the only membership cost. I think the same of grading algos. They too should be open source, and universities should be paying people to refine them directly, instead of paying for a subscription. But then again, I think the same thing about textbooks, and we are depressingly far from this goal.
Laura Gibbs says
David, the tone of your second comment here is certainly very different from the first; I misread your complaint the first time because you sounded just kind of fed up with the whole business of student writing. The idea that students are very suspicious of grading in the humanities is quite true (that’s why I only give feedback on my students’ papers and they revise and revise; I don’t grade the iterations). I suspect, though, that since the students are starting from a position of fundamental mistrust, just as you say, they will be even more mistrustful of a computer that cannot explain a bad grade in terms they understand when they feel they have been graded unfairly. And if they find out the bad grade was assigned for statistical reasons, that will be even more frustrating, since it is something they can never perceive for themselves; they can never hold their paper and the 100 pre-marked papers in their minds in order to grasp the situation as the computer sees it. Computer-based grading is objective, yes, but it is not the objectivity of marking multiple choice or short answers on a quiz, which is based on “the right answer” v. “the wrong answer.” Instead, it is the objectivity of a complex statistical analysis that even the teacher probably would not understand or be able to explain in detail, much less the student. If students already feel alienated from school and/or confused about why they are doing poorly (and I agree that there are students who fall into both categories), it seems to me that a computer is only going to increase that sense of alienation and confusion.
But we shall see. Exactly because there are so many people who are assigning writing without time for a real writing process in their classes (and Common Core will make this a problem of epidemic proportions in K-12), we are going to see teachers who are legitimately without the time to do what they are being asked to do. My solution is to change what they are asked to do – and as I said, I’m a proponent of doing LESS writing, but doing that writing BETTER (more readers, more revision, more variety, etc.). In any case, computers are not the only solution to curriculum problems in our schools … although they may be the only way to manage the MOOC mania, I will grant that.
Just practically speaking, if you really are serious about wanting to use this kind of software, it means you would either have to use the same writing prompts as other philosophy faculty and share out the duty of “teaching the machine” among you, or you would have to take on the burden of teaching the machine yourself. Given how large your classes are and how limited your writing prompts are already, that may or may not be feasible. The examples provided by LightSIDE are for writing that is a paragraph long in response to an extremely narrow prompt. I have no idea if they are prepared to tackle a typical multipage paper in which there is anything at all open-ended about the prompt, any hint of something creative or innovative that a student might be allowed to do in their writing that would not already be fully anticipated by the pre-marked papers used to teach the machine.
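To make concrete what “teaching the machine” involves, here is a generic sketch – assuming scikit-learn, and emphatically not LightSIDE’s actual pipeline – of training a scorer from pre-marked papers written to a single prompt; the training examples below are hypothetical stand-ins for the hundreds of human-scored papers such a system would actually require.

```python
# Generic sketch of training an essay scorer from pre-marked papers for ONE prompt.
# Not LightSIDE's actual pipeline; assumes scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Pre-marked training papers (text, human score) for a single writing prompt.
training_papers = [
    ("The map shows three major trade routes ...", 4),
    ("Trade routes are things on maps ...", 1),
    # ... hundreds more human-scored examples are needed in practice ...
]

texts, scores = zip(*training_papers)

# Describe each paper as word/phrase statistics and fit a regression onto the scores.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    Ridge(alpha=1.0),
)
model.fit(texts, scores)

# A new paper for the SAME prompt gets a predicted score; a different prompt
# would require collecting and hand-scoring a whole new training set.
print(model.predict(["The map indicates two overland routes and one sea route ..."]))
```

The point of the sketch is simply that the model is prompt-specific: the hand-scoring effort has to be repeated, or shared among instructors, for every new prompt.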
The question of open source algorithms speaks exactly to the problem of explaining grades to the students. I fear that this is not going to happen in an open source way at all – which means you will be trying to defend marking done by a computer whose algorithms are in fact unknown to you. The TurnItIn algorithms are pretty easy to guess, so the fact that we don’t know exactly what those algorithms are does not hamper us (although I don’t use TurnItIn myself – and that’s a separate discussion). In this case, the algorithms are not going to be anything obvious at all, so, yes, I imagine we are just going to have to “trust the computer” and tell the students to do the same.
Since I don’t want to come across as a reactionary defender of the status quo, let me add that there are lots of things I would like to see done differently in the teaching of writing – but going in a very different direction than computerized assessment. If people want to read a completely different kind of piece about writing innovation, I recommend this recent article by Mike Rose in Inside Higher Ed:
http://www.insidehighered.com/advice/2013/04/05/making-our-writing-matter-essay
Aaron Nielsen says
I think that myth number five is the real problem. We know that people tend to believe what’s “mathematical” more than other things. Daniel Kahneman wrote in “Thinking, Fast and Slow” about his own experience grading exam essays by hand: the score he gave a student’s first essay strongly influenced the scores he gave that student’s subsequent essays, unless he forced himself to read the essays out of order and hide the earlier scores. I fear that instructors will be biased by the computer’s score, and they will not be cognizant of those biases regardless of the computer’s actual performance.
We really need to consider how these kinds of biases affect people’s interpretations. The current crop of early literacy tests (e.g. DIBELS) suffers from a lot of these issues. The interpretation of results is quite statistical, which is difficult for a person to comprehend without a lot of hard thinking, and in my experience with my child’s teachers, they have not been properly trained to understand this.
Norma Ming says
Laura, you make an interesting point about balancing “extensive writing without feedback and without revision” against “less writing but with more feedback and revision.” I think this raises a key issue about matching the assignment (and its method of assessment) to the learning goal. When is it better for students to do deep writing and revision, of the sort which is best handled by ongoing interaction with expert human writing instructors? When is it better for students to write short products that could potentially be assessed by automated systems? When is it better for students to answer well-designed multiple-choice items that can be machine-scored? I believe we need to examine this from the students’ perspective (how they benefit) as well as the teachers’ and designers’ perspectives (how much effort it takes to create these assignments and to assess the student products). All three may have value, but we need to determine how to allocate time and resources across them.
As Phil points out, if the teacher is not able to give prompt and helpful feedback on a particular assignment, automated assessment could help (in addition to providing PD to help human feedback become more efficient and possibly realigning the assignments). Even the most dedicated and efficient teacher is limited in the amount of high-quality feedback s/he can provide. When we reach those limits, we could assign less writing, hire more graders, increase peer feedback, and/or develop better technologies for providing some automated feedback. Given the limits on the first three, I’m interested in exploring how the last can help.
To me, the crux of the matter is this: When is it better for a student not to do any writing at all, than to write something which will receive only machine feedback? I’m not talking about an entire course or even an entire unit, but simply a very narrow learning opportunity. Obviously it depends on the quality of the machine feedback. There also comes a point when the machine feedback is so poor that it would be better not to receive that feedback at all.
I’ve written more about these ideas at http://wp.me/p3kowi-13 (also linked in the pingback above).
Laura Gibbs says
I found your blog, Norma, and I look forward to learning more; thank you! I agree that the question “When is it better for students to write short products that could potentially be assessed by automated systems?” is crucial. If a writing assignment is really just a content-mastery test in disguise – a short-answer or glorified fill-in-the-blank masquerading as a writing assignment – it will be easier for all concerned to turn it into a test instead, either multiple choice or some other machine-scored format (e.g. very short-form writing of 20 words or so), without any pretense that the rhetorical value of the writing matters, the actual communication of a writer to a reader.
The sample provided by LightSIDE definitely falls into the category of glorified short-answer: the prompt is extremely narrowly defined, students are expected to simply list a limited series of traits arising from a comparison of one map with another, and the machine is apparently rating the presence (or absence) of those comparisons in the resulting paragraph. The feedback students get will not improve their writing per se – and how could it, since the quality of the writing cannot be directly measured by the computer?
The LightSIDE system envisions peer feedback as well – and that could work, but not just as something optional or extra; it is essential. Human writing requires human readers if the quality of the writing (not just the presence or absence of expected information) is to be evaluated. There are certainly many ways in which machines can also assist in peer feedback and in self-assessment (self-assessment and revision both being essential parts of any writing process) – for example, a computer can facilitate audio recording, where students record themselves reading their own writing and then listen to it, experiencing their work in a new mode that reveals strengths and weaknesses aurally. That’s the kind of thing I would like to see – computers facilitating new forms of human engagement with writing – rather than this quixotic quest in which we pretend that computer evaluation of writing per se is comparable to human evaluation.