Here’s a nifty video summary of a doctoral dissertation by Derek Muller that a client pointed out to me:
The basic gist is that students have preconceived notions that are wrong, and it is very hard to dislodge those mistaken notions. If you show them a video with an accurate explanation, the students will say that the video was clear and helpful, but they will misremember it as confirming their (mistaken) preconceived notions. In short, they won’t learn. In contrast, if you show them a video that starts by directly stating and then refuting their misconception, they like the video less and say it is confusing, but they actually learn more. This is a really important pedagogical point to know whether you are giving traditional in-class lectures, writing curricular materials, or creating one of those oh-so-modern video lectures that all the cool kids are into these days.
It’s also a good example of the kind of insight that big data is completely blind to. And it gives us good reason to suspect that taking large lecture courses online, turning them into REALLY large lecture courses (with nice videos), and expecting that new and more effective pedagogies will rise out of the data because, you know, science or something, is more of a hope (or a fantasy) than a plan to improve education.
Let’s say you have one of those ultra-hip MOOC platforms with a bazillion courses running on it and a Hadoop thingamabob back end that’s tied to a flux capacitor, an oscillating overthruster, and a machine that goes “ping!” You’ve got all the big data toys. And let’s say that, among the many thousands of lecture videos being used on your platform, a bunch of them are designed the way Muller’s work suggests is best practice. Some of these were done this way consciously with awareness of the research. Some were done this way on purpose but based on intuitions by classroom teachers. They don’t have a name for what they’re doing, and they don’t really think about it as a general pedagogical strategy, but they have learned from experience that there are certain spots in their courses where they have to confront some misconceptions head-on. And then some of the videos may be in the Muller format completely accidentally. For example, maybe there’s a video of students working through a problem together. The first idea they come up with is the misconception, but they talk it through together and come up with the right answer in the end. This wasn’t planned, and the teacher who posts the video may not even be aware of why this sequence of events makes the video effective. Maybe she believes in the value of watching students work through the problem together and posts lots of student conversation videos, some of which end up being in Muller’s format and some of which don’t. Let’s assume that many of these videos are effective at teaching the concepts they are trying to teach, and let’s also assume that they are effective for the reason that Muller hypothesizes.
The first question is whether our super-duper, trans-warp-capable, dilithium crystal-powered big data cluster would even identify these videos as noteworthy. The answer is maybe, but probably not reliably so. Muller set up a controlled experiment with one variable designed to test a well-formed hypothesis. He was measuring whether this style of video was more effective than the alternative of a more traditional lecture delivery. In science, this is called a “control of variables strategy.” In product development, it’s called “A/B testing” or “split testing.”
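For the concrete-minded, here is a minimal sketch of what a split test boils down to: two groups that differ in exactly one thing (which video they watched), and a check on whether the gap in their scores is bigger than chance would explain. The scores below are made up for illustration and are not Muller's data, and the permutation test is just one reasonable way to make the comparison, not necessarily the analysis he used. The names `permutation_test`, `exposition_scores`, and `refutation_scores` are mine.

```python
import random


def permutation_test(control, treatment, n_resamples=10_000, seed=0):
    """Return (observed difference in means, one-sided p-value).

    The p-value is the share of random relabelings of the pooled scores
    whose difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_control = len(control)
    n_treatment = len(treatment)
    observed = sum(treatment) / n_treatment - sum(control) / n_control

    pooled = list(control) + list(treatment)
    at_least_as_large = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        fake_control = pooled[:n_control]
        fake_treatment = pooled[n_control:]
        diff_i = sum(fake_treatment) / n_treatment - sum(fake_control) / n_control
        if diff_i >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_resamples


# Hypothetical post-test scores out of 10 (invented, not real results).
exposition_scores = [5, 6, 4, 7, 5, 6, 5, 4, 6, 5]  # plain, clear explanation
refutation_scores = [8, 7, 9, 6, 8, 9, 7, 8, 7, 9]  # misconception stated, then refuted

diff, p_value = permutation_test(exposition_scores, refutation_scores)
print(f"difference in mean score: {diff:.2f}, one-sided p = {p_value:.4f}")
```

The point of the sketch is what is missing from it: there is no mining step. Somebody had to decide in advance that "refutes the misconception first" was the variable worth isolating.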
Big data usually doesn’t work that way. Instead of creating a tightly controlled set of conditions, it usually looks at what’s available “in the wild” and relies on the massive numbers of examples it has plus the power of computers to do lots of comparisons really fast to come up with inferences. Let’s say, for example, that you’re a medical researcher trying to figure out the role of genetics in a particular type of cancer. There are many, many genes that could be involved, and it may be that a bunch of them are involved but interact in complex ways. And, of course, environmental factors such as diet or exposure to carcinogens, as well as a certain amount of chance, can all impact whether a particular individual gets cancer. The good news is that, while there are many variables, they are finite in number, mostly known and measurable, and mostly have a quantifiable and reasonably regular impact on the cancer outcome (if you understand all the interactions sufficiently well). If you have a large enough database of patients with enough genetic material and good details on the non-genetic factors that you think probably contribute to the likelihood that they will get cancer, then a big data approach will probably help. There are regular patterns in the data. The main challenge is sifting through the mountains of data to find the patterns that are already there. Big data is good for that kind of problem.
But education doesn’t work that way. The same video may impact different students very differently, due to variables that mostly aren’t in our computer systems. For one thing, classes can be taught in many, many different ways, some of which matter and some of which don’t. Again, if we were doing a split test in a MOOC context, we could control the variables and see what happens when you change just one video for a class that is otherwise the same for many students. That approach has significant research value, but it’s not big data magic. It’s educators who come up with hypotheses and test them using a large data set. Students are also very different, in important ways that often don’t show up in the data that we have in our online systems. Silicon Valley is not going to make us magically smarter about teaching.
Now, big data enthusiasts might argue that I’m not thinking big enough in terms of the data set, and that could make a difference. Knewton, for example, claims that their system can track students across courses and semesters and test hypotheses about them over time. For example, suppose a student is struggling with word problems in a math class. It’s possible that the student is having difficulty translating English into math variables, or trouble identifying the important variables in the first place. Those are both math-related issues. But it’s also possible that the student just has poor English decoding skills in general. Knewton claims that their system can hold all of these hypotheses about the student and then test them (presumably using some sort of Bayesian analysis) across all the courses. If there is evidence in the English class that the student is struggling with basic reading, then that hypothesis gets elevated. And maybe that student gets extra reading lessons slipped in between math lessons. It sounds really cool. I haven’t seen evidence that it actually works yet, and to the degree that it does, it raises other questions about whether you need all student educational interactions to be on the platform in order to get the value, who owns the data, and so on. Put this one in the “maybe someday” category for now.
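To make "that hypothesis gets elevated" a little less hand-wavy, here is a toy sketch of a single Bayesian update over the three explanations above. To be clear, this is my guess at the general idea, not Knewton's actual method, and every number in it is invented: the priors, the likelihoods, and the hypothesis labels are all assumptions for illustration.

```python
def bayes_update(prior, likelihood):
    """Apply Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


# Three competing explanations for trouble with math word problems,
# starting out equally plausible (an assumption).
prior = {
    "translating_english_to_math": 1 / 3,
    "identifying_important_variables": 1 / 3,
    "general_reading_difficulty": 1 / 3,
}

# Assumed probability, under each explanation, that the same student also
# shows trouble with basic reading in an English class (invented numbers).
likelihood_of_reading_trouble = {
    "translating_english_to_math": 0.2,
    "identifying_important_variables": 0.2,
    "general_reading_difficulty": 0.8,
}

posterior = bayes_update(prior, likelihood_of_reading_trouble)
for hypothesis, probability in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis}: {probability:.2f}")
```

Notice that the update only works because the evidence from the English class is in the system at all, which is exactly the "all interactions on the platform" question raised above.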
But even granting that you can get sufficiently rich information about the students, there’s another hard problem. Let’s say that, thanks to the upgrade in your big data infinite improbability drive made possible by your new Spacely’s space sprocket, your system is able to flag at least a critical mass of videos taught in the Muller method as having a bigger educational impact on the students than the average educational video by some measure you have identified. Would the machine be able to infer that these videos belong in a common category in terms of the reason for their effectiveness? Would it be able to figure out what Muller did? There are lots of reasons why a video might be more effective than average. And many of those reasons are internal to the narrative structure of the video. The machine only knows things like the format of the video, the length, what kind of class it’s in, who the creator is, when it was made, and so on. Other than the external characteristics of the video file, it mostly knows what we tell it about the contents. It has no way to inspect the video and deduce that a particular presentation strategy is being used. We are nowhere close to having a machine that is smart enough to do what Muller did and identify a pattern in the narrative of the speaker. Now, if an educational researcher were to read Muller’s research, tag a critical mass of the relevant videos in the system as being in this style, and ask the machine to find other videos that might be similar, it’s possible that big data could help. It might come back with something like, “Here are some videos that seem to have roughly the same kind and size of effect on test scores as the ones with the Muller tag.” Maybe. Even then, you’d have to have human researchers go through the videos the computer flagged (and there might be a lot of them) to see which ones really use the same strategy and which ones don’t. That would be better than nothing, but it’s far from magic.
By the way, the low-tech method commonly used now is even worse. Not only is it useless, it’s actually harmful. A/B tests are rarely done on curricular materials, but surveys and focus groups where students self-report the effectiveness of the materials are common, particularly among textbook publishers. And in that situation, the videos that the students report to be harder and more confusing would actually be the more effective ones. But, lacking any measure of the videos’ real effect on learning other than the survey, the publishers (or teachers) generally would toss out the more effective videos in favor of the less effective ones.
Whether we’re talking about machine learning or human learning about how to improve education, the real problem is that we don’t have a vocabulary to talk about these teaching strategies, so we can’t formulate, test, and independently verify our hypotheses. In the machine learning example, we could create an arbitrary “Muller” tag in the system, but we don’t have a common language among teachers where we say “Oh, yeah, he’s using the confront-the-misconceptions (CTM) lecture strategy for that one. I prefer doing a predict-observe-explain (POE) experiment to accomplish the same thing.” If we had a widely adopted language that describes the details of why instructors think a particular aspect of their lecture or their discussion prompt or their experiment assignment is effective at teaching, then big data could be helpful because we could tag all our videos with pedagogical descriptions. We could make our theories about teaching and learning visible to the system in a way that it would be more able to test. And, perhaps even more importantly, human researchers could be more effective at collaborating with each other on testing theories of teaching and learning. Right now, what we’re trying to do is a little like trying to conduct physics research before somebody has invented calculus. You can do some things around the edges, but you can’t describe the really important hypotheses about causes and effects in learning situations with any precision. And if you can’t describe them with precision, then you can’t test them, and you certainly can’t get a machine to understand them.
More on this in a future post.