Whatever else you think of the election, it has been the mother of all teachable moments for many of us. It has raised questions about what we thought we knew about our democracy, our neighbors, our media…and apparently learning analytics. The shock of the polls being “wrong” has raised a lot of questions about how much we can really trust data analytics. Audrey Watters has written the most fleshed out critique that I’ve seen so far. But Dave Cormier tweeted about it as well. And I have had several private conversations along these lines. They all raise the question of whether we put too much faith in numerical analysis in general and complex learning analytics in particular. That is an excellent question. But in doing so, some of these arguments position analytics in opposition to narratives. That part is not right. Analytics are narratives. They are stories that we tell, or that machines tell, in order to make meaning out of data points. The problem is that most of us aren’t especially literate in this kind of narrative and don’t know how to critique it well.
This is going to be a wide-ranging post that goes a little lit crit at times and dives into an eclectic collection of topics from election polling to the history of medicine. But even the most pragmatic, b-school-minded entrepreneur or VC may find some value here. Because what I’m ultimately talking about is a fundamental limiter on the future growth of the ed tech industry. The value of learning analytics, and therefore the market for them, will be limited by the data and statistical literacy of those who adopt them. The companies that are focused on developing fancier algorithms are solving the wrong problem, at least for now. These tools will have limited adoption until they are put into the hands of educators who understand their uses and limitations. And we have a long way to go in that department.
On Sensing and Sense-making
Why do we collect data? To extend our senses. In an online class, we can’t physically see when students enter or leave the classroom or whether they appear to be listening to the discussion. We gather data like logins and page views as either proxies for sensing abilities that we’ve lost or supplements giving us sensing abilities that we’ve never had. But the minute we start organizing the data for analysis, we have moved from sensing to sense-making. Which is more meaningful: A student’s frequency of login or recency of login? What do those metrics tell us? Your answers to these questions are hypotheses, which are stories about how the world works. What makes hypotheses special as stories is that there has to be a test that can prove whether they are false. If you can’t test a hypothesis, then it isn’t a hypothesis.
For example, maybe you think a student’s recency of login tells you something meaningful about whether they have mentally checked out of the course and therefore are more likely to fail. Think about that last sentence. It’s a story. I see some data points representing student behaviors and I construct a narrative about what those data points might mean. I then test my hypothesis by comparing recency of login to dropout rates. If there is no correlation, that suggests my hypothesis—my story about what certain student behaviors might mean—is wrong.
If there is a correlation, however, that doesn’t prove that my theory is right. There might be other reasons why I’m not seeing a student login. Maybe there’s something wrong with my data because the student has multiple logins. Maybe the student doesn’t need to log in to do the work. Maybe our login data is just statistical noise. Maybe there is another reason we are not seeing why the same students who are not logging in also happen to be failing at higher rates. Maybe the students who haven’t logged in recently are too poor to have reliable internet access and have other problems related to their poverty that are interfering with their ability to do work. We can make up lots of different and conflicting stories that fit the data. It’s all story-telling.
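To make the competing stories concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the two groups of students, the login ranges, and the pass rates are assumptions, not real data. Within each group, login recency and passing are drawn independently, so recency explains nothing on its own. The only thing that differs is a hidden third factor (reliable versus unreliable internet access) that shifts both numbers at once.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate_group(rng, n, min_days, max_days, pass_rate):
    """Invented students: days since last login and passing are drawn
    independently of each other within the group."""
    days = [rng.randint(min_days, max_days) for _ in range(n)]
    passed = [1.0 if rng.random() < pass_rate else 0.0 for _ in range(n)]
    return days, passed

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical groups; every number here is made up for illustration.
reliable_days, reliable_passed = simulate_group(rng, 4000, 0, 5, 0.85)
unreliable_days, unreliable_passed = simulate_group(rng, 4000, 5, 20, 0.50)

r_overall = pearson(reliable_days + unreliable_days,
                    reliable_passed + unreliable_passed)
r_reliable = pearson(reliable_days, reliable_passed)
r_unreliable = pearson(unreliable_days, unreliable_passed)

print(f"all students:        r = {r_overall:+.2f}")
print(f"reliable internet:   r = {r_reliable:+.2f}")
print(f"unreliable internet: r = {r_unreliable:+.2f}")
```

Pooled together, the students show a clearly negative correlation between days since last login and passing. Split by the hidden factor, the correlation inside each group hovers around zero. The same data fit "students who stop logging in have checked out" and "students without reliable internet have other obstacles," and the correlation alone cannot tell those stories apart.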
Notice that we’re not even getting to fancy algorithms yet. We have one data point—a login—tied to a date. That’s it. Recency of login. We then test our hypothesis by gathering a bunch of time-stamped logins for different students and test it against one other type of data point—whether they passed the class. We’re hanging an awful lot of meaning on relatively little information.
A similar sort of thing happens every time we label the X and Y axes on a graph. Minutes logged in per day. Correct assessment answers per learning objective. Number of social connections per student. Whenever we assign significance to the intersection of two data dimensions, we are telling a story. One that may or may not be important or complete or true.
And just like other kinds of stories, sometimes there are resonances in our data stories that are pre-rational. Suppose I say to you that there is a 70% chance of rain tomorrow. What does that mean to you? What would you expect from the day? Would sunshine surprise you? If so, how much? Enough to say that the forecast was “wrong”?
Now let’s reverse it. Suppose I said that there is a 70% chance of sunshine tomorrow. If you’re like me, you have to think about that for a moment. Our anchor is “rain,” the adverse event that we have to prepare for. I don’t have an intuitive sense of what “a 70% chance of sunshine” means. I do have an intuitive sense of what “a 30% chance of rain” means. So already there is something a little screwy about the way context changes our sense of what a probability means.
Now suppose I told you that Hillary Clinton had a 70% chance of winning the election. This is, in fact, the probability that FiveThirtyEight.com assigned the night before the election. Was your level of surprise that Donald Trump won similar to your level of surprise on a day where the forecast is a 70% chance of rain but it turns out to be sunny? Mine wasn’t. There are lots of possible reasons for that. One might be that we’ve heard so much about the wizardry of poll aggregation and the mythos of Nate Silver that the narrative shifted our sense of confidence. But a 70% chance is a 70% chance, whether it’s of rain, sunshine, a field goal kick, or a Presidential election. Even on this basic level, our sense of the meaning of a probability is surprisingly unstable. Lacking reliable intuitions about numbers and correlations that are central aspects of data and analytics literacy, we tend to fall back on other signals of trustworthiness. Those signals may lead us astray.
The Nature of Algorithmic Story-Telling
The truth is that even the experts struggle with their intuitions once the data analysis gets even moderately complicated. Nate Silver got a lot of flak pre-election because other polling experts thought the probability he assigned to a Clinton victory was too low. FiveThirtyEight’s argument was that there could be systematic errors in state polling. For example, because Ohio and Pennsylvania have similar populations, a polling error in one state suggests that a similar error is more likely in the other. This is exactly what happened:
In addition to a systematic national polling error, we also simulate potential errors across regional or demographic lines — for instance, Clinton might underperform in the Midwest in one simulation, or there might be a huge surge of support among white evangelicals for Trump in another simulation. These simulations test how robust a candidate’s lead is to various types of polling errors.
In fact, Clinton’s Electoral College leads weren’t very robust. And the single biggest reason was because of her relatively weak polling in the Midwest, especially as compared to President Obama four years ago. Because the outcomes in these Midwestern states were highly correlated, having problems in any one of them would mean that Clinton was probably having trouble in the others, as well.
The national polling was a pretty accurate predictor of the popular vote. But we don’t elect Presidents by popular vote. The state polling, which is more predictive of the electoral college count, was off. More importantly, it was systematically off.
This too is a story. It’s not the only story we can tell about polling and the election. We can talk about two historically unpopular candidates, the undercounted silent majority, the rise of white nationalism, or any of a number of other narratives. But notice that all of these stories are layered on top of “a 70% chance.” At best, they only indirectly get at how we arrived at “70% chance” to begin with. To understand how we ended up with that number, we’d have to ask different questions. Where did the data come from? How accurate is it? Does it measure the right things? Are the calculations we make on those data based on assumptions that are in any way shaky?
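To see how much one assumption can move the number, here is a toy simulation in Python. It is emphatically not FiveThirtyEight's model. The lead, the size of the polling error, the number of states, and the number of states needed to win are all hypothetical values I picked for illustration. The only thing that changes between the two runs is whether the state polling errors are independent or share a common component.

```python
import math
import random

LEAD = 2.0       # hypothetical polling lead, in points, in every state
ERROR_SD = 3.0   # hypothetical total polling error per state, in points
STATES = 5       # hypothetical number of states in play
NEEDED = 3       # hypothetical number of those states needed to win

def upset_rate(rng, shared_fraction, trials=20000):
    """Fraction of simulated elections in which the poll leader loses.

    shared_fraction is the share of error variance common to all states:
    0.0 means fully independent state errors, while larger values mean
    the polls tend to miss in the same direction everywhere. The total
    error variance per state is the same in both cases.
    """
    shared_sd = ERROR_SD * math.sqrt(shared_fraction)
    local_sd = ERROR_SD * math.sqrt(1.0 - shared_fraction)
    upsets = 0
    for _ in range(trials):
        shared_error = rng.gauss(0.0, shared_sd)
        wins = 0
        for _state in range(STATES):
            actual_margin = LEAD + shared_error + rng.gauss(0.0, local_sd)
            if actual_margin > 0:
                wins += 1
        if wins < NEEDED:
            upsets += 1
    return upsets / trials

rng = random.Random(0)  # fixed seed so the sketch is repeatable
independent = upset_rate(rng, shared_fraction=0.0)
correlated = upset_rate(rng, shared_fraction=0.75)

print(f"upset rate with independent state errors: {independent:.1%}")
print(f"upset rate with correlated state errors:  {correlated:.1%}")
```

Same polls, same lead, same amount of error in each state. The leader loses noticeably more often when the errors move together, because one bad miss is no longer cancelled out by a lucky one elsewhere. The headline probability depends on a modeling choice that most readers never see.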
Most of us aren’t good at knowing how to evaluate if “a 70% chance” was even accurate. To know that for sure, you’d have to run the same event over and over again, like a coin flip. You can’t do that with elections. Or classes. Failing that, you’d have to have developed some pretty sophisticated intuitions about both probability in general and the specific ways in which the polling data were being gathered and processed. I, for one, do not have that level of sophistication.
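Here is what that check would look like if we could rerun events at will, sketched in Python. The forecaster in this sketch is perfectly calibrated by construction: each simulated event really does occur with the stated probability. That is an assumption built into the code, not something we ever get to know about a real forecast.

```python
import random

def observed_frequencies(rng, stated_probabilities, repeats):
    """For each stated probability, simulate `repeats` events that truly
    occur with that probability and return how often they occurred."""
    result = {}
    for p in stated_probabilities:
        hits = sum(1 for _ in range(repeats) if rng.random() < p)
        result[p] = hits / repeats
    return result

rng = random.Random(0)  # fixed seed so the sketch is repeatable
stated = [0.3, 0.5, 0.7, 0.9]

many = observed_frequencies(rng, stated, repeats=5000)
once = observed_frequencies(rng, stated, repeats=1)

for p in stated:
    print(f"forecast {p:.0%}: happened {many[p]:.1%} of 5000 times; "
          f"in a single run the observed rate was {once[p]:.0%}")
```

With thousands of repeats, the observed frequencies settle close to the stated probabilities, and we could tell a good forecaster from a bad one. With a single run, the observed rate can only be 0% or 100%, so even a perfect forecast looks either exactly right or completely wrong. One election, or one class, is the single-run column.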
But if you’re making important decisions based on algorithmic stories like “Hillary Clinton has a 70% chance of being elected President” or “this student has a 70% chance of passing the class,” then you need to have that level of literacy in order to avoid potentially nasty and consequential surprises. Those surprises, in turn, can engender a natural distrust of the information source that you feel misled you.
In the world of learning analytics, this means that complex products will not gain high levels of trust with educators or other stakeholders until those people have sufficient levels of literacy to evaluate the accuracy and value of the algorithmic story-telling that the products are providing.
Creating a Culture of Analytics Literacy
There is a precedent for this kind of professional transformation toward data literacy. Back in the 19th Century, medical science made some major advances. Germs were discovered and germ theory was developed as a new way of understanding diseases. Similar advances were made in parasitology. Technological advances like the X-ray machine and anesthesia helped this science along.
But by the early 20th Century, doctors were still not being trained as scientists. In 1910, the Carnegie Foundation published a report by Abraham Flexner which argued, among other things, that doctors needed to be trained on scientific approaches. The Flexner Report had wide-ranging consequences, not all of which were good, but one of them was to bring a huge amount of private foundation money to help make medical schools into schools that teach science. That had wide-ranging implications of its own.
Think about that fundamental fact of statistics: In order to know if your probability estimate was accurate, you need to run an experiment a whole bunch of times. In medicine, that means reporting outcomes. Did the patient live or die after the operation? If you get enough of these reports, you can begin to tell a story about how likely a patient with a particular condition is to live after a particular operation. But the idea of reporting that their patients died was anathema to physicians before they started to think of themselves as scientists. It seemed like advertising the most fundamental and painful failure possible. Before physicians, as professionals charged literally with the life and health of their patients, could trust their reputations, careers, and the lives of their patients to science- and data-driven medicine, they had to learn how it works.
Even today, I would argue that many physicians do not have high levels of this sort of literacy. If you’ve ever been to a doctor who responds to every complaint with a battery of tests without explaining why, or been bumped along a chain of different specialists because each one ran some tests, sent you a bill, shrugged his or her shoulders, and sent you on, then you’ve probably seen some doctors who don’t have good instincts about their science. They aren’t good diagnosticians because they don’t know how to critically evaluate the stories that their machines and tests and algorithms tell them.
A personal story: I have struggled with back pain for years. I’ve been to many, many specialists. In some cases I’ve been to three or four of the same kind of specialist. Everybody has their recipe for tests and treatments to try but nobody has a solution. I eventually came to the understanding that my problem has multiple interacting causes and no one practitioner or specialty has a view into all of those causes at once. But one experience on my journey was particularly revealing. About five years into my ordeal, my primary care physician discovered that I have what’s called an “episacral lipoma” or a “back mouse.” Without going into detail, it’s a benign fatty tumour that normally is harmless but can cause pain if it happens to be in the wrong place and sits on top of a nerve root. By the time my doctor found the lipoma, I had easily been to ten, maybe fifteen different specialists. He told me, “Now that I know these things exist, I’m seeing them everywhere. But they’re hard to diagnose because they don’t show up on any of the scans. In order to find them, you have to actually touch the patient’s back.”
I thought about it. None of the specialists I had seen had ever touched my back. They trusted their instruments—their analytics—to turn up any problems. When those failed, they had no intuitions to fall back on. For all their education and specialization, they lacked a certain literacy.
None of this is to say that MRIs and blood tests and the like are useless. They extend the doctor’s senses. But when the sense-making of the machines outstrips the professional’s ability to independently evaluate the stories that the algorithms are telling, there is bound to be trouble.
The Transformation of the Teaching Profession
Right now, the educational technology market is blithely barreling down the road of developing sexy, sophisticated algorithms. Setting aside the very serious and poorly attended question of student data privacy, there is an implicit assumption that if the algorithms become sophisticated enough, then the market will follow. But “sophisticated” also means “complex.” If we, as a culture, lack the basic literacy to have clear intuitions about what “a 70% chance” means, then how likely is it that we won’t have shocks that cause us to distrust our learning analytics because we didn’t understand their assumptions and limitations? These products face very serious (and completely justifiable) market risk as long as practitioners don’t understand them.
We need to transform our teaching culture into one of learning science and data science literacy. We need our educators to understand how algorithmic storytelling works and develop intuitions about data interpretation. There are two sides to this coin. On the one hand, this means developing new skills and embracing the sciences of teaching and learning. On the other hand, it means not fetishizing the instruments to the point where we no longer think to touch the patient’s back. Data should extend our senses, not be a substitute for them. Likewise, analytics should augment rather than replace our native sense-making capabilities.
If the history of medicine is any indication, this sort of transformation will take several decades, even with heavy investment. It also risks burying certain kinds of teaching intuitions even as we make other gains. Until we tackle this challenge, data literacy is going to be a fundamental limiter on the uptake of data-driven educational technology products.