EdSurge has a good piece up about the public filing submitted by the Scholarly Publishing and Academic Resources Coalition (SPARC) with the U.S. Department of Justice opposing the merger between Cengage and McGraw-Hill. In addition to the expected fare about pricing and reduced competition, there is a surprisingly thorough argument about the dangers of the merger creating an "enormous data empire."
Given that the topic at hand is an anti-trust challenge with the DoJ, I'm going to raise my conflict of interest statement from its normal place in a footnote to the main text: I do consulting work for McGraw-Hill Education and have consulting and sponsorship relationships with several other vendors in the curricular materials industry. For the same reason, I am recusing myself from providing an analysis of the merits of SPARC's brief.
Instead, I want to use the data section of their brief as a springboard for a larger conversation. We don't often get a document that enumerates such a broad list of potential concerns about student data use by educational vendors. SPARC has a specific legal burden that they're concerned with. I'll briefly explain it, but then I'm going to set it aside. Again, my goal is not to litigate the merits of the brief on its own terms but rather explore the issues it calls out without being limited by the antitrust arguments that SPARC needs to make in order to achieve their goals.
Let's break it down.
When is bigger worse?
While I'm sure that SPARC's concerns about the data are genuine, keep in mind that they have been fighting a long-running battle against textbook prices, and that the primary framing of their brief is about the future price of curricular materials. Their goal is to prevent the merger from going through because they believe it will be bad for future prices. Every other argument they introduce in the brief, including the data arguments, is there at least in part because they believe it will strengthen their overall case that the merger will cause, in legal parlance, "irreparable harm." So that has to be the standard for them. The question is not whether we should be worried about misuse of data in general, but whether this merger of the data pools of two companies makes the situation instantly worse in a way that can't be undone. That's a pretty high bar. Each of their data arguments needs to be considered in light of that standard.
But if you're more concerned with the issues of collecting increasingly large pools of student data in general, and if you can consider solutions other than "stop the merger," then there is a more nuanced conversation to be had. I'm more interested in provoking that conversation.
What can be inferred from the data
One question that we're going to keep coming back to throughout the post is just how much can be gleaned from the data that the publishers have. This is a tough question to answer for a number of reasons. First, we don't know exactly what all the publishers are gathering today. SPARC's brief doesn't provide us with much help here; they don't appear to have any inside information, or even to have spent much time gathering publicly available information on this particular topic. I have a pretty good idea of what publishers are collecting in most of their products today, but I certainly don't have comprehensive knowledge. And it's a moving target. New features are being added all the time. I can speak much more confidently about what is being gathered today than about what may be gathered a year from now. The further out in time you go, the less sure you can be. Finally, while publishers—like the rest of us—have thus far proven to be relatively bad at extrapolating useful holistic knowledge about students from the data they tend to have, that may not always prove to be the case. So with those generalities in mind, let's look at SPARC's first claim:
Like most modern digital resources, digital courseware can collect vast amounts of data without students even knowing it: where they log in, how fast they read, what time they study, what questions they get right, what sections they highlight, or how attentive they are. This information could be used to infer more sensitive information, like who their study partners or friends are, what their favorite coffee shop is, what time of day they commute from home to school, or what their likely route is.
How much of the "more sensitive information" that SPARC claims can be inferred is really worth fearing right now? Most of the scary stuff they speculate about here is location-related. Unless the application specifically asks for the student's permission to use geolocation and the student grants it—I'm sure you've had web pages ask your permission to know your location before—the best it can do is know the student's IP address, which is a pretty crude location method. None of the place-based information is really accessible via any data that is collected through any courseware that I'm aware of today. The only exception I know of is attendance-taking software. And it's an open question how much additional privacy risk there is in knowing the attendance habits of students who are already known to have registered for a class by virtue of the fact that they are using the curricular materials associated with it.
The other risk SPARC references specifically is knowledge of social connections. There are products that do facilitate the finding of study partners. Actually, the LMS market, which is roughly as concentrated as the curricular materials market, may have much more exposure to this particular concern.
While I certainly wouldn't want these data to be leaked by the stewards of student learning information, I suspect there is much better data of this sort that is more easily obtainable from other sources. Even in the worst case, if these data were misappropriated and merged with consumer data sets, the incremental value of this information relative to what someone with ill intent could learn from the average person's social media activity strikes me as pretty limited.
Of course, the information value is a separate question from the responsibility of care. Students are responsible for the information that they post on their social media accounts. Educators and educational institutions have a responsibility of care for data in products that they require students to use. That said, we should think about both the responsibility of care and the sensitivity of particular data. Generally speaking, I don't see the kind of location and personal association data that publisher applications are likely to have as particularly sensitive.
Anyway, continuing with SPARC's brief:
“We now have real time data, about the content, usage, assessment data, and how different people understand different concepts,” said Cengage CEO Michael E. Hansen in an interview with Publishers Weekly. McGraw-Hill claims that its SmartBook program collects 12 billion data points on students. Pearson now allows students to access its Revel digital learning environment through Amazon's Alexa devices—which have been criticized for gathering data by “listening in” on consumers.
Once gathered, these millions of data points can be fed into proprietary algorithms that can classify a student’s learning style, assess whether they grasp core concepts, decide whether a student qualifies for extra help, or identify if a student is at risk of dropping out. Linked with other datasets, this information might be used to predict who is most likely to graduate, what their future earnings might be, how a student identifies their race or sexual orientation, who might be at risk of self-harm or substance abuse, or what their political or religious affiliation might be. While these types of processes can be used for positive ends, our society has learned that something as seemingly innocent as an online personality test can evolve into something as far-reaching as the Cambridge Analytica scandal. The possibilities for how educational data could be used and misused are endless.
I realize that this is a rhetorical flourish in a document designed to persuade, but no, the possibilities really aren't endless. If you can't train a robot tutor in the sky by having it watch you solve more geometry problems, then you can't bring Skynet to sentience that way either. I don't want to minimize real dangers. Quite the opposite. I want to make sure we aren't distracted by imaginary dangers so that we can focus on the real ones.
I'm particularly concerned by the Cambridge Analytica sentence. "Something as seemingly innocent as an online personality test can evolve into something as far-reaching...". The implication seems to be that Cambridge Analytica inferred enormous amounts of information from an online personality test. But that's not what happened. The real scandal was that Cambridge Analytica used the personality test to get users to grant them permission to enormous amounts of other data in their profile. The kind of deeply personal data that people put in Facebook but don't tend to put in their online geometry courseware. I don't see how that applies here.
Of course, the data that these companies collect in the future may change, as may our ability to infer more sensitive insights from it. Writ large, we don't have to make the kind of cut-and-dry, snapshot-in-time decision that a legal brief necessarily advocates. Rather than making a binary choice between either blithely assuming that all current and future uses of student educational data in corporate hands will be fine or assuming the dystopian opposite and denying students access to technology that even SPARC acknowledges could benefit them, the sector should be making a sustained and coordinated investment in student data ethics research. As new potential applications come online and new kinds of data are gathered, we should be proactively researching the implications rather than waiting until a disaster happens and hoping we can clean up the mess afterward.
Data permission creep
SPARC next goes on to argue that since (a) students are a captive audience and essentially have no choice but to surrender their rights if they want to get their grades, (b) professors, who would be the ones in a position to protect students' rights, don't have a good track record of protecting them from textbook prices, and (c) nobody has a good track record of reading EULAs before clicking away their rights, there is a good chance that, even if the data rights students agree to give away are reasonable today, they will creep into unreasonableness in the future:
Students are not only a “captive market” in terms of the cost of textbooks, they are a captive market in terms of their data. The same anticompetitive behavior that arose in the relevant market for course materials is bound to repeat itself in the relevant market for student data.
Therefore, there is potential for publishers to inflate the permissions they require students to grant in exchange for using a digital textbook in the same way that they have inflated prices through coordinated behavior. Students will not only be paying in dollars and cents, but also in terms of their data.
I find the permissions creep argument to be compelling for several reasons. First, the question of whether people should have a right to control how their data are used is separable from the question of known harm that abuse of those data could cause. Students should have the right to say how their data can be used and shared, regardless of whether that use is deemed harmful by some third party.
Second, there is an argument that SPARC missed here related to human subjects research. Currently, universities are required by law to get any experimentation with human subjects, including educational technology experiments, approved by an IRB. This includes, but is not limited to, a review of informed consent practices. Companies have no such IRB review requirement under current law. Companies with more data, more platforms, and bigger research departments can conduct more unsupervised research on students. For what it's worth, my experience is that companies that do conduct research often try to do the right thing. But that should be small comfort, for a number of reasons.
First, there is no generally agreed upon definition of what "the right thing" is, and it turns out to be very complicated. When is an activity research "on" students, and when is it "on" the software? If, for example, you move a button to test whether doing so makes a feature easier to find, but awareness of that feature turns out to make a difference in student performance, then would the company need IRB approval? If the answer is "yes," and "IRB approval" for companies looks anything remotely like what it does inside universities today, then forget about getting updated software of any significance any time soon. But if the answer is "no," then where is the line, and who decides? There is basically no shared definition of ethical research for ed tech companies and no way to evaluate company practices. This is not only bad for the universities and students but also for the companies. How can they do the right thing if there is no generally accepted definition of what the right thing is?
Second, if IRB approval specifically means getting the approval of one or more university-run IRBs, and particularly if it means getting the approval of the IRB of every university for every student whose data will be examined, universities have not yet made that remotely possible to accomplish. Nor could they handle the volume. I believe that we do need companies to be conducting properly designed research into improving educational outcomes, as long as there is appropriate review of the ethical design of their studies. Right now, there is no way of guaranteeing both of these things. That is not the fault of the companies; it's a flaw in the system.
Fixing the student privacy permission problem would be hard to do in a holistic way. Some further legislation could potentially help, but I'm not at all confident that we know what that legislation should require at this point. I've written before about how federated learning analytics technical standards like IMS Caliper could theoretically enable a technical solution by enabling students to grant or deny permission to different systems that want access to their data, similarly to the way in which we grant or deny access to apps that want access to data on our phones. But that would be a long and difficult road. This is a tough nut to crack.
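To make the phone-permissions analogy concrete, here is a minimal sketch of what student-controlled data permissions could look like. Everything here is hypothetical: the names (`DataScope`, `ConsentLedger`) are illustrative and are not part of IMS Caliper or any real product's API.

```python
# Hypothetical sketch: students grant or revoke access to categories of
# their learning data per system, much as phone OSes gate app access to
# sensors. Not a real standard or API; names are invented for illustration.
from enum import Enum, auto

class DataScope(Enum):
    ASSESSMENT_RESULTS = auto()  # scores, right/wrong answers
    USAGE_EVENTS = auto()        # logins, time-on-task, highlights
    LOCATION = auto()            # coarse, IP-derived location

class ConsentLedger:
    """Tracks which systems a student has allowed to read which data scopes."""
    def __init__(self):
        self._grants = {}  # (student_id, system_id) -> set of DataScope

    def grant(self, student_id, system_id, scope):
        self._grants.setdefault((student_id, system_id), set()).add(scope)

    def revoke(self, student_id, system_id, scope):
        self._grants.get((student_id, system_id), set()).discard(scope)

    def is_allowed(self, student_id, system_id, scope):
        return scope in self._grants.get((student_id, system_id), set())

ledger = ConsentLedger()
ledger.grant("s1", "courseware", DataScope.ASSESSMENT_RESULTS)
ledger.is_allowed("s1", "courseware", DataScope.ASSESSMENT_RESULTS)  # True
ledger.is_allowed("s1", "courseware", DataScope.LOCATION)            # False
```

The hard part, of course, is not the bookkeeping but getting every system that touches student data to check such a ledger before reading or sharing anything, which is why a federated standard would be a long road.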
The research problem is also tough, but not quite as tough as the privacy permission problem. I've been speaking to some of my clients about it in an advisory capacity and working on it through the Empirical Educator Project. It is primarily a matter of political will at this point, and the pressure to solve this problem is rising on all sides.
More data means more privacy risk
For our purposes, I won't quote the entirety of SPARC's argument on this topic, but here's the nub of it:
It is common sense that the more data a company controls, the greater the risk of a breach. Recent experience demonstrates that no company can claim to be immune to the risk of data breaches, even those who can afford the most updated security measures. The size or wealth of a company has proven no obstacle to potential hackers, and in fact larger companies may become more tempting targets. Allowing more student data to become concentrated under a single company’s control increases the risk of a large scale privacy violation.
As a case in point, Pearson recently made the news for a major data breach. According to reports, the breach affected hundreds of thousands of U.S. students across more than 13,000 school and university accounts. Pearson reports that no social security numbers or financial information was compromised, but this is not the only kind of data that can cause damage. Compromising data on educational performance and personal characteristics can potentially affect students for the rest of their lives if it finds its way to employers, credit agencies, or data brokers.
While state and federal laws provide some measure of privacy protection for student records, including limiting the disclosure of personally identifiable information, they do not go far enough to prevent the increased risk of commercial exploitation of student data or protect it from potential breaches.
While we should be very concerned about student data privacy, I don't think the number of data points an education company has about a student is a good measure of the threat level. Again, a merged Cengage/McGraw-Hill would not have the same kind of data that Facebook would. We have to think very specifically about these data because they are quite different from data on the consumer web. The number of hints a student asked for in a psychology exercise or the number of algebra problems a student solved do not strike me as data that are particularly prone to abuse. These sorts of information bits comprise the bulk of the data that such companies have in their databases today. There may very well be extremely serious data privacy issues lurking here, but they will not be well measured by the volume of data collected (in contrast with, say, Google).
The point about the gaps in the laws is a much more serious one. Everybody has known for years, for example, that FERPA is badly inadequate. It is only getting worse as it ages. The Fordham paper cited by SPARC has some good suggestions. Now, if only we had a functioning Congress....
Black box algorithms
Again, I'll excerpt the SPARC filing for our purposes:
Algorithms are embedded in some digital courseware as well, including the “adaptive learning” products of the merging companies and some of their competitors. These algorithms can be as simple as grading a quiz, or as complex as changing content based on its assessment of a student’s personal learning style....
While algorithms can produce positive outcomes for some students, they also carry extreme risks, as it has become increasingly clear that algorithms are not infallible. A recent program held at the Berkman Klein Center for Internet and Society at Harvard University concluded categorically that “it is impossible to create unbiased AI systems at large scale to fit all people.” Furthermore, proprietary algorithms are frequently black boxes, where it is impossible for consumers to learn what data is being interpreted and how the calculations are made—making it difficult to determine how well it is working, and whether it might have made mistakes that could end in substantial legal or reputational consequences.
Let's disambiguate a little here. There are two senses in which an algorithm could be considered a "black box." Colloquially, educators might refer to an adaptive learning or learning analytics algorithm that way if they, the educators using it, have no way of understanding how the product is making the recommendations. If an algorithm is proprietary, for example, the vendor might know why the algorithm reaches a certain result, but the educator—and student—do not.
Within the machine learning community, "black box" means something more specific. It means that the results are not explainable by any humans, including the ones who wrote the algorithm. In certain domains, there is a known trade-off between predictive accuracy and the human interpretability of how the algorithm arrived at the prediction.
Both kinds of black boxes are very serious problems for education. In my opinion, there should be no tolerance for predictive or analytic algorithms in educational software unless they are published, peer reviewed, and preferably have replicated results by third parties. Educators and qualified researchers should know how these products work, and I do not believe that this is an area where the potential benefits of commercial innovation outweigh the potential harm. Companies should not compete on secret and potentially incorrect insights about how students learn and succeed. That knowledge should be considered a public good. Education companies that truly believe in their mission statements can find other grounds for competitive advantage. This is another area that EEP is doing some early work on, though I don't have anything to announce on it just yet.
The second kind of black box—algorithms that are published and proven to work but are not explainable by humans—should be called out as such and limited to very specific kinds of low-stakes use like recommending better supplemental content from openly available resources on the internet. We should develop a set of standards for identifying applications in which we're confident that not understanding how the algorithm arrives at its recommendation does not introduce a substantial ethical risk and does produce substantial educational benefit. If the affirmative case can't be made, then the algorithm shouldn't be used.
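The two-part test in the last two paragraphs can be sketched as a simple policy check. This is a thought experiment, not a real standard: the type and field names are invented for illustration, and the point is only that the decision logic is small enough to write down explicitly.

```python
# Illustrative sketch (not an actual standard): encoding the affirmative-case
# test from the text. Secret algorithms are never acceptable; transparent,
# reviewed ones are; unexplainable ones are limited to low-stakes uses with
# demonstrated benefit. All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class AlgorithmProfile:
    explainable: bool          # can humans interpret how it reaches results?
    peer_reviewed: bool        # published, reviewed, ideally replicated
    low_stakes: bool           # e.g., recommending supplemental content
    demonstrated_benefit: bool # affirmative case for educational benefit

def may_deploy(profile: AlgorithmProfile) -> bool:
    if not profile.peer_reviewed:
        return False  # secret and unreviewed: no tolerance
    if profile.explainable:
        return True   # published and interpretable: acceptable
    # True black box: only low-stakes uses where benefit has been shown.
    return profile.low_stakes and profile.demonstrated_benefit
```

The real work, obviously, is in filling out a profile like this honestly, which is exactly what a shared set of standards would need to specify.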
Combining data sets
I'm going to be a little careful with this one because, again, I am recusing myself from commenting on the merits of the brief, and this particular data topic is the hardest to address while skirting the question before the DoJ. But I do want to make some light comments on the broader question of when combining different educational data sets is most potent and therefore most vulnerable to abuse.
One lesson learned from the rise of technology giants like Facebook is that preventing platform monopoly from forming is far simpler than breaking one up. Given the vast quantity of data that the combined firm would be in a position to capture and monetize, there is a real potential for it to become the next platform monopoly, which would be catastrophic for student privacy, competition, and choice.
For decades, the college course material market has been split between three giants. There is a large difference between a market split three ways and a market split two ways. As these companies aggressively push toward digital offerings and data analytics services, a divided market will limit the size and comprehensiveness of the datasets they are able to amass, and therefore the risk they pose to students and the market. So long as publishers are competing to sell the best products to institutions, there is significantly less risk of too much student data ending up in one company’s hands.
I won't characterize the danger of combining publisher data sets beyond what I've already covered in this post. What I want to say here is that the bigger opportunity for potential insights, and therefore the bigger area of concern for potential abuse, may be when combining data sets from different kinds of learning platforms. I haven't yet seen evidence that combining data across courseware subjects yields big gains in understanding regarding individual students. But when you combine data from courseware, the LMS, clickers, the SIS, and the CRM? That combination of data has great potential for both benefit and harm to students because it provides a much richer contextual picture of the student.
While nothing in this post is intended to comment directly on the matter before the DoJ, the phrase that frames the anti-trust argument—"irreparable harm"—is one that we should think about in the larger context. I believe we have an affirmative obligation to students to develop and employ data-enabled technologies that can help them succeed, but I also believe we have an affirmative obligation to proceed in a way that prioritizes the avoidance of doing damage that can't be undone. "First, do no harm." We should be putting much more effort into thinking through ethics, designing policies, and fostering market incentives now. I don't see it happening yet, and it's not even entirely clear to me where such efforts would live.
That should trouble us all.