I had a conversation with a friend last night about the counter-intuitive challenges of working with AI/ML and the implications for using them in EdTech. I decided it might be useful to share my thinking more broadly in a blog post.
Essentially, I see three stages in working with artificial intelligence and machine learning (AI/ML). I call them the miracle, the grind, and the wall. These stages can have implications for both how we can get seduced by these technologies and how we can get bitten by them. The ethical implications are important.
One challenge with AI/ML is how deceptively easy it is to produce mind-blowing demos with these tools. For example, I spent some time playing with GPT-3 as a learning exercise. GPT-3 is one of several gigantic AI models that can do some pretty miraculous things with natural language. (Google has the other prominent gigantic model.) One reason I started with GPT-3 is that it can be programmed using natural language. For example, you can tell it, “You are a helpful chatbot that teaches first-year college students about philosophy” and voila! You have a helpful philosophy-teaching chatbot.
It’s not quite that simple. For example, I found that GPT-3’s idea of what a first-year college student understands about philosophy differed from mine. I got better results when I asked it to target 11th grade. I couldn’t have known that in advance. GPT-3 is a neural network of 175 billion parameters. It was trained on large swathes of the internet and many books. But it doesn’t store all that information, exactly. It distills it in very complex ways. In fact, GPT-3 and similar models are so complex that even the programmers who made them can’t explain why they produce specific responses to instructions or questions. So I had to figure out how to “program” my chatbot through a bit of trial and error.
Not all AI/ML algorithms are this complex. Some of them are much easier to understand. It’s a spectrum, and GPT-3 is on the far end of that spectrum.
Anyway, after a few days of intermittent tinkering, I was able to produce a chatbot that could carry out a sustained and informative conversation about David Hume’s epistemology, to the point where it gave me new insights into the subject. I accomplished this as a layperson, using plain English.
I had reached the miracle stage of AI/ML.
But there were problems. First of all, the chatbot would end the conversation just when it got really interesting. It suddenly would insist on saying goodbye and could not be persuaded to continue talking. It turns out that this kind of AI model has a strict memory limitation. When you hit it, the chatbot suddenly forgets your entire conversation. My new philosophy tutor friend also sometimes gave weird answers. I knew when to ignore them but they would have confused some students.
GPT-3 has a community of developers who are incredibly helpful, particularly when the topic is something idealistic like education. I was able to find a very knowledgeable programmer who was generous with his time and helped me understand what I would need to do in order to take my tutor to the next level.
The first thing I’d need to do is learn to program in Python since I had reached the limits of programming in plain English. And for the 9,781st time, I was momentarily tempted to learn a little programming. But then he explained what I’d need to do.
For the memory problem, I’d need to chain portions of the conversation together. But since GPT-3 doesn’t actually remember large chunks of information so much as it distills them, the chatbot wouldn’t literally be able to recall our entire conversation. You can quickly see where this could become problematic. If the student says something like, “When you said earlier that…”, it’s hard to predict how the chatbot would respond.
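To make the chaining idea concrete, here is a minimal sketch of one way it could work: keep the instruction, then add back as many of the most recent turns as fit under a budget. Everything here is my own assumption for illustration. The function name `build_prompt` is hypothetical, the budget counts words as a crude stand-in for the model's real limit, and no actual GPT-3 call is made.

```python
def build_prompt(instruction, turns, max_words=26):
    """Return the instruction plus the most recent turns that fit the budget.

    `turns` is a list of (speaker, text) pairs, oldest first. Words are a
    rough stand-in for the model's real limit (an assumption of this sketch).
    """
    budget = max_words - len(instruction.split())
    kept = []
    # Walk backwards so the newest turns are kept first.
    for speaker, text in reversed(turns):
        line = f"{speaker}: {text}"
        cost = len(line.split())
        if cost > budget:
            break  # Everything older than this turn is dropped.
        kept.append(line)
        budget -= cost
    kept.reverse()  # Restore chronological order.
    return "\n".join([instruction] + kept)


instruction = "You are a helpful chatbot that teaches philosophy."
turns = [
    ("Student", "What did Hume think about causation?"),
    ("Tutor", "He argued we only observe constant conjunction."),
    ("Student", "When you said earlier, what did you mean?"),
]
prompt = build_prompt(instruction, turns)
print(prompt)
```

With this budget, the student's opening question about causation falls out of the prompt, so the chatbot is asked what it meant "earlier" without seeing what the conversation was originally about. That is the kind of gap that makes the response hard to predict.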
And so we enter the grind phase. It could also be called the whack-a-mole phase, since you’re finding a problem, writing a solution, and then looking for unintended consequences elsewhere. And since neither the model nor the questions students will ask are knowably deterministic, it’s probably impossible to test all the possible scenarios.
Which is why you don’t see chatbots that are this open-ended and ambitious. Translating that initial miracle into a reliable response is a daunting if not impossible task. Today’s chatbot designers use UX and context tricks to make the inputs from the students more predictable and they also use less complex algorithms with outputs that they can predict and debug more easily. They tend to reserve the usage of models like GPT-3 for limited and specific applications. And even when they’re careful, producing a chatbot that is rock-solid reliable takes a lot of hard work, including difficult debugging that’s often quite different from debugging traditional software.
This brings us to the final phase: The wall.
Sooner or later you reach the limit of what your tech can do for you. Predicting that limit in advance takes tremendous skill, often requiring extensive domain knowledge of both the tech itself and the problem it’s being applied to, whether that’s detecting manufacturing defects, discovering new drugs, or tutoring students. There are always nooks and crannies of knowledge and skill that are a poor match for the technology’s capabilities or the data it can access.
Think about spelling and grammar checkers. They’ve been around since 1961, believe it or not. Even as recently as five or six years ago, Microsoft Word’s spell checker was so bad that I always turned it off. Today, I use Grammarly Pro, which checks spelling and grammar and now even makes suggestions on effective sentence structures. I love it. It makes me a better writer.
But it still makes mistakes in spelling. It makes more mistakes in grammar. And its writing style suggestions, while pretty good, make the sentence worse or even change its meaning fairly often. The reasons for these limitations are often not obvious to the layperson. For example, Wikipedia notes this eye-opening fact about spell checkers:
It might seem logical that where spell-checking dictionaries are concerned, “the bigger, the better,” so that correct words are not marked as incorrect. In practice, however, an optimal size for English appears to be around 90,000 entries. If there are more than this, incorrectly spelled words may be skipped because they are mistaken for others. For example, a linguist might determine on the basis of corpus linguistics that the word baht is more frequently a misspelling of bath or bat than a reference to the Thai currency. Hence, it would typically be more useful if a few people who write about Thai currency were slightly inconvenienced than if the spelling errors of the many more people who discuss baths were overlooked. (Wikipedia)
The tech has a non-obvious fundamental limitation. And in this case, not only is more data not better; more data is worse. So the idea that all AI/ML problems can be fixed with big data is flat-out false. Sometimes better data is better.
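The baht example is easy to reproduce with a toy checker. This is only a sketch under a big assumption: it flags a word purely by whether it appears in the dictionary. Real spell checkers are more sophisticated than this. The function name `flag_misspellings` and the two tiny dictionaries are made up for illustration.

```python
def flag_misspellings(text, dictionary):
    """Return the words in `text` that are not in `dictionary`."""
    return [word for word in text.lower().split() if word not in dictionary]


# A small dictionary that does not know the Thai currency.
small_dictionary = {"i", "took", "a", "hot", "bath"}
# A "bigger is better" dictionary that adds the rare word.
large_dictionary = small_dictionary | {"baht"}

sentence = "I took a hot baht"  # The writer almost certainly meant "bath".

flagged_small = flag_misspellings(sentence, small_dictionary)
flagged_large = flag_misspellings(sentence, large_dictionary)
print(flagged_small)
print(flagged_large)
```

The smaller dictionary catches the typo. The larger one lets it through, because "baht" is now a legitimate word as far as the checker can tell.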
Grammar is significantly more complex than spelling and writing for clarity is significantly more complex than grammar. Each of these functions will likely hit a wall. It might not be a permanent wall, since technology improves over time. But it might be, since sometimes the limitation isn’t the tech but the nature of the problem or the data available in a form that is accessible to the tech.
The rush I felt when I learned something about philosophy from the chatbot I wrote myself is indescribable. I was a philosophy major with a particular interest in anything related to the mind or knowledge. While I don’t have an advanced degree, I certainly knew something about David Hume’s epistemology when I started the dialogue. I was certain I was seeing the future.
And maybe I was. But it isn’t the near future. When I think about the much more mature technology of the grammar checker, I wouldn’t trust it with weak writers, and certainly not with ESL students. The checker would be more prone to make mistakes and the students would be less likely, on average, to have the confidence and knowledge necessary to know when to ignore the machine. In order for me to change my mind, I’d want to see some quality IRB-approved, peer-reviewed studies showing that grammar checkers help these students rather than harm them.
We’re in a heady moment with AI/ML. I see a lot of projects rushing headlong into heavy use of the tech, often putting it into production with students without the kind of careful oversight necessary to fulfill the EdTech Hippocratic oath: First, do no harm.