In my last post, I explained how generative AI memory works and why it will always make mistakes without a fundamental change in its foundational technology. I also gave some tips for working around that problem so that imperfect AI can be incorporated safely and productively into EdTech (and other uses). Today, I will draw on that memory issue as a case study of why embracing our imperfect tools also means recognizing where they are likely to fail us and thinking hard about dealing realistically with their limitations.
This is part of a larger series I’m starting on a term of art called “product/market fit.” The simplest explanation of the idea is the degree to which the thing you’re building is something people want and are willing to pay the cost for, monetary or otherwise. In practice, achieving product/market fit is complex, multifaceted, and hard. This is especially true in a sector like education, where different contextual details often create the need for niche products, where the buyer, adopter, and user of the product are not necessarily the same, and where measurable goals to optimize your product for are hard to find and often viewed with suspicion.
Think about all the EdTech product categories that were supposed to be huge but disappointed expectations. MOOCs. Learning analytics. E-portfolios. Courseware platforms. And now, possibly OPMs. The list goes on. Why didn’t these product categories achieve the potential that we imagined for them? There is no one answer. It’s often in the small details specific to each situation. AI in action presents an interesting use case, partly because it’s unfolding right now, partly because it seems so easy, and partly because it’s odd and unpredictable, even to the experts. I have often written about “the miracle, the grind, and the wall” with AI. We will look at a couple of examples of moving from the miracle to the grind. These moments provide good lessons in the challenges of product/market fit.
In my next post, I’ll examine product/market fit for universities in a changing landscape, focusing on applying CBE to an unusual test case. In the third post, I’ll explore product/market fit for EdTech interoperability standards and facilitating the growth of a healthier ecosystem.
Khanmigo: the grind behind the product
Khan Academy’s Kristen DiCerbo did us all a great service by writing openly about the challenges of producing a good AI lesson plan generator. They started with prompt engineering. Well-written prompts can seem like miracles. They’re like magic spells. Generating a detailed lesson plan in seconds with a well-written prompt is possible. But how good is that lesson plan? How good were the lesson plans that Khanmigo’s early prompts produced?
Kristen writes,
At first glance, it wasn’t bad. It produced what looked to be a decent lesson plan—at least on the surface. However, on closer inspection, we saw some issues, including the following:
- Lesson objectives just parroted the standard
- Warmups did not consistently cover the most logical prerequisite skills
- Incorrect answer keys for independent practice
- Sections of the plan were unpredictable in length and format
- The model seemed to sometimes ignore parts of the instructions in the prompt

Prompt Engineering a Lesson Plan: Harnessing AI for Effective Lesson Planning
You can’t tell the quality of the AI’s lesson plans without having experts examine them closely. You also want feedback from people who will actually use those lesson plans. I guarantee they will find problems that you will miss. Every time. Remember, the ultimate goal of product/market fit is to make something that the intended adopters will actually want. People will tolerate imperfections in a product. But which ones? What’s most important to them? How will they use the product? You can’t answer these questions confidently without the help of actual humans who would be using the product.
At any rate, Khan Academy realized their early prompt engineering attempts had several shortcomings. Here’s the first:
Khanmigo didn’t have enough information. There were too many undefined details for Khanmigo to infer and synthesize, such as state standards, target grade level, and prerequisites. Not to mention limits to Khanmigo’s subject matter expertise. This resulted in lesson plans that were too vague and/or inaccurate to provide significant value to teachers.
Prompt Engineering a Lesson Plan: Harnessing AI for Effective Lesson Planning
Read that passage carefully. With each type of information or expertise, ask yourself, “Where could I find that? Where is it written down in a form the AI can digest?” The answer is different for each one. How can the AI learn more about what state standards mean? Or about target grade levels? Prerequisites? Subject-matter expertise for each subject? No matter how much ChatGPT seems to know, it doesn’t know everything. And it is often completely ignorant about anything that isn’t well-documented on the internet. A human educator has to understand all these topics to write good lesson plans. A synthetic one does too. But a synthetic educator doesn’t have experience to draw on. It only has whatever human educators have publicly published about their experiences.
Think about the effort involved in documenting all these various types of knowledge for a synthetic educator. (This, by the way, is very similar to why learning analytics disappointed as a product category. The software needed to know too much that simply wasn’t available in the systems in order to make sense of the data.)
Here’s the second challenge that the Khanmigo team faced:
We were trying to accomplish too much with a single prompt. The longer a prompt got and the more detailed its instructions were, the more likely it was that parts of the prompt would be ignored. Trying to produce a document as complex and nuanced as a comprehensive lesson plan with a single prompt invariably resulted in lesson plans with neglected, unfocused, or entirely missing parts.
Prompt Engineering a Lesson Plan: Harnessing AI for Effective Lesson Planning
I suspect this is a subtle manifestation of the memory problem I wrote about in my last post. Even with a relatively short text like a complex prompt, the AI couldn’t hold onto all the details. The Khanmigo team ended up breaking the prompt into smaller pieces. This produced better results because the AI could “concentrate on”—or remember the details of—one step at a time. I’ll add that this approach provides more opportunities to put humans in the loop. An expert—or a user—can examine and modify the output of each step.
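To make the approach concrete, here is a minimal sketch of what breaking a lesson-plan prompt into steps can look like in code. To be clear, this is my illustration, not Khan Academy’s actual implementation; the `call_llm` helper is a stand-in for whichever model API you happen to use, and the prompts are deliberately simplified.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: send a prompt to whatever model API you use and return its text reply."""
    raise NotImplementedError


def draft_lesson_plan(standard: str, grade: str, topic: str) -> dict:
    """Build a lesson plan in small steps instead of one giant prompt."""
    plan = {}

    # Step 1: objectives only. A short, focused prompt is harder for the model to ignore.
    plan["objectives"] = call_llm(
        f"Write two or three measurable objectives for a grade {grade} lesson on "
        f"'{topic}'. Go beyond restating this standard: {standard}"
    )

    # Step 2: warm-up, conditioned on the objectives we just generated.
    plan["warmup"] = call_llm(
        "Suggest a five-minute warm-up reviewing the prerequisite skills for these "
        f"objectives:\n{plan['objectives']}"
    )

    # Step 3: independent practice with an answer key, again scoped to one job.
    plan["practice"] = call_llm(
        "Write five independent practice problems, with a worked answer key, for "
        f"these objectives:\n{plan['objectives']}"
    )

    # A human can review or edit each piece before the next step runs.
    return plan
```

Each step produces a small artifact that a person can inspect, which is the human-in-the-loop opportunity I mentioned above.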
We fantasize about AI doing work for us. In some cases, it’s not just a fantasy. I use AI to be more productive literally every day. But it fails me often. We can’t know what it will take for AI to solve any particular problem without looking closely at the product’s capabilities and the user’s very specific needs. This is product/market fit.
Learning design in the real world
Developing skill in product/market fit is hard. Think about all those different topics the Khanmigo team needed to know, and how well they had to understand each topic’s relevance to lesson planning in order to diagnose the gaps in the AI’s understanding.
Refining a product is also inherently iterative. No matter how good you are at product design, how well you know your audience, and how brilliant you are, you will be wrong about some of your ideas early on. Because people are complicated. Organizations are complicated. The skills workers need are often complicated and non-obvious. And the details of how the people need to work, individually and together, are often distinctive in ways that are invisible to them. Most people only know their own context. They take a lot for granted. Good product people spend their time uncovering these invisible assumptions and finding the commonalities and the differences. This is always a discovery process that takes time.
Learning design is a classic case of this problem. People have been writing and adopting learning design methodologies longer than I’ve been alive. The ADDIE model—”Analyze, Design, Develop, Implement, and Evaluate”—was created by Florida State University for the military in the 1970s. “Backward Design” was invented in 1949 by Ralph W. Tyler. Over the past 30 years, I’ve seen a handful of learning design or instructional design tools that attempt to scaffold and enforce these and other design methodologies. I’ve yet to see one get widespread adoption. Why? Poor product/market fit.
While the goal of learning design (or “instructional design,” to use the older term) is to produce a structured learning experience, the thought process of creating it is non-linear and iterative. As we develop and draft, we see areas that need tuning or improving. We move back and forth across the process. Nobody ever follows learning design methodologies strictly in practice. And I’m talking about trained learning design professionals. Untrained educators stray even further from the model. That’s why the two most popular learning design tools, by far, are Microsoft Word and Google Docs.
If you’ve ever used ChatGPT and prompt engineering to generate the learning design of a complex lesson, you’ve probably run into unexpected limits to its usefulness. The longer you spend tinkering with the lesson, the more likely your results are to get worse rather than better. It’s the same problem the Khanmigo team had. Yes, ChatGPT and Claude can now have long conversations. But both research and experience show us that they tend to forget the stuff in the middle. By itself, ChatGPT is useful in lesson design up to a point. But I find that when writing complex documents, I end up pasting different pieces of my conversation into Word and stitching them together.
And that’s OK. If that process saves me design time, that’s a win. But there are use cases where the memory problems are more serious in ways that I haven’t heard folks talking about yet.
Combining documents
Here’s a very common use case in learning design:
First, you start with a draft of a lesson or a chapter that already exists. Maybe it’s a chapter from an OpenStax textbook. Maybe it’s a lesson that somebody on your team wrote a while ago that needs updating. You like it, but you don’t love it.
You also have an article containing much of the information you want to fold into the new version you’re creating. If you were using a vendor’s textbook, you’d have to require the students to read the outdated lesson and then read the article separately. But this is content you’re allowed to revise. If you’re using the article in a way that doesn’t violate copyright—for example, because you’re using it to capture publicly known facts that have changed rather than something novel in the article itself—you can simply use the new information to revise the original lesson. That was often too much work the old way. But now we have ChatGPT, so, you know…magic.
While you’re at it, you’d like to improve the lesson’s diversity, equity, and inclusion (DEI). You see opportunities to write the chapter in ways that represent more of your students and include examples relevant to their lived experiences. You happen to have a document with a good set of DEI guidelines.
So you feed your original chapter, new article, and DEI guidelines to the AI. “ChatGPT, take the original lesson and update it with the new information from the article. Then apply the DEI guidelines, including examples in topics X, Y, and Z that represent different points of view. Abracadabra!”
You can write a better prompt than this one. But no matter how carefully you engineer your prompt, you will be disappointed with the results. Don’t take my word for it. Try it yourself.
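If you want to see for yourself, here is roughly what that naive single-prompt approach looks like as code. This is an illustrative sketch, not a recommendation; `call_llm` is again a placeholder for whatever model API you use, and nothing about the failure mode depends on a particular vendor.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: send a prompt to whatever model API you use and return its text reply."""
    raise NotImplementedError


def combine_documents_naively(original_lesson: str, new_article: str, dei_guidelines: str) -> str:
    """Stuff all three documents into a single prompt and hope for the best."""
    prompt = (
        "Update the original lesson below with the new information from the article, "
        "then apply the DEI guidelines, adding examples that represent different "
        "points of view. Preserve the original wording wherever it is still accurate.\n\n"
        f"ORIGINAL LESSON:\n{original_lesson}\n\n"
        f"ARTICLE:\n{new_article}\n\n"
        f"DEI GUIDELINES:\n{dei_guidelines}"
    )
    # Everything depends on the model holding three long documents in working memory
    # at once, which, as the next paragraphs explain, is not really what it does.
    return call_llm(prompt)
```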
Why does this happen? Because the generative AI doesn’t “remember” these three documents perfectly. Remember what I wrote in my last article:
The LLMs can be “trained” on data, which means they store information like how “beans” vs. “water” modify the likely meaning of “cool,” what words are most likely to follow “Cool the pot off in the,” and so on. When you hear AI people talking about model “weights,” this is what they mean.
Notice, however, that none of the original sentences are stored anywhere in their original form. If the LLM is trained on Wikipedia, it doesn’t memorize Wikipedia. It models the relationships among the words using combinations of vectors (or “matrices”) and probabilities. If you dig into the LLM looking for the original Wikipedia article, you won’t find it. Not exactly. The AI may become very good at capturing the gist of the article given enough billions of those tensor/workers. But the word-for-word article has been broken down and digested. It’s gone.
How You Will Never Be Able to Trust Generative AI (and Why That’s OK)
Your lesson and articles are gone. They’ve been digested. The AI remembers them, but it’s designed to remember the meaning, not the words. It’s not metaphorically sitting down with the original copy and figuring out where to insert new information or rewrite a paragraph. That may be fine. Maybe it will produce something better. But it’s a fundamentally different process than human editing. We won’t know if the results it generates have good product/market fit until we test them out with folks.
To the degree that you need to preserve the fidelity of the original documents, you’ve got a problem. And the more you push generative AI to do this kind of fine-tuning work across multiple documents, the worse it gets. You’re running headlong into one of your synthetic co-worker’s fundamental limitations. Again, you might get enough value from it to achieve a net gain in productivity. But you might not because this seemingly simple use case is pushing hard on functionality that hasn’t been designed, tested, and hardened for this kind of use.
Engineering around the problem
Any product/market fit problem has two sides: product and market. On the market side, how good will be good enough? I’ve specifically positioned my ALDA project as producing a first draft with many opportunities for a human in the loop. This is a common approach we’re seeing in educational content generation right now, for good reasons. We’re reducing the risk to the students. Risk is one reason the market might reject the product.
Another is failing to deliver the promised time savings. If the combination of the documents is too far off from the human’s goal, it will be rejected. Its speed will not make up for the time required for the human to fix its mistakes. We have to get as close to the human need as possible, mitigate the consequences of the remaining mistakes, and test to see whether we’ve achieved a cost/benefit balance good enough that users will adopt the product.
There is no perfect way to solve the memory problem. You will always need a human in the loop. But we could make a good step forward if we could get the designs solid enough to be directly imported into the learning platform and fine-tuned there, skipping the word processor step. Being able to do so requires tackling a host of problems, including (but not limited to) the memory issue. We don’t need the AI to get the combination of these documents perfect, but we do need it to get close enough that our users don’t need to dump the output into a full word processor to rewrite the draft.
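As a hedged sketch of what that could look like: ask the model to revise one section at a time and return structured output that can be validated and handed to a platform import, rather than a wall of prose bound for a word processor. The section names, JSON shape, and `call_llm` helper below are invented for illustration; real lesson schemas and real import formats are considerably messier.

```python
import json


def call_llm(prompt: str) -> str:
    """Placeholder: send a prompt to whatever model API you use and return its text reply."""
    raise NotImplementedError


# A made-up, simplified list of lesson sections. Real lesson schemas and real
# platform import formats (Common Cartridge, QTI, etc.) are much richer.
LESSON_SECTIONS = ["objectives", "warmup", "core_content", "practice", "assessment"]


def revise_section(section: str, lesson: str, article: str, guidelines: str) -> dict:
    """Revise one section at a time so the model has less to hold onto,
    and return structured data we can validate before importing it anywhere."""
    raw = call_llm(
        f"Revise only the '{section}' section of the lesson below, folding in any "
        "relevant updates from the article and following the guidelines. Respond "
        'with JSON shaped like {"section": "...", "html": "...", "changes": ["..."]}.\n\n'
        f"LESSON:\n{lesson}\n\nARTICLE:\n{article}\n\nGUIDELINES:\n{guidelines}"
    )
    data = json.loads(raw)  # fails loudly if the model drifts from the requested format
    assert data["section"] == section
    return data  # small enough for a human to review before it goes to the platform


def revise_lesson(lesson: str, article: str, guidelines: str) -> list:
    return [revise_section(s, lesson, article, guidelines) for s in LESSON_SECTIONS]
```

The particular schema doesn’t matter. What matters is that per-section, structured generation gives both the human reviewer and the importing platform something small and checkable at every step.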
When I raised this problem with a colleague who is a digital humanities scholar and an expert in AI, he paused before replying. “Nobody is working on this kind of problem right now,” he said. “On one side, AI experts are experimenting with improving the base models. On the other side, I see articles all the time about how educators can write better prompts. Your problem falls in between those two.”
Right. As a sector, we’re not discussing product/market fit for particular needs. The vendors are, each within their own circumscribed world. But on the customer side? I hear people tell me they’re conducting “experiments.” It sounds a bit like when university folk told me they were “working with learning analytics,” which turned out to mean that they were talking about working with learning analytics. I’m sure there are many prompt engineering workshops and many grants being written for fancy AI solutions that sound attractive to the National Science Foundation or whoever the grantor happens to be. But in the middle ground? Making AI usable to solve specific problems? I’m not seeing much of that yet.
The document combination problem can likely be addressed adequately well through a combination of approaches that improve the product and mitigate the consequences of the imperfections to make them more tolerable for the market. After consulting with some experts, I’ve come up with a combination of approaches to try first. Technologically, I know it will work. It doesn’t depend on cutting-edge developments. Will the market accept the results? Will the new approach be better than the old one? Or will it trip over some deal-breaker, like so many products before it?
I don’t know. I feel pretty good about my hypothesis. But I won’t know until real learning designers test it on real projects.
We have a dearth of practical, medium-difficulty experiments with real users right now. That is a big, big problem. It doesn’t matter how impressive the technology is if its capabilities aren’t the ones the users need to solve real-world problems. You can’t fix this gap with symposia, research grants, or even EdTech companies that have the skills but not necessarily the platform or business model you need.
The only way to do it is to get down into the weeds. Try to solve practical problems. Get real humans to tell you what does and doesn’t work for them in your first, second, third, fourth, and fifth tries. That’s what the ALDA project is all about. It’s not primarily about the end product. I am hopeful that ALDA itself will prove to be useful. But I’m not doing it because I want to commercialize a product. I’m doing it to teach and learn about product/market fit skills with AI in education. We need many more experiments like this.
We put too much faith in the miracle, forgetting that the grind and the wall are out there waiting for us. Folks in the education sector spend too much time staring at the sky, waiting for the EdTech space aliens to come and take us all to paradise.
I suggest that at least some of us should focus on solving today’s problems with today’s technology, getting it done today, while we wait for the aliens to arrive.