A few weeks ago, I saw a tweet that said “Writing code isn’t the problem. Controlling complexity is.” I wish I could remember who said that; I will be quoting it a lot in the future. That statement nicely summarizes what makes software development difficult. It’s not just memorizing the syntactic details of some programming language, or the many functions in some API, but understanding and managing the complexity of the problem you’re trying to solve.
We’ve all seen this many times. Lots of applications and tools start simple. They do 80% of the job well, maybe 90%. But that isn’t quite enough. Version 1.1 gets a few more features, more creep into version 1.2, and by the time you get to 3.0, an elegant user interface has turned into a mess. This increase in complexity is one reason that applications tend to become less useable over time. We also see this phenomenon as one application replaces another. RCS was useful, but didn’t do everything we needed it to; SVN was better; Git does just about everything you could want, but at an enormous cost in complexity. (Could Git’s complexity be managed better? I’m not the one to say.) OS X, which used to trumpet “It just works,” has evolved to “it used to just work”; the most user-centric Unix-like system ever built now staggers under the load of new and poorly thought-out features.
The problem of complexity isn’t limited to user interfaces; that may be the least important (though most visible) aspect of the problem. Anyone who works in programming has seen the source code for some project evolve from something short, sweet, and clean to a seething mass of bits. (These days, it’s often a seething mass of distributed bits.) Some of that evolution is driven by an increasingly complex world that requires attention to secure programming, cloud deployment, and other issues that didn’t exist a few decades ago. But even here: a requirement like security tends to make code more complex—but complexity itself hides security issues. Saying “yes, adding security made the code more complex” is wrong on several fronts. Security that’s added as an afterthought almost always fails. Designing security in from the start almost always leads to a simpler result than bolting security on as an afterthought, and the complexity will stay manageable if new features and security grow together. If we’re serious about complexity, the complexity of building secure systems needs to be managed and controlled in step with the rest of the software, otherwise it’s going to add more vulnerabilities.
That brings me to my main point. We’re seeing more code that’s written (at least in first draft) by generative AI tools, such as GitHub Copilot, ChatGPT (especially with Code Interpreter), and Google Codey. One advantage of computers, of course, is that they don’t care about complexity. But that advantage is also a significant disadvantage. Until AI systems can generate code as reliably as our current generation of compilers, humans will need to understand—and debug—the code they write. Brian Kernighan wrote that “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?” We don’t want a future that consists of code too clever to be debugged by humans—at least not until the AIs are ready to do that debugging for us. Really brilliant programmers write code that finds a way out of the complexity: code that may be a little longer, a little clearer, a little less clever so that someone can understand it later. (Copilot running in VSCode has a button that simplifies code, but its capabilities are limited.)
Furthermore, when we’re considering complexity, we’re not just talking about individual lines of code and individual functions or methods. Most professional programmers work on large systems that can consist of thousands of functions and millions of lines of code. That code may take the form of dozens of microservices running as asynchronous processes and communicating over a network. What is the overall structure, the overall architecture, of these programs? How are they kept simple and manageable? How do you think about complexity when writing or maintaining software that may outlive its developers? Millions of lines of legacy code going back as far as the 1960s and 1970s are still in use, much of it written in languages that are no longer popular. How do we control complexity when working with these?
Humans don’t manage this kind of complexity well, but that doesn’t mean we can check out and forget about it. Over the years, we’ve gradually gotten better at managing complexity. Software architecture is a distinct specialty that has only become more important over time. It’s growing more important as systems grow larger and more complex, as we rely on them to automate more tasks, and as those systems need to scale to dimensions that were almost unimaginable a few decades ago. Reducing the complexity of modern software systems is a problem that humans can solve—and I haven’t yet seen evidence that generative AI can. Strictly speaking, that’s not a question that can even be asked yet. Claude 2 has a maximum context—the upper limit on the amount of text it can consider at one time—of 100,000 tokens1; at this time, all other large language models are significantly smaller. While 100,000 tokens is huge, it’s much smaller than the source code for even a moderately sized piece of enterprise software. And while you don’t have to understand every line of code to do a high-level design for a software system, you do have to manage a lot of information: specifications, user stories, protocols, constraints, legacies and much more. Is a language model up to that?
Could we even describe the goal of “managing complexity” in a prompt? A few years ago, many developers thought that minimizing “lines of code” was the key to simplification—and it would be easy to tell ChatGPT to solve a problem in as few lines of code as possible. But that’s not really how the world works, not now, and not back in 2007. Minimizing lines of code sometimes leads to simplicity, but just as often leads to complex incantations that pack multiple ideas onto the same line, often relying on undocumented side effects. That’s not how to manage complexity. Mantras like DRY (Don’t Repeat Yourself) are often useful (as is most of the advice in The Pragmatic Programmer), but I’ve made the mistake of writing code that was overly complex to eliminate one of two very similar functions. Less repetition, but the result was more complex and harder to understand. Lines of code are easy to count, but if that’s your only metric, you will lose track of qualities like readability that may be more important. Any engineer knows that design is all about tradeoffs—in this case, trading off repetition against complexity—but difficult as these tradeoffs may be for humans, it isn’t clear to me that generative AI can make them any better, if at all.
I’m not arguing that generative AI doesn’t have a role in software development. It certainly does. Tools that can write code are certainly useful: they save us looking up the details of library functions in reference manuals, they save us from remembering the syntactic details of the less commonly used abstractions in our favorite programming languages. As long as we don’t let our own mental muscles decay, we’ll be ahead. I am arguing that we can’t get so tied up in automatic code generation that we forget about controlling complexity. Large language models don’t help with that now, though they might in the future. If they free us to spend more time understanding and solving the higher-level problems of complexity, though, that will be a significant gain.
Will the day come when a large language model will be able to write a million line enterprise program? Probably. But someone will have to write the prompt telling it what to do. And that person will be faced with the problem that has characterized programming from the start: understanding complexity, knowing where it’s unavoidable, and controlling it.
FootnotesWhile the AI group is still the largest, it’s notable that Programming, Web, and Security are all larger than they’ve been in recent months. One reason is certainly that we’re pushing AI news into other categories as appropriate. But I also think that it’s harder to impress with AI than it used to be. AI discussions have been much more about regulation and intellectual property—which makes me wonder whether legislation should be a separate category.
That notwithstanding, it’s important that OpenAI is now allowing API users to fine-tune their GPT-4 apps. It’s as-a-service, of course. And RISC-V finally appears to be getting some serious adoption. Could it compete with Atom and Intel? We shall see.
AITo follow up on our previous survey about low-code and no-code tools, we decided to run another short survey about tools specifically for software developers—including, but not limited to, GitHub Copilot and ChatGPT. We’re interested in how “developer enablement” tools of all sorts are changing the workplace. Our survey 1 showed that while these tools increased productivity, they aren’t without their costs. Both upskilling and retraining developers to use these tools are issues.
Few professional software developers will find it surprising that software development teams are respondents said that productivity is the biggest challenge their organization faced, and another 19% said that time to market and deployment speed are the biggest challenges. Those two answers are almost the same: decreasing time to market requires increasing productivity, and improving deployment speed is itself an increase in productivity. Together, those two answers represented 48% of the respondents, just short of half.
HR issues were the second-most-important challenge, but they’re nowhere near as pressing. 12% of the respondents reported that job satisfaction is the greatest challenge; 11% said that there aren’t good job candidates to hire; and 10% said that employee retention is the biggest issue. Those three challenges total 33%, just one-third of the respondents.
It’s heartening to realize that hiring and retention are still challenges in this time of massive layoffs, but it’s also important to realize that these issues are less important than productivity.
But the big issue, the issue we wanted to explore, isn’t the challenges themselves; it’s what organizations are doing to meet them. A surprisingly large percentage of respondents (28%) aren’t making any changes to become more productive. But 20% are changing their onboarding and upskilling processes, 15% are hiring new developers, and 13% are using self-service engineering platforms.
We found that the biggest struggle for developers working with new tools is training (34%), and another 12% said the biggest struggle is “ease of use.” Together, that’s almost half of all respondents (46%). That was a surprise, since many of these tools are supposed to be low- or no-code. We’re thinking specifically about tools like GitHub Copilot, Amazon CodeWhisperer, and other code generators, but almost all productivity tools claim to make life simpler. At least at first, that’s clearly not true. There’s a learning curve, and it appears to be steeper than we’d have guessed. It’s also worth noting that 13% of the respondents said that the tools “didn’t effectively solve the problems that developers face.”
Over half of the respondents (51%) said that their organizations are using self-service deployment pipelines to increase productivity. Another 13% said that while they’re using self-service pipelines, they haven’t seen an increase in productivity. So almost two-thirds of the respondents are using self-service pipelines for deployment, and for most of them, the pipelines are working—reducing the overhead required to put new projects into production.
Finally, we wanted to know specifically about the effect of GitHub Copilot, ChatGPT, and other AI-based programming tools. Two-thirds of the respondents (67%) reported that these tools aren’t in use at their organizations. We suspect this estimate is lowballing Copilot’s actual usage. Back in the early 2000s, a widely quoted survey reported that CIOs almost unanimously said that their IT organizations weren’t making use of open source. How little they knew! Actual usage of Copilot, ChatGPT, and similar tools is likely to be much higher than 33%. We’re sure that even if they aren’t using Copilot or ChatGPT on the job, many programmers are experimenting with these tools or using them on personal projects.
What about the 33% who reported that Copilot and ChatGPT are in use at their organizations? First, realize that these are early adopters: Copilot was only released a year and a half ago, and ChatGPT has been out for less than a year. It’s certainly significant that they (and similar tools) have grabbed a third of the market in that short a period. It’s also significant that making a commitment to a new way of programming—and these tools are nothing if not a new kind of programming—is a much bigger change than, say, signing up for a ChatGPT account.
11% of the respondents said their organizations use Copilot and ChatGPT, and that the tools are primarily useful to junior developers; 13% said they’re primarily useful to senior developers. Another 9% said that the tools haven’t yielded an increase in productivity. The difference between junior and senior developers is closer than we expected. Common wisdom is that Copilot is more of an advantage to senior programmers, who are better able to describe the problem they need to solve in an intricate set of prompts and to notice bugs in the generated code quickly. Our survey hints that the difference between senior and junior developers is relatively small—although they’re almost certainly using Copilot in different ways. Junior developers are using it to learn and to spend less time solving problems by looking up solutions on Stack Overflow or searching online documentation. Senior developers are using it to help design and structure systems, and even to create production code.
Is developer productivity an issue? Of course; it always is. Part of the solution is improved tooling: self-service deployment, code-generation tools, and other new technologies and ideas. Productivity tools—and specifically the successors to tools like Copilot—are remaking software development in radical ways. Software developers are getting value from these tools, but don’t let the buzz fool you: that value doesn’t come for free. Nobody’s going to sit down with ChatGPT, type “Generate an enterprise application for selling shoes,” and come away with something worthwhile. Each has its own learning curve, and it’s easy to underestimate how steep that curve can be. Developer productivity tools will be a big part of the future; but to take full advantage of those tools, organizations will need to plan for skills development.
I’m sure that nobody will be surprised that the number of searches for ChatGPT on the O’Reilly learning platform skyrocketed after its release in November, 2022. It might be a surprise how quickly it got to the top of our charts: it peaked in May as the 6th most common search query. Then it dropped almost as quickly: it dropped back to #8 in June, and fell further to #19 in July. At its peak, ChatGPT was in very exclusive company: it’s not quite on the level of Python, Kubernetes, and Java, but it’s in the mix with AWS and React, and significantly ahead of Docker.
A look at the number of searches for terms commonly associated with AI shows how dramatic this rise was:
ChatGPT came from nowhere to top all the AI search terms except for Machine Learning itself, which is consistently our #3 search term—and, despite ChatGPT’s dramatic decline in June and July, it’s still ahead of all other search terms relevant to AI. The number of searches for Machine Learning itself held steady, though it arguably declined slightly when ChatGPT appeared. What’s more interesting, though, is that the search term “Generative AI” suddenly emerged from the pack as the third most popular search term. If current trends continue, in August we might see more searches for Generative AI than for ChatGPT.
What can we make of this? Everyone knows that ChatGPT had one of the most successful launches of any software project, passing a million users in its first five days. (Since then, it’s been beaten by Facebook’s Threads, though that’s not really a fair comparison.) There are plenty of reasons for this surge. Talking computers have been a science fiction dream since well before Star Trek—by itself, that’s a good reason for the public’s fascination. ChatGPT might simplify common tasks, from doing research to writing essays to basic programming, so many people want to use it to save labor—though getting it to do quality work is more difficult than it seems at first glance. (We’ll leave the issue of whether this is “cheating” to the users, their teachers, and their employers.) And, while I’ve written frequently about how ChatGPT will change programming, it will undoubtedly have an even greater effect on non-programmers. It will give them the chance to tell computers what to do without programming; it’s the ultimate “low code” experience.
So there are plenty of reasons for ChatGPT to surge. What about other search terms? It’s easy to dismiss these search queries as also-rans, but they were all in the top 300 for May, 2023—and we typically have a few million unique search terms per month. Removing ChatGPT and Machine Learning from the previous graph makes it easier to see trends in the other popular search terms:
It’s mostly “up and to the right.” Three search terms stand out: Generative AI, LLM, and Langchain all follow similar curves: they start off with relatively moderate growth that suddenly becomes much steeper in February, 2023. We’ve already noted that the number of searches for Generative AI increased sharply since the release of ChatGPT, and haven’t declined in the past two months. Our users evidently prefer LLM to spelling out “Large Language Models,” but if you add these two search terms together, the total number of searches for July is within 1% of Generative AI. This surge didn’t really start until last November, when it was spurred by the appearance of ChatGPT—even though search terms like LLM were already in circulation because of GPT-3, DALL-E, StableDiffusion, Midjourney, and other language-based generative AI tools.
Unlike LLMs, Langchain didn’t exist prior to ChatGPT—but once it appeared, the number of searches took off rapidly, and didn’t decline in June and July. That makes sense; although it’s still early, Langchain looks like it will be the cornerstone of LLM-based software development. It’s a widely used platform for building applications that generate queries programmatically and that connects LLMs with each other, with databases, and with other software. Langchain is frequently used to look up relevant articles that weren’t in ChatGPT’s training data and package them as part of a lengthy prompt.
In this group, the only search term that seems to be in a decline is Natural Language Processing. Although large language models clearly fall into the category of NLP, we suspect that most users associate NLP with older approaches to building chatbots. Searches for Artificial Intelligence appear to be holding their own, though it’s surprising that there are so few searches for AI compared to Machine Learning. The difference stems from O’Reilly’s audience, which is relatively technical and prefers the more precise term Machine Learning. Nevertheless, the number of searches for AI rose with the release of ChatGPT, possibly because ChatGPT’s appeal wasn’t limited to the technical community.
Now that we’ve run through the data, we’re left with the big question: What happened to ChatGPT? Why did it decline from roughly 5,000 searches to slightly over 2,500 in a period of two months? There are many possible reasons. Perhaps students stopped using ChatGPT for homework assignments as graduation and summer vacation approached. Perhaps ChatGPT has saturated the world; people know what they need to know, and are waiting for the next blockbuster. An article in Ars Technica notes that ChatGPT usage declined from May to June, and suggests many possible causes, including attention to the Twitter/Threads drama and frustration because OpenAI implemented stricter guardrails to prevent abuse. It would be unfortunate if ChatGPT usage is declining because people can’t use it to generate abusive content, but that’s a different article…
A more important reason for this decline might be that ChatGPT is no longer the only game in town. There are now many alternative language models. Most of these alternatives descend from Meta’s LLaMA and Georgi Gerganov’s llama.cpp (which can run on laptops, cell phones, and even Raspberry Pi). Users can train these models to do whatever they want. Some of these models already have chat interfaces, and all of them could support chat interfaces with some fairly simple programming. None of these alternatives generate significant search traffic at O’Reilly, but that doesn’t mean that they won’t in the future, or that they aren’t an important part of the ecosystem. Their proliferation is an important piece of evidence about what’s happening among O’Reilly’s users. AI developers now need to ask a question that didn’t even exist last November: should they build on large foundation models like ChatGPT or Google’s Bard, using public APIs and paying by the token? Or should they start with an open source model that can run locally and be trained for their specific application?
This last explanation makes a lot of sense in context. We’ve moved beyond the initial phase, when ChatGPT was a fascinating toy. We’re now building applications and incorporating language models into products, so trends in search terms have shifted accordingly. A developer interested in building with large language models needs more context; learning about ChatGPT by itself isn’t enough. Developers who want to learn about language models need different kinds of information, information that’s both deeper and broader. They need to learn about how generative AI works, about new LLMs, about programming with Langchain and other platforms. All of these search terms increased while ChatGPT declined. Now that there are options, and now that everyone has had a chance to try out ChatGPT, the first step in an AI project isn’t to search for ChatGPT. It’s to get a sense of the landscape, to discover the possibilities.
Searches for ChatGPT peaked quickly, and are now declining rapidly—and who knows what August and September will bring? (We wouldn’t be surprised to see ChatGPT bounce back as students return to school and homework assignments.) The real news is that ChatGPT is no longer the whole story: you can’t look at the decline in ChatGPT without also considering what else our users are searching for as they start building AI into other projects. Large language models are very clearly part of the future. They will change the way we work and live, and we’re just at the start of the revolution.
Artificial Intelligence continues to dominate the news. In the past month, we’ve seen a number of major updates to language models: Claude 2, with its 100,000 token context limit; LLaMA 2, with (relatively) liberal restrictions on use; and Stable Diffusion XL, a significantly more capable version of Stable Diffusion. Does Claude 2’s huge context really change what the model can do? And what role will open access and open source language models have as commercial applications develop?
Artificial IntelligenceIf you’re reading this, chances are you’ve played around with using AI tools like ChatGPT or GitHub Copilot to write code for you. Or even if you haven’t yet, then you’ve at least heard about these tools in your newsfeed over the past year. So far I’ve read a gazillion blog posts about people’s experiences with these AI coding assistance tools. These posts often recount someone trying ChatGPT or Copilot for the first time with a few simple prompts, seeing how it does for some small self-contained coding tasks, and then making sweeping claims like “WOW this exceeded all my highest hopes and wildest dreams, it’s going to replace all programmers in five years!” or “ha look how incompetent it is … it couldn’t even get my simple question right!”
I really wanted to go beyond these quick gut reactions that I’ve seen so much of online, so I tried using ChatGPT for a few weeks to help me implement a hobby software project and took notes on what I found interesting. This article summarizes what I learned from that experience. The inspiration (and title) for it comes from Mike Loukides’ Radar article on Real World Programming with ChatGPT, which shares a similar spirit of digging into the potential and limits of AI tools for more realistic end-to-end programming tasks.
Setting the Stage: Who Am I and What Am I Trying to Build?I’m a professor who is interested in how we can use LLMs (Large Language Models) to teach programming. My student and I recently published a research paper on this topic, which we summarized in our Radar article Teaching Programming in the Age of ChatGPT. Our paper reinforces the growing consensus that LLM-based AI tools such as ChatGPT and GitHub Copilot can now solve many of the small self-contained programming problems that are found in introductory classes. For instance, problems like “write a Python function that takes a list of names, splits them by first and last name, and sorts by last name.” It’s well-known that current AI tools can solve these kinds of problems even better than many students can. But there’s a huge difference between AI writing self-contained functions like these and building a real piece of software end-to-end. I was curious to see how well AI could help students do the latter, so I wanted to first try doing it myself.
I needed a concrete project to implement with the help of AI, so I decided to go with an idea that had been in the back of my head for a while now: Since I read a lot of research papers for my job, I often have multiple browser tabs open with the PDFs of papers I’m planning to read. I thought it would be cool to play music from the year that each paper was written while I was reading it, which provides era-appropriate background music to accompany each paper. For instance, if I’m reading a paper from 2019, a popular song from that year could start playing. And if I switch tabs to view a paper from 2008, then a song from 2008 could start up. To provide some coherence to the music, I decided to use Taylor Swift songs since her discography covers the time span of most papers that I typically read: Her main albums were released in 2006, 2008, 2010, 2012, 2014, 2017, 2019, 2020, and 2022. This choice also inspired me to call my project Swift Papers.
Swift Papers felt like a well-scoped project to test how well AI handles a realistic yet manageable real-world programming task. Here’s how I worked on it: I subscribed to ChatGPT Plus and used the GPT-4 model in ChatGPT (first the May 12, 2023 version, then the May 24 version) to help me with design and implementation. I also installed the latest VS Code (Visual Studio Code) with GitHub Copilot and the experimental Copilot Chat plugins, but I ended up not using them much. I found it easier to keep a single conversational flow within ChatGPT rather than switching between multiple tools. Lastly, I tried not to search for help on Google, Stack Overflow, or other websites, which is what I would normally be doing while programming. In sum, this is me trying to simulate the experience of relying as much as possible on ChatGPT to get this project done.
Getting Started: Setup Trials and TribulationsHere’s the exact prompt I used to start my conversation with ChatGPT using GPT-4:
Act as a software developer to help me build something that will play music from a time period that matches when an academic paper I am reading in the browser was written.
I purposely kept this prompt high-level and underspecified since I wanted ChatGPT to guide me toward design and implementation ideas without me coming in with preconceived notions.
ChatGPT immediately suggested a promising direction—making a browser extension that gets the date of the research paper PDF in the currently-active tab and calls a music streaming API to play a song from that time period. Since I already had a YouTube Music account, I asked whether I could use it, but ChatGPT said that YouTube Music doesn’t have an API. We then brainstormed alternative ideas like using a browser automation tool to programmatically navigate and click on parts of the YouTube Music webpage. ChatGPT gave me some ideas along these lines but warned me that, “It’s important to note that while this approach doesn’t use any official APIs, it’s more brittle and more subject to break if YouTube Music changes their website structure. […] keep in mind that web scraping and browser automation can be complex, and handling all of the edge cases can be a significant amount of work. […] using APIs might be a more reliable and manageable solution.” That warning convinced me to drop this idea. I recalled that ChatGPT had recommended the Spotify Web API in an earlier response, so I asked it to teach me more about what it can do and tell me why I should use it rather than YouTube Music. It seemed like Spotify had what I needed, so I decided to go with it. I liked how ChatGPT helped me work through the tradeoffs of these initial design decisions before diving head-first into coding.
Next we worked together to set up the boilerplate code for a Chrome browser extension, which I’ve never made before. ChatGPT started by generating a manifest.json file for me, which holds the configuration settings that every Chrome extension needs. I didn’t know it at the time, but manifest.json would cause me a bunch of frustration later on. Specifically:
Wrestling with all these finicky details of manifest.json before I could begin any real coding felt like death by a thousand cuts. In addition, ChatGPT generated other starter code in the chat, which I copied into new files in my VS Code project:
Intermission 1: ChatGPT as a Personalized TutorAs shown above, a typical Chrome extension like mine has at least three JavaScript files: a background script, a content script, and a pop-up script. At this point I wanted to learn more about what all these files are meant to do rather than continuing to obediently copy-paste code from ChatGPT into my project. Specifically, I discovered that each file has different permissions for what browser or page components it can access, so all three must coordinate to make the extension work as intended. Normally I would read tutorials about how this all fits together, but the problem with tutorials is that they are not customized to my specific use case. Tutorials provide generic conceptual explanations and use made-up toy examples that I can’t relate to. So I end up needing to figure out how their explanations may or may not apply to my own context.
In contrast, ChatGPT can generate personalized tutorials that use my own Swift Papers project as the example in its explanations! For instance, when it explained to me what a content script does, it added that “For your specific project, a content script would be used to extract information (the publication date) from the academic paper’s webpage. The content script can access the DOM of the webpage, find the element that contains the publication date, and retrieve the date.” Similarly, it taught me that “Background scripts are ideal for handling long-term or ongoing tasks, managing state, maintaining databases, and communicating with remote servers. In your project, the background script could be responsible for communicating with the music API, controlling the music playback, and storing any data or settings that need to persist between browsing sessions.”
I kept asking ChatGPT follow-up questions to get it to teach me more nuances about how Chrome extensions worked, and it grounded its explanations in how those concepts applied to my Swift Papers project. To accompany its explanations, it also generated relevant example code that I could try out by running my extension. These explanations clicked well in my head because I was already deep into working on Swift Papers. It was a much better learning experience than, say, reading generic getting-started tutorials that walk through creating example extensions like “track your page reading time” or “remove clutter from a webpage” or “manage your tabs better” … I couldn’t bring myself to care about those examples since THEY WEREN’T RELEVANT TO ME! At the time, I cared only about how these concepts applied to my own project, so ChatGPT shined here by generating personalized mini-tutorials on-demand.
Another great side-effect of ChatGPT teaching me these concepts directly within our ongoing chat conversation is that whenever I went back to work on Swift Papers after a few days away from it, I could scroll back up in the chat history to review what I recently learned. This reinforced the knowledge in my head and got me back into the context of resuming where I last left off. To me, this is a huge benefit of a conversational interface like ChatGPT versus an IDE autocomplete interface like GitHub Copilot, which doesn’t leave a trace of its interaction history. Even though I had Copilot installed in VS Code as I was working on Swift Papers, I rarely used it (beyond simple autocompletions) since I liked having a chat history in ChatGPT to refer back to in later sessions.
Next Up: Choosing and Installing a Date Parsing LibraryIdeally Swift Papers would infer the date when an academic paper was written by analyzing its PDF file, but that seemed too hard to do since there isn’t a standard place within a PDF where the publication date is listed. Instead what I decided to do was to parse the “landing pages” for each paper that contains metadata such as its title, abstract, and publication date. Many papers I read are linked from a small handful of websites, such as the ACM Digital Library, arXiv, or Google Scholar, so I could parse the HTML of those landing pages to extract publication dates. For instance, here’s the landing page for the classic Beyond being there paper:
I wanted to parse the “Published: 01 June 1992” string on that page to get 1992 as the publication year. I could’ve written this code by hand, but I wanted to try using a JavaScript date parsing library since it would be more robust to date format variations that appear on various websites (e.g., using “22” for the year 2022). Also, since any real-world software project will need to use external libraries, I wanted to see how well ChatGPT could help me choose and install libraries.
ChatGPT suggested two libraries for me: Moment.js and chrono-node. However, it warned me about Moment.js: “as of September 2020, it is considered a legacy project and not recommended for new projects as the team is not planning on doing any new development or maintenance.” I verified this was true by seeing the same warning on the Moment.js homepage. But still, I liked how Moment.js was available as a single self-contained file that I could directly include into my extension without using a package manager like npm or a bundler like webpack (the fewer external tools I needed to set up, the better!). Or so I thought … ChatGPT led me to believe that I could get by without npm and webpack, but later I discovered that this only works in the old Manifest v2 extension format that is no longer supported by Chrome. It turns out that with Manifest v3, all external libraries must be bundled together using tools like npm and webpack in order for the extension to import them. So it goes… once more I got bitten by the fact that ChatGPT was biased toward producing outdated information for v2 rather than the newer v3, presumably because there was a lot more information about v2 in its training data.
Since I had to use npm and webpack anyways, I decided to go with chrono-node since it seemed more robust and up-to-date (no pun intended). I had ChatGPT help me set up my webpack configuration file (webpack.config.js) and things almost seemed to work, except that I got a mysterious error. When I pasted the error message into ChatGPT, it correctly diagnosed the problem, which was something related to (surprise surprise!) webpack’s default settings not being compatible with the security restrictions of my extension’s Manifest v3 format. It also suggested a working fix to webpack.config.js:
This error message is related to the Content Security Policy (CSP) of Chrome extensions. […] By default, Manifest V3 disallows the use of eval() and the Function() constructor, which are typically used for evaluating or creating code from strings […] Webpack, by default, includes a small amount of inline JavaScript code for handling hot module replacement and some other features. This code uses eval() which is not allowed in the context of a Chrome extension. To solve this, you can adjust your webpack configuration to output code in a format that doesn’t rely on eval(). In your webpack.config.js, you can set the devtool option to ‘none’ or use the ‘source-map’ setting […]
Here again ChatGPT showed me that it clearly knew what the problem was (since it told me after I fed it the error message!) and how to fix it. So why didn’t it produce the correct webpack configuration file in the first place?
More generally, several times I’ve seen ChatGPT produce code that I felt might be incorrect. Then when I tell it that there might be a bug in a certain part, it admits its mistake and produces the correct code in response. If it knew that its original code was incorrect, then why didn’t it generate the correct code in the first place?!? Why did I have to ask it to clarify before it admitted its mistake? I’m not an expert at how LLMs work internally, but my layperson guess is that it may have to do with the fact that ChatGPT generates code linearly one token at a time, so it may get ‘stuck’ near local maxima (with code that mostly works but is incorrect in some way) while it is navigating the enormous abstract space of possible output code tokens; and it can’t easily backtrack to correct itself as it generates code in a one-way linear stream. But after it finishes generating code, when the user asks it to review that code for possible errors, it can now “see” and analyze all of that code at once. This comprehensive view of the code may enable ChatGPT to find bugs better, even if it couldn’t avoid introducing those bugs in the first place due to how it incrementally generates code in a one-way stream. (This isn’t an accurate technical explanation, but it’s how I informally think about it.)
Intermission 2: ChatGPT as a UX Design ConsultantNow that I had a basic Chrome extension that could extract paper publication dates from webpages, the next challenge was using the Spotify API to play era-appropriate Taylor Swift songs to accompany these papers. But before embarking on another coding-intensive adventure, I wanted to switch gears and think more about UX (user experience). I got so caught up in the first few hours of getting my extension set up that I hadn’t thought about how this app ought to work in detail. What I needed at this time was a UX design consultant, so I wanted to see if ChatGPT could play this role.
Note that up until now I had been doing everything in one long-running chat session that focused on coding-related questions. That was great because ChatGPT was fully “in the zone” and had a very long conversation (spanning several hours over multiple days) to use as context for generating code suggestions and technical explanations. But I didn’t want all that prior context to influence our UX discussion, so I decided to begin again by starting a brand-new session with the following prompt:
You are a Ph.D. graduate in Human-Computer Interaction and now a senior UX (user experience) designer at a top design firm. Thus, you are very familiar with both the experience of reading academic papers in academia and also designing amazing user experiences in digital products such as web applications. I am a professor who is creating a Chrome Extension for fun in order to prototype the following idea: I want to make the experience of reading academic papers more immersive by automatically playing Taylor Swift songs from the time period when each paper was written while the reader is reading that particular paper in Chrome. I have already set up all the code to connect to the Spotify Web API to programmatically play Taylor Swift songs from certain time periods. I have also already set up a basic Chrome Extension that knows what webpages the user has open in each tab and, if it detects that a webpage may contain metadata about an academic paper then it parses that webpage to get the year the paper was written in, in order to tell the extension what song to play from Spotify. That is the basic premise of my project.
Your job is to serve as a UX design consultant to help me design the user experience for such a Chrome Extension. Do not worry about whether it is feasible to implement the designs. I am an experienced programmer so I will tell you what ideas are or are not feasible to implement. I just want your help with thinking through UX design.
As our session progressed, I was very impressed with ChatGPT’s ability to help me brainstorm how to handle different user interaction scenarios. That said, I had to give it some guidance upfront using my knowledge of UX design: I started by asking it to come up with a few user personas and then to build up some user journeys for each. Given this initial prompting, ChatGPT was able to help me come up with practical ideas that I didn’t originally consider all too well, especially for handling unusual edge cases (e.g., what should happen to the music when the user switches between tabs very quickly?). The back-and-forth conversational nature of our chat made me feel like I was talking to a real human UX design consultant.
I had a lot of fun working with ChatGPT to refine my initial high-level ideas into a detailed plan for how to handle specific user interactions within Swift Papers. The culmination of our consulting session was ChatGPT generating ASCII diagrams of user journeys through Swift Papers, which I could later refer to when implementing this logic in code. Here’s one example:
Reflecting back, this session was productive because I was familiar enough with UX design concepts to steer the conversation towards more depth. Out of curiosity, I started a new chat session with exactly the same UX consultant prompt as above but then played the part of a total novice instead of guiding it:
I don’t know anything about UX design. Can you help me get started since you are the expert?
The conversation that followed this prompt was far less useful since ChatGPT ended up giving me a basic primer on UX Design 101 and offering high-level suggestions for how I can start thinking about the user experience of Swift Papers. I didn’t want to nudge it too hard since I was pretending to be a novice, and it wasn’t proactive enough to ask me clarifying questions to probe deeper. Perhaps if I had prompted it to be more proactive at the start, then it could have elicited more information even from a novice.
This digression reinforces the widely-known consensus that what you get out of LLMs like ChatGPT is only as good as the prompts you’re able to put in. There’s all of this relevant knowledge hiding inside its neural network mastermind of billions and billions of LLM parameters, but it’s up to you to coax it into revealing what it knows by taking the lead in conversations and crafting the right prompts to direct it toward useful responses. Doing so requires a degree of expertise in the domain you’re asking about, so it’s something that beginners would likely struggle with.
The Last Big Hurdle: Working with the Spotify APIAfter ChatGPT helped me with UX design, the last hurdle I had to overcome was figuring out how to connect my Chrome extension to the Spotify Web API to select and play music. Like my earlier adventure with installing a date parsing library, connecting to web APIs is another common real-world programming task, so I wanted to see how well ChatGPT could help me with it.
The gold standard here is an expert human programmer who has a lot of experience with the Spotify API and who is good at teaching novices. ChatGPT was alright for getting me started but ultimately didn’t meet this standard. My experience here showed me that human experts still outperform the current version of ChatGPT along the following dimensions:
In the end I got this Spotify API setup working by doing some old-fashioned web searching to supplement my ChatGPT conversation. (I did try the ChatGPT + Bing web search plugin for a bit, but it was slow and didn’t produce useful results, so I couldn’t tolerate it any more and just shut it off.) The breakthrough came as I was browsing a GitHub repository of Spotify Web API example code. I saw an example for Node.js that seemed to do what I wanted, so I copy-pasted that code snippet into ChatGPT and told it to adapt the example for my Swift Papers app (which isn’t using Node.js):
Here’s some example code using Implicit Grant Flow from Spotify’s documentation, which is for a Node.js app. Can you adapt it to fit my chrome extension? [I pasted the code snippet here]
ChatGPT did a good job at “translating” that example into my context, which was exactly what I needed at the moment to get unstuck. The code it generated wasn’t perfect, but it was enough to start me down a promising path that would eventually lead me to get the Spotify API working for Swift Papers. Reflecting back, I later realized that I had manually done a simple form of RAG (Retrieval Augmented Generation) here by using my intuition to retrieve a small but highly-relevant snippet of example code from the vast universe of all code on the internet and then asking a super-specific question about it. (However, I’m not sure a beginner would be able to scour the web to find such a relevant piece of example code like I did, so they would probably still be stuck at this step because ChatGPT alone wasn’t able to generate working code without this extra push from me.)
Epilogue: What Now?I have a confession: I didn’t end up finishing Swift Papers. Since this was a hobby project, I stopped working on it after about two weeks when my day-job got more busy. However, I still felt like I completed the initial hard parts and got a sense of how ChatGPT could (and couldn’t) help me along the way. To recap, this involved:
After laying this groundwork, I was able to start getting into the flow of an edit-run-debug cycle where I knew exactly where to add code to implement a new feature, how to run it to assess whether it did what I intended, and how to debug. So even though I stopped working on this project due to lack of time, I got far enough to see how completing Swift Papers would be “just a matter of programming.” Note that I’m not trying to trivialize the challenges involved in programming, since I’ve done enough of it to know that the devil is in the details. But these coding-specific details are exactly where AI tools like ChatGPT and GitHub Copilot shine! So even if I had continued adding features throughout the coming weeks, I don’t feel like I would’ve gotten any insights about AI tools that differ from what many others have already written about. That’s because once the software environment has been set up (e.g., libraries, frameworks, build systems, permissions, API authentication keys, and other plumbing to hook things together), then the task at hand reduces to a self-contained and well-defined programming problem, which AI tools excel at.
In sum, my goal in writing this article was to share my experiences using ChatGPT for the more open-ended tasks that came before my project turned into “just a matter of programming.” Now, some may argue that this isn’t “real” programming since it feels like just a bunch of mundane setup and configuration work. But I believe that if “real-world” programming means creating something realistic with code, then “real-real-world” programming (the title of this article!) encompasses all these tedious and idiosyncratic errands that are necessary before any real programming can begin. And from what I’ve experienced so far, this sort of work isn’t something humans can fully outsource to AI tools yet. Long story short, someone today can’t just give AI a high-level description of Swift Papers and have a robust piece of software magically pop out the other end. I’m sure people are now working on the next generation of AI that can bring us closer to this goal (e.g., much longer context windows with Claude 2 and retrieval augmented generation with Cody), so I’m excited to see what’s in store. Perhaps future AI tool developers could use Swift Papers as a benchmark to assess how well their tool performs on an example real-real-world programming task. Right now, widely-used benchmarks for AI code generation (e.g., HumanEval, MBPP) consist of small self-contained tasks that appear in introductory classes, coding interviews, or programming competitions. We need more end-to-end, real-world benchmarks to drive improvements in these AI tools.
Lastly, switching gears a bit, I also want to think more in the future about how AI tools can teach novices the skills they need to create realistic software projects like Swift Papers rather than doing all the implementation work for them. At present, ChatGPT and Copilot are reasonably good “doers” but not nearly as good at being teachers. This is unsurprising since they were designed to carry out instructions like a good assistant would, not to be an effective teacher who provides pedagogically-meaningful guidance. With the proper prompting and fine-tuning, I’m sure they can do much better here, and organizations like Khan Academy are already customizing GPT-4 to become a personalized tutor. I’m excited to see how things progress in this fast-moving space in the coming months and years. In the meantime, for more thoughts about AI coding tools in education, check out this other recent Radar article that I co-authored, Teaching Programming in the Age of ChatGPT, which summarizes our research paper about this topic.
Imagine for a minute that you’re a programming instructor who’s spent many hours making creative homework problems to introduce your students to the world of programming. One day, a colleague tells you about an AI tool called ChatGPT. To your surprise (and alarm), when you give it your homework problems, it solves most of them perfectly, maybe even better than you can! You realize that by now, AI tools like ChatGPT and GitHub Copilot are good enough to solve all of your class’s homework problems and affordable enough that any student can use them. How should you teach students in your classes knowing that these AI tools are widely available?
I’m Sam Lau from UC San Diego, and my Ph.D. advisor (and soon-to-be faculty colleague) Philip Guo and I are presenting a research paper at the International Computing Education Research conference (ICER) on this very topic. We wanted to know:
How are computing instructors planning to adapt their courses as more and more students start using AI coding assistance tools such as ChatGPT and GitHub Copilot?
To answer this question, we gathered a diverse sample of perspectives by interviewing 20 introductory programming instructors at universities across 9 countries (Australia, Botswana, Canada, Chile, China, Rwanda, Spain, Switzerland, United States) spanning all 6 populated continents. To our knowledge, our paper is the first empirical study to gather instructor perspectives about these AI coding tools that more and more students will likely have access to in the future.
Here’s a summary of our findings:
Short-Term Plans: Instructors Want to Stop Students from CheatingEven though we didn’t specifically ask about cheating in our interviews, all of the instructors we interviewed mentioned it as a primary reason to make changes to their courses in the short term. Their reasoning was: If students could easily get answers to their homework questions using AI tools, then they won’t need to think deeply about the material, and thus won’t learn as much as they should. Of course, having an answer key isn’t a new problem for instructors, who have always worried about students copying off each other or online resources like Stack Overflow. But AI tools like ChatGPT generate code with slight variations between responses, which is enough to fool most plagiarism detectors that instructors have available today.
The deeper issue for instructors is that if AI tools can easily solve problems in introductory courses, students who are learning programming for the first time might be led to believe that AI tools can correctly solve any programming task, which can cause them to grow overly reliant on them. One instructor described this as not just cheating, but “cheating badly” because AI tools generate code that’s incorrect in subtle ways that students might not be able to understand.
To discourage students from becoming over-reliant on AI tools, instructors used a mix of strategies, including making exams in-class and on-paper, and also having exams count for more of students’ final grades. Some instructors also explicitly banned AI tools in class, or exposed students to the limitations of AI tools. For example, one instructor copied old homework questions into ChatGPT as a live demo in a lecture and asked students to critique the strengths and weaknesses of the AI-generated code. That said, instructors considered these strategies short-term patches; the sudden appearance of ChatGPT at the end of 2022 meant that instructors needed to make adjustments before their courses started in 2023, which was when we interviewed them for our study.
Longer-Term Plans (Part 1): Ideas to Resist AI ToolsIn the next part of our study, instructors brainstormed many ideas about how to approach AI tools longer-term. We split up these ideas into two main categories: ideas that resist AI tools, and ideas that embrace them. Do note that most instructors we interviewed weren’t completely on one side or the other—they shared a mix of ideas from both categories. That said, let’s start with why some instructors talked about resisting AI tools, even in the longer term.
The most common reason for wanting to resist AI tools was the concern that students wouldn’t learn the fundamentals of programming. Several instructors drew an analogy to using a calculator in math class: using AI tools could be like, in the words of one of our interview participants, “giving kids a calculator and they can play around with a calculator, but if they don’t know what a decimal point means, what do they really learn or do with it? They may not know how to plug in the right thing, or they don’t know how to interpret the answer.” Others mentioned ethical objections to AI. For example, one instructor was worried about recent lawsuits around Copilot’s use of open-source code as training data without attribution. Others shared concerns over the training data bias for AI tools.
To resist AI tools practically, instructors proposed ideas for designing “AI-proof” homework assignments, for example, by using a custom-built library for their course. Also, since AI tools are typically trained on U.S./English-centric data, instructors from other countries thought that they could make their assignments harder for AI to solve by including local cultural and language context (e.g. slang) from their countries.
Instructors also brainstormed ideas for AI-proof assessments. One common suggestion was to use in-person paper exams since proctors could better ensure that students were only using paper and pencil. Instructors also mentioned that they could try oral exams where students either talk to a course staff member in-person, or record a video explaining what their code does. Although these ideas were first suggested to help keep assessments meaningful, instructors also pointed out that these assessments could actually improve pedagogy by giving students a reason to think more deeply about why their code works rather than simply trying to get code that produces a correct answer.
Longer-Term Plans (Part 2): Ideas to Embrace AI ToolsAnother group of ideas sought to embrace AI tools in introductory programming courses. The instructors we interviewed mentioned several reasons for wanting this future. Most commonly, instructors felt that AI coding tools would become standard for programmers; since “it’s inevitable” that professionals will use AI tools on the job, instructors wanted to prepare students for their future jobs. Related to this, some instructors thought that embracing AI tools could make their institutions more competitive by getting ahead of other universities that were more hesitant about doing so.
Instructors also saw potential learning benefits to using AI tools. For example, if these tools make it so that students don’t need to spend as long wrestling with programming syntax in introductory courses, students could spend more time learning about how to better design and engineer programs. One instructor drew an analogy to compilers: “We don’t need to look at 1’s and 0’s anymore, and nobody ever says, ‘Wow what a big problem, we don’t write machine language anymore!’ Compilers are already like AI in that they can outperform the best humans in generating code.” And in contrast to concerns that AI tools could harm equity and access, some instructors thought that they could make programming less intimidating and thus more accessible by letting students start coding using natural language.
Instructors also saw many potential ways to use AI tools themselves. For example, many taught courses with over a hundred students, where it would be too time-consuming to give individual feedback to each student. Instructors thought that AI tools trained on their class’s data could potentially give personalized help to each student, for example by explaining why a piece of code doesn’t work. Instructors also thought AI tools could help generate small practice problems for their students.
To prepare students for a future where AI tools are widespread, instructors mentioned that they could spend more time in class on code reading and critique rather than writing code from scratch. Indeed, these skills could be useful in the workplace even today, where programmers spend significant amounts of time reading and reviewing other people’s code. Instructors also thought that AI tools gave them the opportunity to give more open-ended assignments, and even have students collaborate with AI directly on their work, where an assignment would ask students to generate code using AI and then iterate on the code until it was both correct and efficient.
ReflectionsOur study findings capture a rare snapshot in time in early 2023 as computing instructors are just starting to form opinions about this fast-growing phenomenon but have not yet converged to any consensus about best practices. Using these findings as inspiration, we synthesized a diverse set of open research questions regarding how to develop, deploy, and evaluate AI coding tools for computing education. For instance, what mental models do novices form both about the code that AI generates and about how the AI works to produce that code? And how do those novice mental models compare to experts’ mental models of AI code generation? (Section 7 of our paper has more examples.)
We hope that these findings, along with our open research questions, can spur conversations about how to work with these tools in effective, equitable, and ethical ways.
Check out our paper here and email us if you’d like to discuss anything related to it!
From “Ban It Till We Understand It” to “Resistance is Futile”: How University Programming Instructors Plan to Adapt as More Students Use AI Code Generation and Explanation Tools such as ChatGPT and GitHub Copilot. Sam Lau and Philip J. Guo. ACM Conference on International Computing Education Research (ICER), August 2023.
There’s a lot of angst about software developers “losing their jobs” to AI, being replaced by a more intelligent version of ChatGPT, GitHub’s Copilot, Google’s Codey, or something similar. Matt Welsh has been talking and writing about the end of programming as such. He’s asking whether large language models eliminate programming as we know it, and he’s excited that the answer is “yes”: eventually, if not in the immediate future. But what does this mean in practice? What does this mean for people who earn their living from writing software?
Some companies will certainly value AI as a tool for replacing human effort, rather than for augmenting human capabilities. Programmers who work for those companies risk losing their jobs to AI. If you work for one of those organizations, I’m sorry for you, but it’s really an opportunity. Despite the well-publicized layoffs, the job market for programmers is great, it’s likely to remain great, and you’re probably better off finding an employer who doesn’t see you as an expense to be minimized. It’s time to learn some new skills and find an employer who really values you.
But the number of programmers who are “replaced by AI” will be small. Here’s why and how the use of AI will change the discipline as a whole. I did a very non-scientific study of the amount of time programmers actually spend writing code. OK, I just typed “How much of a software developer’s time is spent coding” into the search bar and looked at the top few articles, which gave percentages ranging from 10% to 40%. My own sense, from talking to and observing many people over the years, falls into the lower end of that range: 15% to 20%.
ChatGPT won’t make the 20% of their time that programmers spend writing code disappear completely. You still have to write prompts, and we’re all in the process of learning that if you want ChatGPT to do a good job, the prompts have to be very detailed. How much time and effort does that save? I’ve seen estimates as high as 80%, but I don’t believe them; I think 25% to 50% is more reasonable. If 20% of your time is spent coding, and AI-based code generation makes you 50% more efficient, then you’re really only getting about 10% of your time back. You can use it to produce more code—I’ve yet to see a programmer who was underworked, or who wasn’t up against an impossible delivery date. Or you can spend more time on the “rest of the job,” the 80% of your time that wasn’t spent writing code. Some of that time is spent in pointless meetings, but much of “the rest of the job” is understanding the user’s needs, designing, testing, debugging, reviewing code, finding out what the user really needs (that they didn’t tell you the first time), refining the design, building an effective user interface, auditing for security, and so on. It’s a lengthy list.
That “rest of the job” (particularly the “user’s needs” part) is something our industry has never been particularly good at. Design—of the software itself, the user interfaces, and the data representation—is certainly not going away, and isn’t something the current generation of AI is very good at. We’ve come a long way, but I don’t know anyone who hasn’t had to rescue code that was best described as a “seething mass of bits.” Testing and debugging—well, if you’ve played with ChatGPT much, you know that testing and debugging won’t disappear. AIs generate incorrect code, and that’s not going to end soon. Security auditing will only become more important, not less; it’s very hard for a programmer to understand the security implications of code they didn’t write. Spending more time on these things—and leaving the details of pushing out lines of code to an AI—will surely improve the quality of the products we deliver.
Now, let’s take a really long term view. Let’s assume that Matt Welsh is right, and that programming as we know it will disappear—not tomorrow, but sometime in the next 20 years. Does it really disappear? A couple of weeks ago, I showed Tim O’Reilly some of my experiments with Ethan and Lilach Mollick’s prompts for using AI in the classroom. His reaction was “This prompt is really programming.” He’s right. Writing a detailed prompt really is just a different form of programming. You’re still telling a computer what you want it to do, step by step. And I realized that, after spending 20 years complaining that programming hasn’t changed significantly since the 1970s, ChatGPT has suddenly taken that next step. It isn’t a step towards some new paradigm, whether functional, object oriented, or hyperdimensional. I expected the next step in programming languages to be visual, but it isn’t that either. It’s a step towards a new kind of programming that doesn’t require a formally defined syntax or semantics. Programming without virtual punch cards. Programming that doesn’t require you to spend half your time looking up the names and parameters of library functions that you’ve forgotten about.
In the best of all possible worlds, that might bring the time spent actually writing code down to zero, or close to it. But that best case only saves 20% of a programmer’s time. Furthermore, it doesn’t really eliminate programming. It changes it—possibly making programmers more efficient, and definitely giving programmers more time to talk to users, understand the problems they face, and design good, secure systems for solving those problems. Counting lines of code is less important than understanding problems in depth and figuring out how to solve them—but that’s nothing new. Twenty years ago, the Agile Manifesto pointed in this direction, valuing:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
Despite 23 years of “agile practices,” customer collaboration has always been shortchanged. Without engaging with customers and users, Agile quickly collapses to a set of rituals. Will freeing programmers from syntax actually yield more time to collaborate with customers and respond to change? To prepare for this future, programmers will need to learn more about working directly with customers and designing software that meets their needs. That’s an opportunity, not a disaster. Programmers have labored too long under the stigma of being neckbeards who can’t and shouldn’t be allowed to talk to humans. It’s time to reject that stereotype, and to build software as if people mattered.
AI isn’t something to be feared. Writing about OpenAI’s new Code Interpreter plug-in (gradually rolling out now), Ethan Mollick says “My time becomes more valuable, not less, as I can concentrate on what is important, rather than the rote.” AI is something to be learned, tested, and incorporated into programming practices so that programmers can spend more time on what’s really important: understanding and solving problems. The endpoint of this revolution won’t be an unemployment line; it will be better software. The only thing to be feared is failing to make that transition.
Programming isn’t going to go away. It’s going to change, and those changes will be for the better.
Come to FOSSY 2023!
Show Notes: FOSSY 2023 will happen next week in Portland, OR, USA.Send feedback and comments on the cast to <oggcast@faif.us>. You can keep in touch with Free as in Freedom by following Conservancy on on Twitter and and FaiF on Twitter. We are working on setting up a group chat again, too!
Free as in Freedom is produced by Dan Lynch of danlynch.org. Theme music written and performed by Mike Tarantino with Charlie Paxson on drums.
The content of this audcast, and the accompanying show notes and music are licensed under the Creative Commons Attribution-Share-Alike 4.0 license (CC BY-SA 4.0).
A surprising number of the entries for AI are about generative models that don’t generate text or artwork—specifically, they generate human voices or music. Is voice the next frontier for AI? Google’s AudioPaLM, which unites speech recognition, speech synthesis, and language modeling, may show the direction in which AI is heading. There’s also increasing concern about the consequences of training AI on data that was generated by AI. With less input from real humans, does “model collapse” lead to output that is mediocre at best?
AIDoes your company plan to release an AI chatbot, similar to OpenAI’s ChatGPT or Google’s Bard? Doing so means giving the general public a freeform text box for interacting with your AI model.
That doesn’t sound so bad, right? Here’s the catch: for every one of your users who has read a “Here’s how ChatGPT and Midjourney can do half of my job” article, there may be at least one who has read one offering “Here’s how to get AI chatbots to do something nefarious.” They’re posting screencaps as trophies on social media; you’re left scrambling to close the loophole they exploited.
Welcome to your company’s new AI risk management nightmare.
So, what do you do? I’ll share some ideas for mitigation. But first, let’s dig deeper into the problem.
Old Problems Are New AgainThe text-box-and-submit-button combo exists on pretty much every website. It’s been that way since the web form was created roughly thirty years ago. So what’s so scary about putting up a text box so people can engage with your chatbot?
Those 1990s web forms demonstrate the problem all too well. When a person clicked “submit,” the website would pass that form data through some backend code to process it—thereby sending an e-mail, creating an order, or storing a record in a database. That code was too trusting, though. Malicious actors determined that they could craft clever inputs to trick it into doing something unintended, like exposing sensitive database records or deleting information. (The most popular attacks were cross-site scripting and SQL injection, the latter of which is best explained in the story of “Little Bobby Tables.”)
With a chatbot, the web form passes an end-user’s freeform text input—a “prompt,” or a request to act—to a generative AI model. That model creates the response images or text by interpreting the prompt and then replaying (a probabilistic variation of) the patterns it uncovered in its training data.
That leads to three problems:
Why haven’t we seen these problems with other kinds of AI models, then? Because most of those have been deployed in such a way that they are only communicating with trusted internal systems. Or their inputs pass through layers of indirection that structure and limit their shape. Models that accept numeric inputs, for example, might sit behind a filter that only permits the range of values observed in the training data.
What Can You Do?Before you give up on your dreams of releasing an AI chatbot, remember: no risk, no reward.
The core idea of risk management is that you don’t win by saying “no” to everything. You win by understanding the potential problems ahead, then figure out how to steer clear of them. This approach reduces your chances of downside loss while leaving you open to the potential upside gain.
I’ve already described the risks of your company deploying an AI chatbot. The rewards include improvements to your products and services, or streamlined customer service, or the like. You may even get a publicity boost, because just about every other article these days is about how companies are using chatbots.
So let’s talk about some ways to manage that risk and position you for a reward. (Or, at least, position you to limit your losses.)
Spread the word: The first thing you’ll want to do is let people in the company know what you’re doing. It’s tempting to keep your plans under wraps—nobody likes being told to slow down or change course on their special project—but there are several people in your company who can help you steer clear of trouble. And they can do so much more for you if they know about the chatbot long before it is released.
Your company’s Chief Information Security Officer (CISO) and Chief Risk Officer will certainly have ideas. As will your legal team. And maybe even your Chief Financial Officer, PR team, and head of HR, if they have sailed rough seas in the past.
Define a clear terms of service (TOS) and acceptable use policy (AUP): What do you do with the prompts that people type into that text box? Do you ever provide them to law enforcement or other parties for analysis, or feed it back into your model for updates? What guarantees do you make or not make about the quality of the outputs and how people use them? Putting your chatbot’s TOS front-and-center will let people know what to expect before they enter sensitive personal details or even confidential company information. Similarly, an AUP will explain what kinds of prompts are permitted.
(Mind you, these documents will spare you in a court of law in the event something goes wrong. They may not hold up as well in the court of public opinion, as people will accuse you of having buried the important details in the fine print. You’ll want to include plain-language warnings in your sign-up and around the prompt’s entry box so that people can know what to expect.)
Prepare to invest in defense: You’ve allocated a budget to train and deploy the chatbot, sure. How much have you set aside to keep attackers at bay? If the answer is anywhere close to “zero”—that is, if you assume that no one will try to do you harm—you’re setting yourself up for a nasty surprise. At a bare minimum, you will need additional team members to establish defenses between the text box where people enter prompts and the chatbot’s generative AI model. That leads us to the next step.
Keep an eye on the model: Longtime readers will be familiar with my catchphrase, “Never let the machines run unattended.” An AI model is not self-aware, so it doesn’t know when it’s operating out of its depth. It’s up to you to filter out bad inputs before they induce the model to misbehave.
You’ll also need to review samples of the prompts supplied by end-users (there’s your TOS calling) and the results returned by the backing AI model. This is one way to catch the small cracks before the dam bursts. A spike in a certain prompt, for example, could imply that someone has found a weakness and they’ve shared it with others.
Be your own adversary: Since outside actors will try to break the chatbot, why not give some insiders a try? Red-team exercises can uncover weaknesses in the system while it’s still under development.
This may seem like an invitation for your teammates to attack your work. That’s because it is. Better to have a “friendly” attacker uncover problems before an outsider does, no?
Narrow the scope of audience: A chatbot that’s open to a very specific set of users—say, “licensed medical practitioners who must prove their identity to sign up and who use 2FA to login to the service”—will be tougher for random attackers to access. (Not impossible, but definitely tougher.) It should also see fewer hack attempts by the registered users because they’re not looking for a joyride; they’re using the tool to complete a specific job.
Build the model from scratch (to narrow the scope of training data): You may be able to extend an existing, general-purpose AI model with your own data (through an ML technique called transfer learning). This approach will shorten your time-to-market, but also leave you to question what went into the original training data. Building your own model from scratch gives you complete control over the training data, and therefore, additional influence (though, not “control”) over the chatbot’s outputs.
This highlights an added value in training on a domain-specific dataset: it’s unlikely that anyone would, say, trick the finance-themed chatbot BloombergGPT into revealing the secret recipe for Coca-Cola or instructions for acquiring illicit substances. The model can’t reveal what it doesn’t know.
Training your own model from scratch is, admittedly, an extreme option. Right now this approach requires a combination of technical expertise and compute resources that are out of most companies’ reach. But if you want to deploy a custom chatbot and are highly sensitive to reputation risk, this option is worth a look.
Slow down: Companies are caving to pressure from boards, shareholders, and sometimes internal stakeholders to release an AI chatbot. This is the time to remind them that a broken chatbot released this morning can be a PR nightmare before lunchtime. Why not take the extra time to test for problems?
OnwardThanks to its freeform input and output, an AI-based chatbot exposes you to additional risks above and beyond using other kinds of AI models. People who are bored, mischievous, or looking for fame will try to break your chatbot just to see whether they can. (Chatbots are extra tempting right now because they are novel, and “corporate chatbot says weird things” makes for a particularly humorous trophy to share on social media.)
By assessing the risks and proactively developing mitigation strategies, you can reduce the chances that attackers will convince your chatbot to give them bragging rights.
I emphasize the term “reduce” here. As your CISO will tell you, there’s no such thing as a “100% secure” system. What you want to do is close off the easy access for the amateurs, and at least give the hardened professionals a challenge.
Many thanks to Chris Butler and Michael S. Manley for reviewing (and dramatically improving) early drafts of this article. Any rough edges that remain are mine.
Understanding AI’s mysterious “opaque box” is paramount to creating explainable AI. This can be simplified by considering that AI, like all other technology, has a supply chain. Knowing what makes up the supply chain is critical to enforcing the security of the AI system, establishing trust with the consumer of the AI’s output, and protecting your organization from undue risk.
When pondering your approach to dissecting AI’s supply chain, consider how production, shipping, delivery, and invoicing are steps in just about any supply chain, for everything that you use, from toothpaste to technology. AI models are also created and delivered via supply chains.
Some of the steps in AI’s type of supply chain can be tricky to follow, with special gotchas like technology company trade secrets, closed code, and program synthesis—which is the process of AI writing its own code to improve itself. Combined with continuous machine learning cycles and deployments, reviews, and recalls, there are a lot of opportunities to bring transparency to the opaque box.
Blockchain technology is chosen by companies like Walmart to bring transparency to supply chains like food production and delivery, because it is tamper evident and distributed. Blockchain technology is used in an enterprise stack alongside other systems, to make integrations more secure and to establish a single audit trail. Verification, including that of the identity of all participants in a blockchain network, and compliance are woven throughout the workflow and processes.
Typically, an enterprise blockchain audit trail will consist of linked blocks containing transactions that reference hyperlinks to data that is stored off chain in traditional databases. Meanwhile, the system creates a cryptographic verification of that data and stores the verification on blockchain, which is comparable to the traditional process of providing a checksum to ensure integrity of a file download. If data on the blockchain network ever undergoes tampering, the cryptographic hash used as verification will no longer compute to the same value.
When you dissect AI’s supply chain, at the root, you will find algorithms. These are the mathematical formulas written to simulate functions of the brain, which underlie the AI programming. The algorithms are compiled into code libraries, and then distributed to AI developers who use them to write custom AI models. Meanwhile, a data scientist acquires and prepares training data, which is then used to bring the AI model to life.
University of Baltimore Law Professor Michelle Gillman, who fights to help people who were automatically denied benefits, recently spoke with NBC about the importance of understanding the origin of algorithms when managing AI risk. According to Gillman, whose clients often face life and death situations that are being decided by AI, “I’ve been in hearings where no one in the room can describe for me, how does the algorithm work? What factors does it weigh? How does it weigh those factors? And so you are really left unable to make a case for your client in those circumstances.”
Next, a workflow begins that implements an AI engineering and machine learning operations (MLOps) process, in which cycles of experiments and deployments are conducted, and the AI model, its data, and the variables, or hyperparameters of the experiment are tested, tweaked, and improved. This part of the supply chain keeps going in a cycle even after delivery to the consumer, since the training and improvement process is generally continuous. The consumer’s input in the form of reviews and ratings becomes part of the process to improve the model. The stakeholders of the project, such as the management of the organization that built the AI model, may also add their input and follow through to make sure it is considered.
If an organization is large, the AI model’s supply chain can involve extended teams and even multiple organizations. Or, it is entirely possible that by using cloud services and AI marketplaces, a single developer can perform all of these functions alone. In any case, you can add an enterprise blockchain technology, like Hyperledger Fabric, to the stack so you can track, trace, audit, and even recall your model.
An enterprise blockchain network is sometimes used to bring transparency to the supply chain. This helps network participants trust one another because they are members of the same blockchain network. The blockchain network is also really helpful when something goes wrong and a product needs to be quickly traced to its origin.
In the case of Walmart, they pioneered the use of enterprise blockchain to track and trace food that potentially carried a foodborne illness. For example, if a customer became sick from a package of sliced mangoes in any Walmart store, the mangoes had to be discarded at all of the stores because it took more than 6 days to trace the affected shipment. The new blockchain network cut this time to 2.2 seconds, saving Walmart the expense of discarding good mangoes. Walmart continues with their supply chain blockchain strategy today, which has become the foundation of automated payment systems for their many suppliers.
When this strategy is applied to AI’s opaque box, the convenience of a supply chain blockchain network will help you to track and trace important factors like the reason why an AI model’s intent or domain has drifted, or to learn what type of treatment was given to data that was used to produce a certain outcome. As explained in the O’Reilly book I co-authored, Blockchain Tethered AI, there are four blockchain controls for AI, which are:
This ability to track and trace can also be extended to the consumer, through the display of trust logos. The concept of trust logos, which are the long-time hallmark of e-commerce security, can be applied to AI by connecting the logos to the underlying blockchain network, and programming them to alert consumers should the AI model become compromised. A similar method could be used to show whether a customer service representative is an AI or a human.
Keep in mind that people in different roles may need different types of information in order to trust AI models. Depending on the perspective of the entity requesting the information, different levels of traceability could be desired. For example, a person answering their phone should be able to see an indicator as to whether a caller is AI, and whether or not the AI is from a trustworthy source. An engineer deciding whether or not to integrate AI components with their models would want a much deeper understanding of the supply chain, and a stakeholder might want to see if the reviews and comments are authentic and find out what is being done to address any recalls. This also brings up the question of a special handshake to enable AI models to trust one another and establish boundaries.
Even though you might not know everything about your AI model, you can commit the facts you do know to blockchain. Develop an AI factsheet as described in Chapter 1 of Blockchain Tethered AI. If you have used models that you have downloaded from marketplaces, you can typically find an AI model card and data cards that provide basic facts about the materials you are using. Also, you can always document that a part of the model is indeed “opaque,” and complete that part later once the details are known.
You can implement your blockchain network for your AI model’s supply chain in the same way that enterprise blockchain networks are used by developers for other purposes. You only need to record cryptographic verifications on your blockchain network, while storing the actual components of the AI off-chain. The code that comes with Blockchain Tethered AI can help you to visualize and implement this architecture.
This blockchain verification, which works similar to a checksum you might see when downloading a file, can be checked against the model and its components at any point to see if they have undergone any tampering. This type of use of a blockchain network doesn’t involve cryptocurrency or miners or use any unusually high amounts of energy to run, and should be thought of instead as a distributed text-based super-log that is automated by smart contracts.
Being able to track and trace goods in this way helps prevent sales of counterfeit goods, helps food companies to recall items quickly without having to throw everything away, and helps artists, musicians, and content creators be paid for their work. When applying these techniques and controls to make AI’s opaque box explainable, your AI models will enjoy the competitive advantage of being trackable, traceable, controllable, and even stoppable.
“Mitigating the risk of extinction from A.I. should be a global priority alongside other societal-scale risks, such as pandemics and nuclear war,” according to a statement signed by more than 350 business and technical leaders, including the developers of today’s most important AI platforms.
Among the possible risks leading to that outcome is what is known as “the alignment problem.” Will a future super-intelligent AI share human values, or might it consider us an obstacle to fulfilling its own goals? And even if AI is still subject to our wishes, might its creators—or its users—make an ill-considered wish whose consequences turn out to be catastrophic, like the wish of fabled King Midas that everything he touches turn to gold? Oxford philosopher Nick Bostrom, author of the book Superintelligence, once posited as a thought experiment an AI-managed factory given the command to optimize the production of paperclips. The “paperclip maximizer” comes to monopolize the world’s resources and eventually decides that humans are in the way of its master objective.
Far-fetched as that sounds, the alignment problem is not just a far future consideration. We have already created a race of paperclip maximizers. Science fiction writer Charlie Stross has noted that today’s corporations can be thought of as “slow AIs.” And much as Bostrom feared, we have given them an overriding command: to increase corporate profits and shareholder value. The consequences, like those of Midas’s touch, aren’t pretty. Humans are seen as a cost to be eliminated. Efficiency, not human flourishing, is maximized.
In pursuit of this overriding goal, our fossil fuel companies continue to deny climate change and hinder attempts to switch to alternative energy sources, drug companies peddle opioids, and food companies encourage obesity. Even once-idealistic internet companies have been unable to resist the master objective, and in pursuing it have created addictive products of their own, sown disinformation and division, and resisted attempts to restrain their behavior.
Even if this analogy seems far fetched to you, it should give you pause when you think about the problems of AI governance.
Corporations are nominally under human control, with human executives and governing boards responsible for strategic direction and decision-making. Humans are “in the loop,” and generally speaking, they make efforts to restrain the machine, but as the examples above show, they often fail, with disastrous results. The efforts at human control are hobbled because we have given the humans the same reward function as the machine they are asked to govern: we compensate executives, board members, and other key employees with options to profit richly from the stock whose value the corporation is tasked with maximizing. Attempts to add environmental, social, and governance (ESG) constraints have had only limited impact. As long as the master objective remains in place, ESG too often remains something of an afterthought.
Much as we fear a superintelligent AI might do, our corporations resist oversight and regulation. Purdue Pharma successfully lobbied regulators to limit the risk warnings planned for doctors prescribing Oxycontin and marketed this dangerous drug as non-addictive. While Purdue eventually paid a price for its misdeeds, the damage had largely been done and the opioid epidemic rages unabated.
What might we learn about AI regulation from failures of corporate governance?
OpenAI CEO Sam Altman has begged for government regulation, but tellingly, has suggested that such regulation apply only to future, more powerful versions of AI. This is a mistake. There is much that can be done right now.
We should require registration of all AI models above a certain level of power, much as we require corporate registration. And we should define current best practices in the management of AI systems and make them mandatory, subject to regular, consistent disclosures and auditing, much as we require public companies to regularly disclose their financials.
The work that Timnit Gebru, Margaret Mitchell, and their coauthors have done on the disclosure of training data (“Datasheets for Datasets”) and the performance characteristics and risks of trained AI models (“Model Cards for Model Reporting”) are a good first draft of something much like the Generally Accepted Accounting Principles (and their equivalent in other countries) that guide US financial reporting. Might we call them “Generally Accepted AI Management Principles”?
It’s essential that these principles be created in close cooperation with the creators of AI systems, so that they reflect actual best practice rather than a set of rules imposed from without by regulators and advocates. But they can’t be developed solely by the tech companies themselves. In his book Voices in the Code, James G. Robinson (now Director of Policy for OpenAI) points out that every algorithm makes moral choices, and explains why those choices must be hammered out in a participatory and accountable process. There is no perfectly efficient algorithm that gets everything right. Listening to the voices of those affected can radically change our understanding of the outcomes we are seeking.
But there’s another factor too. OpenAI has said that “Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent.” Yet many of the world’s ills are the result of the difference between stated human values and the intent expressed by actual human choices and actions. Justice, fairness, equity, respect for truth, and long-term thinking are all in short supply. An AI model such as GPT4 has been trained on a vast corpus of human speech, a record of humanity’s thoughts and feelings. It is a mirror. The biases that we see there are our own. We need to look deeply into that mirror, and if we don’t like what we see, we need to change ourselves, not just adjust the mirror so it shows us a more pleasing picture!
To be sure, we don’t want AI models to be spouting hatred and misinformation, but simply fixing the output is insufficient. We have to reconsider the input—both in the training data and in the prompting. The quest for effective AI governance is an opportunity to interrogate our values and to remake our society in line with the values we choose. The design of an AI that will not destroy us may be the very thing that saves us in the end.
The world changed on November 30, 2022 as surely as it did on August 12, 1908 when the first Model T left the Ford assembly line. That was the date when OpenAI released ChatGPT, the day that AI emerged from research labs into an unsuspecting world. Within two months, ChatGPT had over a hundred million users—faster adoption than any technology in history.
The hand wringing soon began. Most notably, The Future of Life Institute published an open letter calling for an immediate pause in advanced AI research, asking: “Should we let machines flood our information channels with propaganda and untruth? Should we automate away all the jobs, including the fulfilling ones? Should we develop nonhuman minds that might eventually outnumber, outsmart, obsolete and replace us? Should we risk loss of control of our civilization?”
In response, the Association for the Advancement of Artificial Intelligence published its own letter citing the many positive differences that AI is already making in our lives and noting existing efforts to improve AI safety and to understand its impacts. Indeed, there are important ongoing gatherings about AI regulation like the Partnership on AI’s recent convening on Responsible Generative AI, which happened just this past week. The UK has already announced its intention to regulate AI, albeit with a light, “pro-innovation” touch. In the US, Senate Minority Leader Charles Schumer has announced plans to introduce “a framework that outlines a new regulatory regime” for AI. The EU is sure to follow, in the worst case leading to a patchwork of conflicting regulations.
All of these efforts reflect the general consensus that regulations should address issues like data privacy and ownership, bias and fairness, transparency, accountability, and standards. OpenAI’s own AI safety and responsibility guidelines cite those same goals, but in addition call out what many people consider the central, most general question: how do we align AI-based decisions with human values? They write:
“AI systems are becoming a part of everyday life. The key is to ensure that these machines are aligned with human intentions and values.”
But whose human values? Those of the benevolent idealists that most AI critics aspire to be? Those of a public company bound to put shareholder value ahead of customers, suppliers, and society as a whole? Those of criminals or rogue states bent on causing harm to others? Those of someone well meaning who, like Aladdin, expresses an ill-considered wish to an all-powerful AI genie?
There is no simple way to solve the alignment problem. But alignment will be impossible without robust institutions for disclosure and auditing. If we want prosocial outcomes, we need to design and report on the metrics that explicitly aim for those outcomes and measure the extent to which they have been achieved. That is a crucial first step, and we should take it immediately. These systems are still very much under human control. For now, at least, they do what they are told, and when the results don’t match expectations, their training is quickly improved. What we need to know is what they are being told.
What should be disclosed? There is an important lesson for both companies and regulators in the rules by which corporations—which science-fiction writer Charlie Stross has memorably called “slow AIs”—are regulated. One way we hold companies accountable is by requiring them to share their financial results compliant with Generally Accepted Accounting Principles or the International Financial Reporting Standards. If every company had a different way of reporting its finances, it would be impossible to regulate them.
Today, we have dozens of organizations that publish AI principles, but they provide little detailed guidance. They all say things like “Maintain user privacy” and “Avoid unfair bias” but they don’t say exactly under what circumstances companies gather facial images from surveillance cameras, and what they do if there is a disparity in accuracy by skin color. Today, when disclosures happen, they are haphazard and inconsistent, sometimes appearing in research papers, sometimes in earnings calls, and sometimes from whistleblowers. It is almost impossible to compare what is being done now with what was done in the past or what might be done in the future. Companies cite user privacy concerns, trade secrets, the complexity of the system, and various other reasons for limiting disclosures. Instead, they provide only general assurances about their commitment to safe and responsible AI. This is unacceptable.
Imagine, for a moment, if the standards that guide financial reporting simply said that companies must accurately reflect their true financial condition without specifying in detail what that reporting must cover and what “true financial condition” means. Instead, independent standards bodies such as the Financial Accounting Standards Board, which created and oversees GAAP, specify those things in excruciating detail. Regulatory agencies such as the Securities and Exchange Commission then require public companies to file reports according to GAAP, and auditing firms are hired to review and attest to the accuracy of those reports.
So too with AI safety. What we need is something equivalent to GAAP for AI and algorithmic systems more generally. Might we call it the Generally Accepted AI Principles? We need an independent standards body to oversee the standards, regulatory agencies equivalent to the SEC and ESMA to enforce them, and an ecosystem of auditors that is empowered to dig in and make sure that companies and their products are making accurate disclosures.
But if we are to create GAAP for AI, there is a lesson to be learned from the evolution of GAAP itself. The systems of accounting that we take for granted today and use to hold companies accountable were originally developed by medieval merchants for their own use. They were not imposed from without, but were adopted because they allowed merchants to track and manage their own trading ventures. They are universally used by businesses today for the same reason.
So, what better place to start with developing regulations for AI than with the management and control frameworks used by the companies that are developing and deploying advanced AI systems?
The creators of generative AI systems and Large Language Models already have tools for monitoring, modifying, and optimizing them. Techniques such as RLHF (“Reinforcement Learning from Human Feedback”) are used to train models to avoid bias, hate speech, and other forms of bad behavior. The companies are collecting massive amounts of data on how people use these systems. And they are stress testing and “red teaming” them to uncover vulnerabilities. They are post-processing the output, building safety layers, and have begun to harden their systems against “adversarial prompting” and other attempts to subvert the controls they have put in place. But exactly how this stress testing, post processing, and hardening works—or doesn’t—is mostly invisible to regulators.
Regulators should start by formalizing and requiring detailed disclosure about the measurement and control methods already used by those developing and operating advanced AI systems.
In the absence of operational detail from those who actually create and manage advanced AI systems, we run the risk that regulators and advocacy groups “hallucinate” much like Large Language Models do, and fill the gaps in their knowledge with seemingly plausible but impractical ideas.
Companies creating advanced AI should work together to formulate a comprehensive set of operating metrics that can be reported regularly and consistently to regulators and the public, as well as a process for updating those metrics as new best practices emerge.
What we need is an ongoing process by which the creators of AI models fully, regularly, and consistently disclose the metrics that they themselves use to manage and improve their services and to prohibit misuse. Then, as best practices are developed, we need regulators to formalize and require them, much as accounting regulations have formalized the tools that companies already used to manage, control, and improve their finances. It’s not always comfortable to disclose your numbers, but mandated disclosures have proven to be a powerful tool for making sure that companies are actually following best practices.
It is in the interests of the companies developing advanced AI to disclose the methods by which they control AI and the metrics they use to measure success, and to work with their peers on standards for this disclosure. Like the regular financial reporting required of corporations, this reporting must be regular and consistent. But unlike financial disclosures, which are generally mandated only for publicly traded companies, we likely need AI disclosure requirements to apply to much smaller companies as well.
Disclosures should not be limited to the quarterly and annual reports required in finance. For example, AI safety researcher Heather Frase has argued that “a public ledger should be created to report incidents arising from large language models, similar to cyber security or consumer fraud reporting systems.” There should also be dynamic information sharing such as is found in anti-spam systems.
It might also be worthwhile to enable testing by an outside lab to confirm that best practices are being met and what to do when they are not. One interesting historical parallel for product testing may be found in the certification of fire safety and electrical devices by an outside non-profit auditor, Underwriter’s Laboratory. UL certification is not required, but it is widely adopted because it increases consumer trust.
This is not to say that there may not be regulatory imperatives for cutting-edge AI technologies that are outside the existing management frameworks for these systems. Some systems and use cases are riskier than others. National security considerations are a good example. Especially with small LLMs that can be run on a laptop, there is a risk of an irreversible and uncontrollable proliferation of technologies that are still poorly understood. This is what Jeff Bezos has referred to as a “one way door,” a decision that, once made, is very hard to undo. One way decisions require far deeper consideration, and may require regulation from without that runs ahead of existing industry practices.
Furthermore, as Peter Norvig of the Stanford Institute for Human Centered AI noted in a review of a draft of this piece, “We think of ‘Human-Centered AI’ as having three spheres: the user (e.g., for a release-on-bail recommendation system, the user is the judge); the stakeholders (e.g., the accused and their family, plus the victim and family of past or potential future crime); the society at large (e.g. as affected by mass incarceration).”
Princeton computer science professor Arvind Narayanan has noted that these systemic harms to society that transcend the harms to individuals require a much longer term view and broader schemes of measurement than those typically carried out inside corporations. But despite the prognostications of groups such as the Future of Life Institute, which penned the AI Pause letter, it is usually difficult to anticipate these harms in advance. Would an “assembly line pause” in 1908 have led us to anticipate the massive social changes that 20th century industrial production was about to unleash on the world? Would such a pause have made us better or worse off?
Given the radical uncertainty about the progress and impact of AI, we are better served by mandating transparency and building institutions for enforcing accountability than we are in trying to head off every imagined particular harm.
We shouldn’t wait to regulate these systems until they have run amok. But nor should regulators overreact to AI alarmism in the press. Regulations should first focus on disclosure of current monitoring and best practices. In that way, companies, regulators, and guardians of the public interest can learn together how these systems work, how best they can be managed, and what the systemic risks really might be.
A few months ago, I wrote about some experiments with prime numbers. I generated a 16-digit non-prime number by multiplying two 8-digit prime numbers, and asked ChatGPT (using GPT -3.5) whether the larger number was prime. It answered correctly that the number was non-prime, but when it told me the number’s prime factors, it was clearly wrong. It also generated a short program that implemented the widely used Miller-Rabin primality test. After fixing some obvious errors, I ran the program–and while it told me (correctly) that my number was non-prime, when compared to a known good implementation of Miller-Rabin, ChatGPT’s code made many mistakes. When it became available, GPT-4 gave me similar results. And the result itself–well, that could have been a good guess. There’s a roughly a 97% chance that a randomly chosen 16-digit number will be non-prime.
OpenAI recently opened their long-awaited Plugins feature to users of ChatGPT Plus (the paid version) using the GPT-4 model. One of the first plugins was from Wolfram, the makers of Mathematica and Wolfram Alpha. I had to try this! Specifically, I was compelled to re-try my prime test. And everything worked: ChatGPT sent the problem to Wolfram, it determined that number was not prime, and gave me the correct prime factors. It didn’t generate any code, but provided a link to the Wolfram Alpha result page that described how to test for primality. The process of going through ChatGPT to Wolfram and back was also painfully slow, much slower than using Wolfram Alpha directly or writing a few lines of Python. But it worked and, for fans of prime numbers, that’s a plus.
I was still uncomfortable. How does ChatGPT decide what to offload to Wolfram Alpha, and what to handle on its own? I tried a few questions from calculus; unsurprisingly, they went to Wolfram. Then I got really simple: “How much is 3 + 5?” No Wolfram, and I wasn’t surprised when ChatGPT told me the answer was 8. But that begged the question: what about more complex arithmetic? So I asked “How much is 123456789 + 98776543321?”, a problem that could be solved by any elementary school student who has learned how to carry. Again, no Wolfram, but this time, the answer was incorrect.
We’ve long known that ChatGPT was poor at arithmetic, in addition to being poor at more advanced math. The Wolfram plugin solves the math problem with ease. However, ChatGPT is still poor at arithmetic, and still attempts to do arithmetic on its own. The important question that I can’t answer is “when does a problem become complex enough to send to the plugin?” The plugin is a big win, but not an unqualified one.
ChatGPT’s tendency to make up citations is another well-known problem. A few weeks ago, a story circulated about a lawyer who used ChatGPT to write a brief. ChatGPT cited a lot of case law, but made up all the citations. When a judge asked him to produce the actual case law, the lawyer went back to ChatGPT–which obediently made up the cases themselves. The judge was not pleased. That raises another question: ChatGPT has always been prone to making up citations–but now there’s a plugin for that! The ScholarAI plugin searches academic databases for citations, and returns links. That wouldn’t have helped this lawyer (I don’t yet see plugins from Westlaw or LexisNexis), but it’s worth asking: what about citations?
I first tried asking a medical question. I’m not a doctor, so the question was simple: what’s the latest research on antibiotic-resistant bacteria? ChatGPT sent the question to ScholarAI, and I got back a long list of relevant citations. (The plugin appeared to get into a loop, so I eventually terminated the output.) While I’m not competent to evaluate the quality or relevance of the papers, all the links were valid: the papers were real, and the author names were correct. No hallucinations here.
I followed up with some questions about English literature (I have a PhD, so I can make up real questions). I didn’t get as many citations in return, possibly because we don’t have preprint servers like ArXiv, and have done little to protest journals’ proprietary lock on scholarship. However, the citations I got were valid: real books and articles, with the authors listed correctly.
That begged another question, though. A list of articles is certainly useful, but you still have to read them all to write the paper. Could ChatGPT write an essay for me? I asked it to write about colonialism in the work of Salman Rushdie, and got a passable short essay. It is what I’d call a “lazy” prompt: what I’d expect from a student who was interested in getting out of work, rather than using the AI to learn. There were citations, and they were real; ChatGPT didn’t link to the publications cited, but Google made it easy to find them. The resulting essay didn’t demonstrate any familiarity with the articles beyond the abstract–fair enough, since for most of the sources, the abstract was all that was publicly available. More to the point, the article didn’t really make any connections to Rushdie’s fiction. There were many sentences like this: “Hamish Dalley discusses the role of the historical novel in postcolonial writing, a genre to which many of Rushdie’s works belong.” True, but that doesn’t say much about either Rushdie’s work or Dalley’s. As I said, the essay was passable, but if I had to grade it, the student who turned it in wouldn’t have been happy. Still, ChatGPT and ScholarAI get credit for doing a decent literature search that could be the basis for an excellent paper. And if a student took this initial prompt, read the academic articles along with Rushdie’s novels, and used that to write a more detailed prompt telling ChatGPT exactly what points he wanted to make, with relevant quotations, the result could have been excellent. An essay isn’t an exercise in providing N*1000 words; it’s the outcome of a thought process that involves engaging with the subject matter. If ChatGPT and ScholarAI facilitate that engagement, I wouldn’t object. But let’s be clear: regardless of who generates the words, ChatGPT’s users still have to do the reading and thinking.
As with the Wolfram plugin, it’s helpful to understand when ChatGPT is using ScholarAI, and when it isn’t. I asked ChatGPT to find articles by me; when using the plugin, it couldn’t find any, although it apologetically gave me a list of articles whose authors had the first name Michael. The sad list of Michael-authored articles notwithstanding, I’ll count that response as “correct.” I haven’t published any academic papers, though I have published a lot on O’Reilly Radar–material that any web search can find, without the need for AI or the risk of hallucination.
If you dig a bit deeper, the results are puzzling. If you use ChatGPT with plugins enabled and write a prompt that tells it not to use the plugin, it comes up empty, but suggests that you research online databases like Google Scholar. If you start a new conversation and do not enable plugins (plugins can only be enabled or disabled at the start of a conversation), you still get nothing–but ChatGPT does tell you that Michael Loukides is a well-known author who has frequently written for O’Reilly, and to check on the O’Reilly website for articles. (It isn’t clear whether these different responses have to do with the state of the plugin, or the way ChatGPT randomizes its output.) Flattery will get you somewhere, I suppose, but not very far. My publication history with O’Reilly goes back to the 1990s, and is all public; it’s not clear why ChatGPT is unaware of it. Starting a new conversation with Bing searches enabled got me a list of valid links to articles that I’ve written–but I shouldn’t have had to try three times, the process was much slower than searching with Bing (or Google) directly, and it wasn’t clear why some articles were included and some weren’t. And you really do have to try multiple times: you can’t use both Bing searches and plugins in the same conversation.
As with the Wolfram plugin, ScholarAI is a big improvement–but again, not an unqualified one. You still have to know whether the content you’re looking for is in an academic journal, on the web, or somewhere else. While ChatGPT tells you when it is using a plugin, and which plugin it is using, you can’t always predict what it will do in advance–and when it doesn’t use a plugin, ChatGPT is vulnerable to the same errors we’ve come to expect. You still have to experiment, and you still have to check the results.
As another test, I used the Kayak plugin to check out flights for some trips I might take. The plugin does a good job with major airports (including smaller ones), though it seemed to be hit-or-miss with very small airports, like New Haven (HVN). That’s a limitation of Kayak, rather than the plugin itself or ChatGPT. You currently have to enable the plugins you’re going to use at the start of each conversation, and ChatGPT doesn’t allow you to enable competing plugins. You can install both Kayak and Expedia, but you can only use one in any chat. I wouldn’t be surprised if this behavior changes as plugins mature.
Finally: all the plugins I installed were free of charge. However, I don’t think it’s called the “plugin store” for nothing. It wouldn’t surprise me to see charges for plugins, and I would be surprised if some plugins eventually require a subscription to a paid account. A number of the plugins access subscription-based services; I expect that subscriptions will be required once we are out of the Beta period.
I’m excited that plugins have finally arrived. Plugins are still in beta, so their behavior will almost certainly change; the behaviors I’ve described may have changed by the time you read this. Several changed while I was writing this article. Plugins certainly don’t eliminate the need to be careful about hallucinations and other kinds of errors, nor do they replace the need for thinking. But it’s hard to understate how important it is that ChatGPT can now reach out and access current data. When ChatGPT was limited to data before November 2021, it was an intriguing toy. It’s looking more and more like a tool.
AI, and specifically large language models, continue to dominate the news–so much so that it’s no longer a well-defined topic, with clear boundaries. AI has infiltrated programming, security, and virtually every branch of technology.
But that’s hardly news. By the time you read this, Apple may have announced their ultra-expensive, ultra-stylish AR headset; that may be enough to break the Metaverse out of its AR/VR winter. Or maybe not. If Apple can’t make technology into a fashion statement, no one can. And Rust has forked, spawning a new programming language called Crab. Will this slow Rust’s momentum? It is very hard to say. Many projects have forked and few forks thrive, but there are exceptions.
AI