You are here

Feed aggregator

The Real Problem with Software Development

O'Reilly Radar - Tue, 2023/09/12 - 04:01

A few weeks ago, I saw a tweet that said “Writing code isn’t the problem. Controlling complexity is.” I wish I could remember who said that; I will be quoting it a lot in the future. That statement nicely summarizes what makes software development difficult. It’s not just memorizing the syntactic details of some programming language, or the many functions in some API, but understanding and managing the complexity of the problem you’re trying to solve.

We’ve all seen this many times. Lots of applications and tools start simple. They do 80% of the job well, maybe 90%. But that isn’t quite enough. Version 1.1 gets a few more features, more creep into version 1.2, and by the time you get to 3.0, an elegant user interface has turned into a mess. This increase in complexity is one reason that applications tend to become less useable over time. We also see this phenomenon as one application replaces another. RCS was useful, but didn’t do everything we needed it to; SVN was better; Git does just about everything you could want, but at an enormous cost in complexity. (Could Git’s complexity be managed better? I’m not the one to say.) OS X, which used to trumpet “It just works,” has evolved to “it used to just work”; the most user-centric Unix-like system ever built now staggers under the load of new and poorly thought-out features.

The problem of complexity isn’t limited to user interfaces; that may be the least important (though most visible) aspect of the problem. Anyone who works in programming has seen the source code for some project evolve from something short, sweet, and clean to a seething mass of bits. (These days, it’s often a seething mass of distributed bits.) Some of that evolution is driven by an increasingly complex world that requires attention to secure programming, cloud deployment, and other issues that didn’t exist a few decades ago. But even here: a requirement like security tends to make code more complex—but complexity itself hides security issues. Saying “yes, adding security made the code more complex” is wrong on several fronts. Security that’s added as an afterthought almost always fails. Designing security in from the start almost always leads to a simpler result than bolting security on as an afterthought, and the complexity will stay manageable if new features and security grow together. If we’re serious about complexity, the complexity of building secure systems needs to be managed and controlled in step with the rest of the software, otherwise it’s going to add more vulnerabilities.

That brings me to my main point. We’re seeing more code that’s written (at least in first draft) by generative AI tools, such as GitHub Copilot, ChatGPT (especially with Code Interpreter), and Google Codey. One advantage of computers, of course, is that they don’t care about complexity. But that advantage is also a significant disadvantage. Until AI systems can generate code as reliably as our current generation of compilers, humans will need to understand—and debug—the code they write. Brian Kernighan wrote that “Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?” We don’t want a future that consists of code too clever to be debugged by humans—at least not until the AIs are ready to do that debugging for us. Really brilliant programmers write code that finds a way out of the complexity: code that may be a little longer, a little clearer, a little less clever so that someone can understand it later. (Copilot running in VSCode has a button that simplifies code, but its capabilities are limited.)

Furthermore, when we’re considering complexity, we’re not just talking about individual lines of code and individual functions or methods. Most professional programmers work on large systems that can consist of thousands of functions and millions of lines of code. That code may take the form of dozens of microservices running as asynchronous processes and communicating over a network. What is the overall structure, the overall architecture, of these programs? How are they kept simple and manageable? How do you think about complexity when writing or maintaining software that may outlive its developers? Millions of lines of legacy code going back as far as the 1960s and 1970s are still in use, much of it written in languages that are no longer popular. How do we control complexity when working with these?

Humans don’t manage this kind of complexity well, but that doesn’t mean we can check out and forget about it. Over the years, we’ve gradually gotten better at managing complexity. Software architecture is a distinct specialty that has only become more important over time. It’s growing more important as systems grow larger and more complex, as we rely on them to automate more tasks, and as those systems need to scale to dimensions that were almost unimaginable a few decades ago. Reducing the complexity of modern software systems is a problem that humans can solve—and I haven’t yet seen evidence that generative AI can. Strictly speaking, that’s not a question that can even be asked yet. Claude 2 has a maximum context—the upper limit on the amount of text it can consider at one time—of 100,000 tokens1; at this time, all other large language models are significantly smaller. While 100,000 tokens is huge, it’s much smaller than the source code for even a moderately sized piece of enterprise software. And while you don’t have to understand every line of code to do a high-level design for a software system, you do have to manage a lot of information: specifications, user stories, protocols, constraints, legacies and much more. Is a language model up to that?

Could we even describe the goal of “managing complexity” in a prompt? A few years ago, many developers thought that minimizing “lines of code” was the key to simplification—and it would be easy to tell ChatGPT to solve a problem in as few lines of code as possible. But that’s not really how the world works, not now, and not back in 2007. Minimizing lines of code sometimes leads to simplicity, but just as often leads to complex incantations that pack multiple ideas onto the same line, often relying on undocumented side effects. That’s not how to manage complexity. Mantras like DRY (Don’t Repeat Yourself) are often useful (as is most of the advice in The Pragmatic Programmer), but I’ve made the mistake of writing code that was overly complex to eliminate one of two very similar functions. Less repetition, but the result was more complex and harder to understand. Lines of code are easy to count, but if that’s your only metric, you will lose track of qualities like readability that may be more important. Any engineer knows that design is all about tradeoffs—in this case, trading off repetition against complexity—but difficult as these tradeoffs may be for humans, it isn’t clear to me that generative AI can make them any better, if at all.

I’m not arguing that generative AI doesn’t have a role in software development. It certainly does. Tools that can write code are certainly useful: they save us looking up the details of library functions in reference manuals, they save us from remembering the syntactic details of the less commonly used abstractions in our favorite programming languages. As long as we don’t let our own mental muscles decay, we’ll be ahead. I am arguing that we can’t get so tied up in automatic code generation that we forget about controlling complexity. Large language models don’t help with that now, though they might in the future. If they free us to spend more time understanding and solving the higher-level problems of complexity, though, that will be a significant gain.

Will the day come when a large language model will be able to write a million line enterprise program? Probably. But someone will have to write the prompt telling it what to do. And that person will be faced with the problem that has characterized programming from the start: understanding complexity, knowing where it’s unavoidable, and controlling it.

Footnotes
  1. It’s common to say that a token is approximately ⅘ of a word. It’s not clear how that applies to source code, though. It’s also common to say that 100,000 words is the size of a novel, but that’s only true for rather short novels.
Categories: Technology

Radar Trends to Watch: September 2023

O'Reilly Radar - Tue, 2023/09/05 - 03:14

While the AI group is still the largest, it’s notable that Programming, Web, and Security are all larger than they’ve been in recent months. One reason is certainly that we’re pushing AI news into other categories as appropriate. But I also think that it’s harder to impress with AI than it used to be. AI discussions have been much more about regulation and intellectual property—which makes me wonder whether legislation should be a separate category.

That notwithstanding, it’s important that OpenAI is now allowing API users to fine-tune their GPT-4 apps. It’s as-a-service, of course. And RISC-V finally appears to be getting some serious adoption. Could it compete with Atom and Intel? We shall see.

AI
  • OpenAI has announced ChatGPT Enterprise, a version of ChatGPT that targets enterprise customers. ChatGPT Enterprise offers improved security, a promise that they won’t train on your conversations, single sign on, an admin console, a larger 32K context, higher performance, and the elimination of usage caps.
  • Facebook/Meta has released Code LLaMA, a version of their LLaMA 2 model that has been specialized for writing code. It can be used for code generation or completion. Its context window is 100,000 tokens, allowing Code LLaMA to be more accurate on larger programs.
  • OpenAI has announced that API users can now fine-tune GPT-3.5 for their own applications. Fine-tuning for GPT-4 will come later. To preserve safety, tuning data is passed through OpenAI’s moderation filter.
  • txtai is an open source embeddings database. It is a vector database that has been designed specifically to work with natural language problems.
  • TextFX is a set of tools that use Google’s PaLM 2 model to play with language. It doesn’t answer questions or write poems; it allows users to see the possibilities in words as an aid to their own creativity.
  • A US judge has ruled that an AI system cannot copyright a work. In this case, the AI itself—not the human user—was to hold the copyright. This ruling is in line with the Copyright Office’s guidance: giving prompts to a generative algorithm isn’t sufficient to create a copyrightable work.
  • Despite an error rate of roughly 50% for ChatGPT, a study shows that users prefer ChatGPT’s answers to programming questions over answers from StackOverflow. ChatGPT’s complete, articulate, and polite answers appear to be the cause of this preference.
  • AI was on the agenda at DefCon and, while results of a red teaming competition won’t be released for some months, it’s clear that security remains an afterthought, and that attacking the current AI models is extremely easy.
  • Emotion recognition is difficult, if not impossible. It is not clear that there are any credible use cases for it. AI systems are particularly bad at it. But companies are building products.
  • Watermarking has been proposed as a technique for identifying whether content was generated by AI, but it’s not a panacea. Here are some questions to help evaluate whether watermarks are useful in any given situation.
  • Zoom and Grammarly have both issued new license agreements that allow them to use data collected from users to train AI. Zoom has backed down after customer backlash, but that begs the question: Will other applications follow?
  • Using large language models for work or play is one thing, but how do you put one into production? 7 Frameworks for Serving LLMs surveys some tools for deploying language models.
  • Simon Willison provides instructions for running LLaMA 2 on a Mac. He also provides slides and a well-edited transcript of his talk about LLMs at North Bay Python.
  • PhotoGuard is a tool for protecting photos and other images from manipulation by AI systems. It adds data to the image in ways that aren’t detectable by humans, but that introduce noticeable distortions when the image is modified.
  • C2PA is a cryptographic protocol for attesting to the provenance of electronic documents. It could be used for specifying whether documents are generated by AI.
  • Google’s DeepMind has built a vision-language-action model called RT-2 (Robotic Transformer 2) that combines vision and language with the ability to control a robot. It learns both from web data (images and text) and robotic data (interactions with physical objects).
Programming
  • Maccarone is an extension to VSCode that allows you to “delegate” blocks of Python code to AI (GPT-4). The portions of the code that are under AI control are automatically updated as needed when the surrounding code is changed.
  • Microsoft is adding Python as a scripting language for Excel formulas. Python code executes in an Azure container that includes some commonly used libraries, including Matplotlib and Pandas.
  • Many companies are building platform engineering teams as a means of making software developers more effective. Here are some ideas about getting started with platform engineering.
  • A Google study of its in-house Rust use supports the claim that Rust makes it easier to produce high-quality code. The study also busts a number of myths about the language. It isn’t as hard to learn as most people think (then again, this is a Google study).
  • deno_python is a Javascript module that allows integration between Javascript (running on Deno) and Python, allowing Javascript programmers to call important Python libraries and call Python functions.
  • The Python Steering Council has announced that it will make the Global Interpreter Lock (GIL) optional in a future version of Python. Python’s GIL has long been a barrier to effective multi-threaded computing. The change will be backwards-compatible.
Web
  • Google’s controversial Web Environment Integrity proposal provides a way for web servers to cryptographically authenticate the browser software making a request. WEI could potentially reduce online fraud, but it also presents some significant privacy risks.
  • Trafilatura is a new tool for web scraping that has been designed with quantitative research (for example, assembling training data for language models). It can extract text and metadata from HTML, and generate output in a number of formats.
  • Astro is yet another open source web framework that’s designed for high performance and ease of development.
  • While the “browser wars” are far behind us, it is still difficult for developers to write code that works correctly on all browsers. Baseline is a project of the W3C’s WebDX Community Group that specifies which features web developers can rely on in the most widely used browsers.
  • How Large Language Models Assisted a Website Makeover raises some important questions: When do you stop using ChatGPT and finish the job yourself?  When does your own ability start to atrophy?
  • Remember Flash? It has a museum… And Flash games will run in a modern browser using Ruffle, a Flash Player emulator that is written in WebAssembly.
Security
  • Proof-of-work makes it to the Tor network. It is used as a defense against denial of service attacks. PoW is disabled most of the time, but when traffic seems unusually high, it can switch on, forcing users to “prove” their humanness (actually, their willingness to perform work).
  • A retrospective on this year’s MoveIT attack draws some important conclusions about protecting your assets. Mapping the supply chain, third party risk management, zero trust, and continuous penetration testing are all important parts of a security plan.
  • Bitwarden has released an open source end-to-end encrypted secrets manager. The secrets manager allows safe distribution of API keys, certificates and other sensitive data.
  • The US Government has announced the AI Cybersecurity Challenge (AIxCC). AIxCC is a two year competition to build AI systems that can secure critical software. There’s $18.5 million in prizes, plus the possibility of DARPA funding for up to seven companies.
  • OSC&R is the Open Source Supply Chain Attack Reference, a new project that catalogs and describes techniques used to attack software supply chains. It is modeled on MITRE’s ATT&CK framework.
  • The Lapsus$ group has become one of the most effective threat actors, despite being relatively unsophisticated. They rely on persistence, clever social engineering, and analyzing weak points in an organization’s security posture rather than compromising infrastructure.
  • The NSA has issued a report that gives guidance on how to protect systems against memory safety bugs.
  • Bruce Schneier has an important take on the long-term consequences of the SolarWinds attack. Those consequences include the theft of an Azure customer account signing key that in turn has been used by attackers to access US government email accounts.
  • A new generation of ransomware attacks is targeting IT professionals via fake advertisements for IT tools. While IT professionals are (presumably) more wary and aware than other users, they are also high-value targets.
Hardware
  • Parmesan cheese producers are experimenting with adding microchips to the cheese rind to authenticate genuine cheese.
  • Adoption of RISC-V, a royalty-free open source instruction set architecture for microprocessors, has been increasing. Could it displace ARM?
  • Speculative execution bugs have been discovered for recent Intel (“Downfall”) and AMD (“Inception”) processors. Patches for Linux have been released.
Operations Quantum Computing
  • Peter Shor, inventor of the quantum algorithm for factoring prime numbers (which in turn could be used to break most modern cryptography that isn’t quantum-resistant), has published the lecture notes from the course on quantum computing that he teaches at MIT.
  • A Honeywell quantum computer has been used to find a material that can improve solar cell efficiency. It’s likely that the first applications of quantum computing will involve simulating quantum phenomena rather than pure computation.
Cryptocurrency
  • If you’re interested in iris-scanning WorldCoin, a cryptographer analyzes the privacy promises made by their system. He remains skeptical, but came away less unimpressed than he expected to be.
  • Paypal has introduced a stablecoin that claims to be fully backed by US dollars.
Biology
Categories: Technology

The next generation of developer productivity

O'Reilly Radar - Tue, 2023/08/15 - 03:06

To follow up on our previous survey about low-code and no-code tools, we decided to run another short survey about tools specifically for software developers—including, but not limited to, GitHub Copilot and ChatGPT. We’re interested in how “developer enablement” tools of all sorts are changing the workplace. Our survey 1 showed that while these tools increased productivity, they aren’t without their costs. Both upskilling and retraining developers to use these tools are issues.

Few professional software developers will find it surprising that software development teams are respondents said that productivity is the biggest challenge their organization faced, and another 19% said that time to market and deployment speed are the biggest challenges. Those two answers are almost the same: decreasing time to market requires increasing productivity, and improving deployment speed is itself an increase in productivity. Together, those two answers represented 48% of the respondents, just short of half.

HR issues were the second-most-important challenge, but they’re nowhere near as pressing. 12% of the respondents reported that job satisfaction is the greatest challenge; 11% said that there aren’t good job candidates to hire; and 10% said that employee retention is the biggest issue. Those three challenges total 33%, just one-third of the respondents.


1 Our survey ran from April 18 to April 25, 2023. There were 739 responses.

It’s heartening to realize that hiring and retention are still challenges in this time of massive layoffs, but it’s also important to realize that these issues are less important than productivity.

But the big issue, the issue we wanted to explore, isn’t the challenges themselves; it’s what organizations are doing to meet them. A surprisingly large percentage of respondents (28%) aren’t making any changes to become more productive. But 20% are changing their onboarding and upskilling processes, 15% are hiring new developers, and 13% are using self-service engineering platforms.

We found that the biggest struggle for developers working with new tools is training (34%), and another 12% said the biggest struggle is “ease of use.” Together, that’s almost half of all respondents (46%). That was a surprise, since many of these tools are supposed to be low- or no-code. We’re thinking specifically about tools like GitHub Copilot, Amazon CodeWhisperer, and other code generators, but almost all productivity tools claim to make life simpler. At least at first, that’s clearly not true. There’s a learning curve, and it appears to be steeper than we’d have guessed. It’s also worth noting that 13% of the respondents said that the tools “didn’t effectively solve the problems that developers face.”

Over half of the respondents (51%) said that their organizations are using self-service deployment pipelines to increase productivity. Another 13% said that while they’re using self-service pipelines, they haven’t seen an increase in productivity. So almost two-thirds of the respondents are using self-service pipelines for deployment, and for most of them, the pipelines are working—reducing the overhead required to put new projects into production.

Finally, we wanted to know specifically about the effect of GitHub Copilot, ChatGPT, and other AI-based programming tools. Two-thirds of the respondents (67%) reported that these tools aren’t in use at their organizations. We suspect this estimate is lowballing Copilot’s actual usage. Back in the early 2000s, a widely quoted survey reported that CIOs almost unanimously said that their IT organizations weren’t making use of open source. How little they knew! Actual usage of Copilot, ChatGPT, and similar tools is likely to be much higher than 33%. We’re sure that even if they aren’t using Copilot or ChatGPT on the job, many programmers are experimenting with these tools or using them on personal projects.

What about the 33% who reported that Copilot and ChatGPT are in use at their organizations? First, realize that these are early adopters: Copilot was only released a year and a half ago, and ChatGPT has been out for less than a year. It’s certainly significant that they (and similar tools) have grabbed a third of the market in that short a period. It’s also significant that making a commitment to a new way of programming—and these tools are nothing if not a new kind of programming—is a much bigger change than, say, signing up for a ChatGPT account.

11% of the respondents said their organizations use Copilot and ChatGPT, and that the tools are primarily useful to junior developers; 13% said they’re primarily useful to senior developers. Another 9% said that the tools haven’t yielded an increase in productivity. The difference between junior and senior developers is closer than we expected. Common wisdom is that Copilot is more of an advantage to senior programmers, who are better able to describe the problem they need to solve in an intricate set of prompts and to notice bugs in the generated code quickly. Our survey hints that the difference between senior and junior developers is relatively small—although they’re almost certainly using Copilot in different ways. Junior developers are using it to learn and to spend less time solving problems by looking up solutions on Stack Overflow or searching online documentation. Senior developers are using it to help design and structure systems, and even to create production code.

Is developer productivity an issue? Of course; it always is. Part of the solution is improved tooling: self-service deployment, code-generation tools, and other new technologies and ideas. Productivity tools—and specifically the successors to tools like Copilot—are remaking software development in radical ways. Software developers are getting value from these tools, but don’t let the buzz fool you: that value doesn’t come for free. Nobody’s going to sit down with ChatGPT, type “Generate an enterprise application for selling shoes,” and come away with something worthwhile. Each has its own learning curve, and it’s easy to underestimate how steep that curve can be. Developer productivity tools will be a big part of the future; but to take full advantage of those tools, organizations will need to plan for skills development.

Categories: Technology

The ChatGPT Surge

O'Reilly Radar - Tue, 2023/08/08 - 03:12

I’m sure that nobody will be surprised that the number of searches for ChatGPT on the O’Reilly learning platform skyrocketed after its release in November, 2022. It might be a surprise how quickly it got to the top of our charts: it peaked in May as the 6th most common search query. Then it dropped almost as quickly: it dropped back to #8 in June, and fell further to #19 in July. At its peak, ChatGPT was in very exclusive company: it’s not quite on the level of Python, Kubernetes, and Java, but it’s in the mix with AWS and React, and significantly ahead of Docker.

A look at the number of searches for terms commonly associated with AI shows how dramatic this rise was:

ChatGPT came from nowhere to top all the AI search terms except for Machine Learning itself, which is consistently our #3 search term—and, despite ChatGPT’s dramatic decline in June and July, it’s still ahead of all other search terms relevant to AI. The number of searches for Machine Learning itself held steady, though it arguably declined slightly when ChatGPT appeared. What’s more interesting, though, is that the search term “Generative AI” suddenly emerged from the pack as the third most popular search term. If current trends continue, in August we might see more searches for Generative AI than for ChatGPT.

What can we make of this? Everyone knows that ChatGPT had one of the most successful launches of any software project, passing a million users in its first five days. (Since then, it’s been beaten by Facebook’s Threads, though that’s not really a fair comparison.) There are plenty of reasons for this surge. Talking computers have been a science fiction dream since well before Star Trek—by itself, that’s a good reason for the public’s fascination. ChatGPT might simplify common tasks, from doing research to writing essays to basic programming, so many people want to use it to save labor—though getting it to do quality work is more difficult than it seems at first glance. (We’ll leave the issue of whether this is “cheating” to the users, their teachers, and their employers.) And, while I’ve written frequently about how ChatGPT will change programming, it will undoubtedly have an even greater effect on non-programmers. It will give them the chance to tell computers what to do without programming; it’s the ultimate “low code” experience.

So there are plenty of reasons for ChatGPT to surge. What about other search terms? It’s easy to dismiss these search queries as also-rans, but they were all in the top 300 for May, 2023—and we typically have a few million unique search terms per month. Removing ChatGPT and Machine Learning from the previous graph makes it easier to see trends in the other popular search terms:

It’s mostly “up and to the right.” Three search terms stand out: Generative AI, LLM, and Langchain all follow similar curves: they start off with relatively moderate growth that suddenly becomes much steeper in February, 2023. We’ve already noted that the number of searches for Generative AI increased sharply since the release of ChatGPT, and haven’t declined in the past two months. Our users evidently prefer LLM to spelling out “Large Language Models,” but if you add these two search terms together, the total number of searches for July is within 1% of Generative AI. This surge didn’t really start until last November, when it was spurred by the appearance of ChatGPT—even though search terms like LLM were already in circulation because of GPT-3, DALL-E, StableDiffusion, Midjourney, and other language-based generative AI tools.

Unlike LLMs, Langchain didn’t exist prior to ChatGPT—but once it appeared, the number of searches took off rapidly, and didn’t decline in June and July. That makes sense; although it’s still early, Langchain looks like it will be the cornerstone of LLM-based software development. It’s a widely used platform for building applications that generate queries programmatically and that connects LLMs with each other, with databases, and with other software. Langchain is frequently used to look up relevant articles that weren’t in ChatGPT’s training data and package them as part of a lengthy prompt.

In this group, the only search term that seems to be in a decline is Natural Language Processing. Although large language models clearly fall into the category of NLP, we suspect that most users associate NLP with older approaches to building chatbots. Searches for Artificial Intelligence appear to be holding their own, though it’s surprising that there are so few searches for AI compared to Machine Learning. The difference stems from O’Reilly’s audience, which is relatively technical and prefers the more precise term Machine Learning. Nevertheless, the number of searches for AI rose with the release of ChatGPT, possibly because ChatGPT’s appeal wasn’t limited to the technical community.

Now that we’ve run through the data, we’re left with the big question: What happened to ChatGPT? Why did it decline from roughly 5,000 searches to slightly over 2,500 in a period of two months? There are many possible reasons. Perhaps students stopped using ChatGPT for homework assignments as graduation and summer vacation approached. Perhaps ChatGPT has saturated the world; people know what they need to know, and are waiting for the next blockbuster. An article in Ars Technica notes that ChatGPT usage declined from May to June, and suggests many possible causes, including attention to the Twitter/Threads drama and frustration because OpenAI implemented stricter guardrails to prevent abuse. It would be unfortunate if ChatGPT usage is declining because people can’t use it to generate abusive content, but that’s a different article…

A more important reason for this decline might be that ChatGPT is no longer the only game in town. There are now many alternative language models. Most of these alternatives descend from Meta’s LLaMA and Georgi Gerganov’s llama.cpp (which can run on laptops, cell phones, and even Raspberry Pi). Users can train these models to do whatever they want. Some of these models already have chat interfaces, and all of them could support chat interfaces with some fairly simple programming. None of these alternatives generate significant search traffic at O’Reilly, but that doesn’t mean that they won’t in the future, or that they aren’t an important part of the ecosystem. Their proliferation is an important piece of evidence about what’s happening among O’Reilly’s users. AI developers now need to ask a question that didn’t even exist last November: should they build on large foundation models like ChatGPT or Google’s Bard, using public APIs and paying by the token? Or should they start with an open source model that can run locally and be trained for their specific application?

This last explanation makes a lot of sense in context. We’ve moved beyond the initial phase, when ChatGPT was a fascinating toy. We’re now building applications and incorporating language models into products, so trends in search terms have shifted accordingly. A developer interested in building with large language models needs more context; learning about ChatGPT by itself isn’t enough. Developers who want to learn about language models need different kinds of information, information that’s both deeper and broader. They need to learn about how generative AI works, about new LLMs, about programming with Langchain and other platforms. All of these search terms increased while ChatGPT declined. Now that there are options, and now that everyone has had a chance to try out ChatGPT, the first step in an AI project isn’t to search for ChatGPT. It’s to get a sense of the landscape, to discover the possibilities.

Searches for ChatGPT peaked quickly, and are now declining rapidly—and who knows what August and September will bring? (We wouldn’t be surprised to see ChatGPT bounce back as students return to school and homework assignments.) The real news is that ChatGPT is no longer the whole story: you can’t look at the decline in ChatGPT without also considering what else our users are searching for as they start building AI into other projects. Large language models are very clearly part of the future. They will change the way we work and live, and we’re just at the start of the revolution.

Categories: Technology

Radar Trends to Watch: August 2023

O'Reilly Radar - Tue, 2023/08/01 - 03:04

Artificial Intelligence continues to dominate the news. In the past month, we’ve seen a number of major updates to language models: Claude 2, with its 100,000 token context limit; LLaMA 2, with (relatively) liberal restrictions on use; and Stable Diffusion XL, a significantly more capable version of Stable Diffusion. Does Claude 2’s huge context really change what the model can do? And what role will open access and open source language models have as commercial applications develop?

Artificial Intelligence
  • Stable Diffusion XL is a new generative model that expands on the abilities of Stable Diffusion. It promises shorter, easier prompts; the ability to generate text within images correctly; the ability to be trained on private data; and of course, higher quality output. Try it on clipdrop.
  • OpenAI has withdrawn OpenAI Classifier, a tool that was supposed to detect AI-generated text, because it was not accurate enough.
  • ChatGPT has added a new feature called “Custom Instructions.”  This feature lets users specify an initial prompt that ChatGPT processes prior to any other user-generated prompts; essentially, it’s a personal “system prompt.” Something to make prompt injection more fun.
  • Qualcomm is working with Facebook/Meta to run LLaMA 2 on small devices like phones, enabling AI applications to run locally. The distinction between open source and other licenses will prove much less important than the size of the machine on which the target runs.
  • StabilityAI has released two new large language models, FreeWilly1 and FreeWilly2. They are based on LLaMA and LLaMA 2 respectively. They are called Open Access (as opposed to Open Source), and claim performance similar to GPT 3.5 for some tasks.
  • Chatbot Arena lets chatbots do battle with each other. Users enter prompts, which are sent to two unnamed (randomly chosen?) language models. After the responses have been generated, users can declare a winner, and find out which models have been competing.
  • GPT-4’s ability to generate correct answers to problems may have degraded over the past few months—in particular, its ability to solve mathematical problems and generate correct Python code seems to have suffered. On the other hand, it is more robust against jailbreaking attacks.
  • Facebook/Meta has released Llama 2. While there are fewer restrictions on its use than other models, it is not open source despite Facebook’s claims.
  • Autochain is a lightweight, simpler alternative to Langchain. It allows developers to build complex applications on top of large language models and databases.
  • Elon Musk has announced his new AI company, xAI. Whether this will actually contribute to AI or be another sideshow is anyone’s guess.
  • Anthropic has announced Claude 2, a new version of their large language model. A chat interface is available at claude.ai, and API access is available. Claude 2 allows prompts of up to 100,000 tokens, much larger than other LLMs, and can generate output up to “a few thousand tokens” in length.
  • parsel is a framework that helps large language models do a better job on tasks involving hierarchical multi-step reasoning and problem solving.
  • gpt-prompt-engineer is a tool that reads a description of the task you want an AI to perform, plus a number of test cases. It then generates a large number of prompts about a topic, tests the prompts, and rates the results.
  • LlamaIndex is a data framework (sometimes called an “orchestration framework”) for language models that simplifies the process of indexing a user’s data and using that data to build complex prompts for language models. It can be used with Langchain to build complex AI applications.
  • OpenAI is gradually releasing its Code Interpreter, which will allow ChatGPT to execute any code that it creates, using data provided by the user, and sending output back to the user. Code interpreter reduces hallucinations, errors, and bad math.
  • Humans can now beat AI at Go by finding and exploiting weaknesses in the AI system’s play, tricking the AI into making serious mistakes.
  • Time for existential questions: Does a single banana exist? Midjourney doesn’t think so. Seriously, this is an excellent article about the difficulty of designing prompts that deliver appropriate results.
  • The Jolly Roger Telephone Company has developed GPT–4-based voicebots that you can hire to answer your phone when telemarketers call. If you want to listen in, the results can be hilarious.
  • Apache Spark now has an English SDK. It goes a step beyond tools like CoPilot, allowing you to use English directly when writing code.
  • Humans may be more likely to believe misinformation generated by AI, possibly because AI-generated text is better structured than most human text. Or maybe because AIs are very good at being convincing.
  • OpenOrca is yet another LLaMA-based open source language model and dataset. Its goal is to reproduce the training data for Microsoft’s Orca, which was trained using chain-of-thought prompts and responses from GPT-4. The claim for both Orca models is that it can reproduce GPT-4’s “reasoning” processes.
  • At its developer summit, Snowflake announced Document AI: natural language queries of collections of unstructured documents. This product is based on their own large language model, not an AI provider.
Programming
  • “It works on my machine” has become “It works in my container”: This article has some good suggestions about how to avoid a problem that has plagued computer users for decades.
  • StackOverflow is integrating AI into its products. StackOverflow for Teams now has a chatbot to help solve technical problems, along with a new GenAI StackExchange for discussing generative AI, prompt writing, and related issues.
  • It isn’t news that GitHub can leak private keys and authentication secrets. But a study of the containers available on DockerHub shows that Docker containers also leak keys and secrets, and many of these keys are in active use.
  • Firejail is a Linux tool that can run any process in a private, secure sandbox.
  • Complex and complicated: what’s the difference? It has to do with information, and it’s important to understand in an era of “complex systems.” First in a series.
  • npm-manifest-check is a tool that checks the contents of a package in NPM against the package’s manifest. It is a partial solution to the problem of malicious packages in NPM.
  • Facebook has described their software development platform, much of which they have open sourced. Few developers have to work with software projects this large, but their tools (which include testing frameworks, version control, and a build system) are worth investigating.
  • Polyrhythmix is a command-line program for generating polyrhythmic drum parts. No AI involved.
  • Philip Guo’s “Real-Real-World Programming with ChatGPT” shows what it’s like to use ChatGPT to do a real programming task: what works well, what doesn’t.
Security
  • A research group has found a way to automatically generate attack strings that force large language models to generate harmful content. These attacks work against both open- and closed-source models. It isn’t clear that AI providers can defend against them.
  • The cybercrime syndicate Lazarus Group is running a social engineering attack against JavaScript cryptocurrency developers. Developers are invited to collaborate on a Github project that depends on malicious NPM packages.
  • Language models are the next big thing in cybercrime. A large language model called WormGPT has been developed for use by cybercriminals. It is based on GPT-J. WormGPT is available on the dark web along with thousands of stolen ChatGPT credentials.
  • According to research by MITRE, out-of-bounds writes are among the most dangerous security bugs. They are also the most common, and are consistently at the top of the list. An easy solution to the problem is to use Rust.
Web
  • Another web framework? Enhance claims to be HTML-first, with JavaScript only if you need it. The reality may not be that simple, but if nothing else, it’s evidence of growing dissatisfaction with complex and bloated web applications.
  • Another new browser? Arc rethinks the browsing experience with the ability to switch between groups of tabs and customize individual websites.
  • HTMX provides a way of using HTML attributes to build many advanced web page features, including WebSockets and what we used to call Ajax. All the complexity appears to be packaged into one JavaScript library.
  • There is a law office in the Metaverse, along with a fledgling Metaverse Bar Association. It’s a good place for meetings, although lawyers cannot be licensed to practice in the Metaverse.
  • The European Court of Justice (CJEU) has ruled that Meta’s approach to GDPR compliance is illegal. Meta may not use data for anything other than core functionality without explicit, freely-given consent; consent hidden in the terms of use document does not suffice.
Cryptocurrency
  • Google has updated its policy on Android apps to allow apps to give blockchain-based assets such as NFTs.
  • ChatGPT can be programmed to send Bitcoin payments. As the first commenter points out, this is a fairly simple application of Langchain. But it’s something that was certainly going to happen. But it begs the question: when will we have GPT-based cryptocurrency arbitrage?
Biology
  • Google has developed Med-PaLM M, an attempt at building a “generalist” multimodal AI that has been trained for biomedical applications. Med-PaLM M is still a research project, but may represent a step forward in the application of large language models to medicine.
Materials
  • Room temperature ambient pressure superconductors: This claim has met with a lot of skepticism—but as always, it’s best to wait until another team succeeds or fails to duplicate the results. If this research holds up, it’s a huge step forward.
Categories: Technology

Real-Real-World Programming with ChatGPT

O'Reilly Radar - Tue, 2023/07/25 - 03:49

If you’re reading this, chances are you’ve played around with using AI tools like ChatGPT or GitHub Copilot to write code for you. Or even if you haven’t yet, then you’ve at least heard about these tools in your newsfeed over the past year. So far I’ve read a gazillion blog posts about people’s experiences with these AI coding assistance tools. These posts often recount someone trying ChatGPT or Copilot for the first time with a few simple prompts, seeing how it does for some small self-contained coding tasks, and then making sweeping claims like “WOW this exceeded all my highest hopes and wildest dreams, it’s going to replace all programmers in five years!” or “ha look how incompetent it is … it couldn’t even get my simple question right!”

I really wanted to go beyond these quick gut reactions that I’ve seen so much of online, so I tried using ChatGPT for a few weeks to help me implement a hobby software project and took notes on what I found interesting. This article summarizes what I learned from that experience. The inspiration (and title) for it comes from Mike Loukides’ Radar article on Real World Programming with ChatGPT, which shares a similar spirit of digging into the potential and limits of AI tools for more realistic end-to-end programming tasks.

Setting the Stage: Who Am I and What Am I Trying to Build?

I’m a professor who is interested in how we can use LLMs (Large Language Models) to teach programming. My student and I recently published a research paper on this topic, which we summarized in our Radar article Teaching Programming in the Age of ChatGPT. Our paper reinforces the growing consensus that LLM-based AI tools such as ChatGPT and GitHub Copilot can now solve many of the small self-contained programming problems that are found in introductory classes. For instance, problems like “write a Python function that takes a list of names, splits them by first and last name, and sorts by last name.” It’s well-known that current AI tools can solve these kinds of problems even better than many students can. But there’s a huge difference between AI writing self-contained functions like these and building a real piece of software end-to-end. I was curious to see how well AI could help students do the latter, so I wanted to first try doing it myself.

I needed a concrete project to implement with the help of AI, so I decided to go with an idea that had been in the back of my head for a while now: Since I read a lot of research papers for my job, I often have multiple browser tabs open with the PDFs of papers I’m planning to read. I thought it would be cool to play music from the year that each paper was written while I was reading it, which provides era-appropriate background music to accompany each paper. For instance, if I’m reading a paper from 2019, a popular song from that year could start playing. And if I switch tabs to view a paper from 2008, then a song from 2008 could start up. To provide some coherence to the music, I decided to use Taylor Swift songs since her discography covers the time span of most papers that I typically read: Her main albums were released in 2006, 2008, 2010, 2012, 2014, 2017, 2019, 2020, and 2022. This choice also inspired me to call my project Swift Papers.

Swift Papers felt like a well-scoped project to test how well AI handles a realistic yet manageable real-world programming task. Here’s how I worked on it: I subscribed to ChatGPT Plus and used the GPT-4 model in ChatGPT (first the May 12, 2023 version, then the May 24 version) to help me with design and implementation. I also installed the latest VS Code (Visual Studio Code) with GitHub Copilot and the experimental Copilot Chat plugins, but I ended up not using them much. I found it easier to keep a single conversational flow within ChatGPT rather than switching between multiple tools. Lastly, I tried not to search for help on Google, Stack Overflow, or other websites, which is what I would normally be doing while programming. In sum, this is me trying to simulate the experience of relying as much as possible on ChatGPT to get this project done.

Getting Started: Setup Trials and Tribulations

Here’s the exact prompt I used to start my conversation with ChatGPT using GPT-4:

Act as a software developer to help me build something that will play music from a time period that matches when an academic paper I am reading in the browser was written.

I purposely kept this prompt high-level and underspecified since I wanted ChatGPT to guide me toward design and implementation ideas without me coming in with preconceived notions.

ChatGPT immediately suggested a promising direction—making a browser extension that gets the date of the research paper PDF in the currently-active tab and calls a music streaming API to play a song from that time period. Since I already had a YouTube Music account, I asked whether I could use it, but ChatGPT said that YouTube Music doesn’t have an API. We then brainstormed alternative ideas like using a browser automation tool to programmatically navigate and click on parts of the YouTube Music webpage. ChatGPT gave me some ideas along these lines but warned me that, “It’s important to note that while this approach doesn’t use any official APIs, it’s more brittle and more subject to break if YouTube Music changes their website structure. […] keep in mind that web scraping and browser automation can be complex, and handling all of the edge cases can be a significant amount of work. […] using APIs might be a more reliable and manageable solution.” That warning convinced me to drop this idea. I recalled that ChatGPT had recommended the Spotify Web API in an earlier response, so I asked it to teach me more about what it can do and tell me why I should use it rather than YouTube Music. It seemed like Spotify had what I needed, so I decided to go with it. I liked how ChatGPT helped me work through the tradeoffs of these initial design decisions before diving head-first into coding.

Next we worked together to set up the boilerplate code for a Chrome browser extension, which I’ve never made before. ChatGPT started by generating a manifest.json file for me, which holds the configuration settings that every Chrome extension needs. I didn’t know it at the time, but manifest.json would cause me a bunch of frustration later on. Specifically:

  • ChatGPT generated a manifest.json file in the old Version 2 (v2) format, which is unsupported in the current version of Chrome. For a few years now Google has been transitioning developers to v3, which I didn’t know about since I had no prior experience with Chrome extensions. And ChatGPT didn’t warn me about this. I guessed that maybe ChatGPT only knew about v2 since it was trained on open-source code from before September 2021 (its knowledge cutoff date) and v2 was the dominant format before that date. When I tried loading the v2 manifest.json file into Chrome and saw the error message, I told ChatGPT “Google says that manifest version 2 is deprecated and to upgrade to version 3.” To my surprise, it knew about v3 from its training data and generated a v3 manifest file for me in response. It even told me that v3 is the currently-supported version (not v2!) … yet it still defaulted to v2 without giving me any warning! This frustrated me even more than if ChatGPT had not known about v3 in the first place (in that case I wouldn’t blame it for not telling me something that it clearly didn’t know). This theme of sub-optimal defaults will come up repeatedly—that is, ChatGPT ‘knows’ what the optimal choice is but won’t generate it for me without me asking for it. The dilemma is that someone like me who is new to this area wouldn’t even know what to ask for in the first place.
  • After I got the v3 manifest working in Chrome, as I tried using ChatGPT to help me add more details to my manifest.json file, it tended to “drift” back to generating code in v2 format. I had to tell it a few times to only generate v3 code from now on, and I still didn’t fully trust it to follow my directive. Besides generating code for v2 manifest files, it also generated starter JavaScript code for my Chrome extension that works only with v2 instead of v3, which led to more mysterious errors. If I were to start over knowing what I do now, my initial prompt would have sternly told ChatGPT that I wanted to make an extension using v3, which would hopefully avoid it leading me down this v2 rabbit hole.
  • The manifest file that ChatGPT generated for me declared the minimal set of permissions—it only listed the activeTab permission, which grants the extension limited access to the active browser tab. While this has the benefit of respecting user privacy by minimizing permissions (which is a best practice that ChatGPT may have learned from its training data), it made my coding efforts a lot more painful since I kept running into unexpected errors when I tried adding new functionality to my Chrome extension. Those errors often showed up as something not working as intended, but Chrome wouldn’t necessarily display a permission denied message. In the end, I had to add four additional permissions—”tabs”,  “storage”, “scripting”, “identity”—as well as a separate “host_permissions” field to my manifest.json.

Wrestling with all these finicky details of manifest.json before I could begin any real coding felt like death by a thousand cuts. In addition, ChatGPT generated other starter code in the chat, which I copied into new files in my VS Code project:

Intermission 1: ChatGPT as a Personalized Tutor

As shown above, a typical Chrome extension like mine has at least three JavaScript files: a background script, a content script, and a pop-up script. At this point I wanted to learn more about what all these files are meant to do rather than continuing to obediently copy-paste code from ChatGPT into my project. Specifically, I discovered that each file has different permissions for what browser or page components it can access, so all three must coordinate to make the extension work as intended. Normally I would read tutorials about how this all fits together, but the problem with tutorials is that they are not customized to my specific use case. Tutorials provide generic conceptual explanations and use made-up toy examples that I can’t relate to. So I end up needing to figure out how their explanations may or may not apply to my own context.

In contrast, ChatGPT can generate personalized tutorials that use my own Swift Papers project as the example in its explanations! For instance, when it explained to me what a content script does, it added that “For your specific project, a content script would be used to extract information (the publication date) from the academic paper’s webpage. The content script can access the DOM of the webpage, find the element that contains the publication date, and retrieve the date.” Similarly, it taught me that “Background scripts are ideal for handling long-term or ongoing tasks, managing state, maintaining databases, and communicating with remote servers. In your project, the background script could be responsible for communicating with the music API, controlling the music playback, and storing any data or settings that need to persist between browsing sessions.”

I kept asking ChatGPT follow-up questions to get it to teach me more nuances about how Chrome extensions worked, and it grounded its explanations in how those concepts applied to my Swift Papers project. To accompany its explanations, it also generated relevant example code that I could try out by running my extension. These explanations clicked well in my head because I was already deep into working on Swift Papers. It was a much better learning experience than, say, reading generic getting-started tutorials that walk through creating example extensions like “track your page reading time” or “remove clutter from a webpage” or “manage your tabs better” … I couldn’t bring myself to care about those examples since THEY WEREN’T RELEVANT TO ME! At the time, I cared only about how these concepts applied to my own project, so ChatGPT shined here by generating personalized mini-tutorials on-demand.

Another great side-effect of ChatGPT teaching me these concepts directly within our ongoing chat conversation is that whenever I went back to work on Swift Papers after a few days away from it, I could scroll back up in the chat history to review what I recently learned. This reinforced the knowledge in my head and got me back into the context of resuming where I last left off. To me, this is a huge benefit of a conversational interface like ChatGPT versus an IDE autocomplete interface like GitHub Copilot, which doesn’t leave a trace of its interaction history. Even though I had Copilot installed in VS Code as I was working on Swift Papers, I rarely used it (beyond simple autocompletions) since I liked having a chat history in ChatGPT to refer back to in later sessions.

Next Up: Choosing and Installing a Date Parsing Library

Ideally Swift Papers would infer the date when an academic paper was written by analyzing its PDF file, but that seemed too hard to do since there isn’t a standard place within a PDF where the publication date is listed. Instead what I decided to do was to parse the “landing pages” for each paper that contains metadata such as its title, abstract, and publication date. Many papers I read are linked from a small handful of websites, such as the ACM Digital Library, arXiv, or Google Scholar, so I could parse the HTML of those landing pages to extract publication dates. For instance, here’s the landing page for the classic Beyond being there paper:

I wanted to parse the “Published: 01 June 1992” string on that page to get 1992 as the publication year. I could’ve written this code by hand, but I wanted to try using a JavaScript date parsing library since it would be more robust to date format variations that appear on various websites (e.g., using “22” for the year 2022). Also, since any real-world software project will need to use external libraries, I wanted to see how well ChatGPT could help me choose and install libraries.

ChatGPT suggested two libraries for me: Moment.js and chrono-node. However, it warned me about Moment.js: “as of September 2020, it is considered a legacy project and not recommended for new projects as the team is not planning on doing any new development or maintenance.” I verified this was true by seeing the same warning on the Moment.js homepage. But still, I liked how Moment.js was available as a single self-contained file that I could directly include into my extension without using a package manager like npm or a bundler like webpack (the fewer external tools I needed to set up, the better!). Or so I thought … ChatGPT led me to believe that I could get by without npm and webpack, but later I discovered that this only works in the old Manifest v2 extension format that is no longer supported by Chrome. It turns out that with Manifest v3, all external libraries must be bundled together using tools like npm and webpack in order for the extension to import them. So it goes… once more I got bitten by the fact that ChatGPT was biased toward producing outdated information for v2 rather than the newer v3, presumably because there was a lot more information about v2 in its training data.

Since I had to use npm and webpack anyways, I decided to go with chrono-node since it seemed more robust and up-to-date (no pun intended). I had ChatGPT help me set up my webpack configuration file (webpack.config.js) and things almost seemed to work, except that I got a mysterious error. When I pasted the error message into ChatGPT, it correctly diagnosed the problem, which was something related to (surprise surprise!) webpack’s default settings not being compatible with the security restrictions of my extension’s Manifest v3 format. It also suggested a working fix to webpack.config.js:

This error message is related to the Content Security Policy (CSP) of Chrome extensions. […] By default, Manifest V3 disallows the use of eval() and the Function() constructor, which are typically used for evaluating or creating code from strings […] Webpack, by default, includes a small amount of inline JavaScript code for handling hot module replacement and some other features. This code uses eval() which is not allowed in the context of a Chrome extension. To solve this, you can adjust your webpack configuration to output code in a format that doesn’t rely on eval(). In your webpack.config.js, you can set the devtool option to ‘none’ or use the ‘source-map’ setting […]

Here again ChatGPT showed me that it clearly knew what the problem was (since it told me after I fed it the error message!) and how to fix it. So why didn’t it produce the correct webpack configuration file in the first place?

More generally, several times I’ve seen ChatGPT produce code that I felt might be incorrect. Then when I tell it that there might be a bug in a certain part, it admits its mistake and produces the correct code in response. If it knew that its original code was incorrect, then why didn’t it generate the correct code in the first place?!? Why did I have to ask it to clarify before it admitted its mistake? I’m not an expert at how LLMs work internally, but my layperson guess is that it may have to do with the fact that ChatGPT generates code linearly one token at a time, so it may get ‘stuck’ near local maxima (with code that mostly works but is incorrect in some way) while it is navigating the enormous abstract space of possible output code tokens; and it can’t easily backtrack to correct itself as it generates code in a one-way linear stream. But after it finishes generating code, when the user asks it to review that code for possible errors, it can now “see” and analyze all of that code at once. This comprehensive view of the code may enable ChatGPT to find bugs better, even if it couldn’t avoid introducing those bugs in the first place due to how it incrementally generates code in a one-way stream. (This isn’t an accurate technical explanation, but it’s how I informally think about it.)

Intermission 2: ChatGPT as a UX Design Consultant

Now that I had a basic Chrome extension that could extract paper publication dates from webpages, the next challenge was using the Spotify API to play era-appropriate Taylor Swift songs to accompany these papers. But before embarking on another coding-intensive adventure, I wanted to switch gears and think more about UX (user experience). I got so caught up in the first few hours of getting my extension set up that I hadn’t thought about how this app ought to work in detail. What I needed at this time was a UX design consultant, so I wanted to see if ChatGPT could play this role.

Note that up until now I had been doing everything in one long-running chat session that focused on coding-related questions. That was great because ChatGPT was fully “in the zone” and had a very long conversation (spanning several hours over multiple days) to use as context for generating code suggestions and technical explanations. But I didn’t want all that prior context to influence our UX discussion, so I decided to begin again by starting a brand-new session with the following prompt:

You are a Ph.D. graduate in Human-Computer Interaction and now a senior UX (user experience) designer at a top design firm. Thus, you are very familiar with both the experience of reading academic papers in academia and also designing amazing user experiences in digital products such as web applications. I am a professor who is creating a Chrome Extension for fun in order to prototype the following idea: I want to make the experience of reading academic papers more immersive by automatically playing Taylor Swift songs from the time period when each paper was written while the reader is reading that particular paper in Chrome. I have already set up all the code to connect to the Spotify Web API to programmatically play Taylor Swift songs from certain time periods. I have also already set up a basic Chrome Extension that knows what webpages the user has open in each tab and, if it detects that a webpage may contain metadata about an academic paper then it parses that webpage to get the year the paper was written in, in order to tell the extension what song to play from Spotify. That is the basic premise of my project.

Your job is to serve as a UX design consultant to help me design the user experience for such a Chrome Extension. Do not worry about whether it is feasible to implement the designs. I am an experienced programmer so I will tell you what ideas are or are not feasible to implement. I just want your help with thinking through UX design.

As our session progressed, I was very impressed with ChatGPT’s ability to help me brainstorm how to handle different user interaction scenarios. That said, I had to give it some guidance upfront using my knowledge of UX design: I started by asking it to come up with a few user personas and then to build up some user journeys for each. Given this initial prompting, ChatGPT was able to help me come up with practical ideas that I didn’t originally consider all too well, especially for handling unusual edge cases (e.g., what should happen to the music when the user switches between tabs very quickly?). The back-and-forth conversational nature of our chat made me feel like I was talking to a real human UX design consultant.

I had a lot of fun working with ChatGPT to refine my initial high-level ideas into a detailed plan for how to handle specific user interactions within Swift Papers. The culmination of our consulting session was ChatGPT generating ASCII diagrams of user journeys through Swift Papers, which I could later refer to when implementing this logic in code. Here’s one example:

Reflecting back, this session was productive because I was familiar enough with UX design concepts to steer the conversation towards more depth. Out of curiosity, I started a new chat session with exactly the same UX consultant prompt as above but then played the part of a total novice instead of guiding it:

I don’t know anything about UX design. Can you help me get started since you are the expert?

The conversation that followed this prompt was far less useful since ChatGPT ended up giving me a basic primer on UX Design 101 and offering high-level suggestions for how I can start thinking about the user experience of Swift Papers. I didn’t want to nudge it too hard since I was pretending to be a novice, and it wasn’t proactive enough to ask me clarifying questions to probe deeper. Perhaps if I had prompted it to be more proactive at the start, then it could have elicited more information even from a novice.

This digression reinforces the widely-known consensus that what you get out of LLMs like ChatGPT is only as good as the prompts you’re able to put in. There’s all of this relevant knowledge hiding inside its neural network mastermind of billions and billions of LLM parameters, but it’s up to you to coax it into revealing what it knows by taking the lead in conversations and crafting the right prompts to direct it toward useful responses. Doing so requires a degree of expertise in the domain you’re asking about, so it’s something that beginners would likely struggle with.

The Last Big Hurdle: Working with the Spotify API

After ChatGPT helped me with UX design, the last hurdle I had to overcome was figuring out how to connect my Chrome extension to the Spotify Web API to select and play music. Like my earlier adventure with installing a date parsing library, connecting to web APIs is another common real-world programming task, so I wanted to see how well ChatGPT could help me with it.

The gold standard here is an expert human programmer who has a lot of experience with the Spotify API and who is good at teaching novices. ChatGPT was alright for getting me started but ultimately didn’t meet this standard. My experience here showed me that human experts still outperform the current version of ChatGPT along the following dimensions:

  • Context, context, context: Since ChatGPT can’t “see” my screen, it lacks a lot of useful task context that a human expert sitting beside me would have. For instance, connecting to a web API requires a lot of “pointing-and-clicking” manual setup work that isn’t programming: I had to register for a paid Spotify Premium account to grant me API access, navigate through its web dashboard interface to create a new project, generate API keys and insert them into various places in my code, then register a URL where my app lives in order for authentication to work. But what URL do I use? Swift Papers is a Chrome extension running locally on my computer rather than online, so it doesn’t have a real URL. I later discovered that Chrome extensions export a fake chromiumapp.org URL that can be used for web API authentication. A human expert who is pair programming with me would know all these ultra-specific idiosyncrasies and guide me through pointing-and-clicking on the various dashboards to put all the API keys and URLs in the right places. In contrast, since ChatGPT can’t see this context, I have to explicitly tell it what I want at each step. And since this setup process was so new to me, I had a hard time thinking about how to phrase my questions. A human expert would be able to see me struggling and step in to offer proactive assistance for getting me unstuck.
  • Bird’s-eye view: A human expert would also understand what I’m trying to do—selecting and playing date-appropriate songs—and guide me on how to navigate the labyrinth of the sprawling Spotify API in order to do it. In contrast, ChatGPT doesn’t seem to have as much of a bird’s-eye view, so it eagerly barrels ahead to generate code with specific low-level API calls whenever I ask it something. I, too, am eager to follow its lead since it sounds so confident each time it suggests code along with a convincing explanation (LLMs tend to adopt an overconfident tone, even if their responses may be factually inaccurate). That sometimes leads me on a wild goose chase down one direction only to realize that it’s a dead-end and that I have to backtrack. More generally, it seems hard for novices to learn programming in this piecemeal way by churning through one ChatGPT response after another rather than having more structured guidance from a human expert.
  • Tacit (unwritten) knowledge: The Spotify API is meant to control an already-open Spotify player (e.g., the web player or a dedicated app), not to directly play songs. Thus, ChatGPT told me it was not possible to use it to play songs in the current browser tab, which Swift Papers needed to do. I wanted to verify this for myself, so I went back to “old-school” searching the web, reading docs, and looking for example code online. I found that there was conflicting and unreliable information about whether it’s even possible to do this. And since ChatGPT is trained on text from the internet, if that text doesn’t contain high-quality information about a topic, then ChatGPT won’t work well for it either. In contrast, a human expert can draw upon their vast store of experience from working with the Spotify API in order to teach me tricks that aren’t well-documented online. In this case, I eventually figured out a hack to get playback working by forcing a Spotify web player to open in a new browser tab, using a super-obscure and not-well-documented API call to make that player ‘active’ (or else it sometimes won’t respond to requests to play … that took me forever to figure out, and ChatGPT kept giving me inconsistent responses that didn’t work), and then playing music within that background tab. I feel that humans are still better than LLMs at coming up with these sorts of hacks since there aren’t readily-available online resources to document them. A lot of this hard-earned knowledge is tacit and not written down anywhere, so LLMs can’t be trained on it.
  • Lookahead: Lastly, even in instances when ChatGPT could help out by generating good-quality code, I often had to manually update other source code files to make them compatible with the new code that ChatGPT was giving me. For instance, when it suggested an update to a JavaScript file to call a specific Chrome extension API function, I also had to modify my manifest.json to grant an additional permission before that function call could work (bitten by permissions again!). If I didn’t know to do that, then I would see some mysterious error message pop up, paste it into ChatGPT, and it would sometimes give me a way to fix it. Just like earlier, ChatGPT “knows” the answer here, but I must ask it the right question at every step along the way, which can get exhausting. This is especially a problem for novices since we often don’t know what we don’t know, so we don’t know what to even ask for in the first place! In contrast, a human expert who is helping me would be able to “look ahead” a few steps based on their experience and tell me what other files I need to edit ahead of time so I don’t get bitten by these bugs in the first place.

In the end I got this Spotify API setup working by doing some old-fashioned web searching to supplement my ChatGPT conversation. (I did try the ChatGPT + Bing web search plugin for a bit, but it was slow and didn’t produce useful results, so I couldn’t tolerate it any more and just shut it off.) The breakthrough came as I was browsing a GitHub repository of Spotify Web API example code. I saw an example for Node.js that seemed to do what I wanted, so I copy-pasted that code snippet into ChatGPT and told it to adapt the example for my Swift Papers app (which isn’t using Node.js):

Here’s some example code using Implicit Grant Flow from Spotify’s documentation, which is for a Node.js app. Can you adapt it to fit my chrome extension? [I pasted the code snippet here]

ChatGPT did a good job at “translating” that example into my context, which was exactly what I needed at the moment to get unstuck. The code it generated wasn’t perfect, but it was enough to start me down a promising path that would eventually lead me to get the Spotify API working for Swift Papers. Reflecting back, I later realized that I had manually done a simple form of RAG (Retrieval Augmented Generation) here by using my intuition to retrieve a small but highly-relevant snippet of example code from the vast universe of all code on the internet and then asking a super-specific question about it. (However, I’m not sure a beginner would be able to scour the web to find such a relevant piece of example code like I did, so they would probably still be stuck at this step because ChatGPT alone wasn’t able to generate working code without this extra push from me.)

Epilogue: What Now?

I have a confession: I didn’t end up finishing Swift Papers. Since this was a hobby project, I stopped working on it after about two weeks when my day-job got more busy. However, I still felt like I completed the initial hard parts and got a sense of how ChatGPT could (and couldn’t) help me along the way. To recap, this involved:

  • Setting up a basic Chrome extension and familiarizing myself with the concepts, permission settings, configuration files, and code components that must coordinate together to make it all work.
  • Installing third-party JavaScript libraries (such as a date parsing library) and configuring the npm and webpack toolchain so that these libraries work with Chrome extensions, especially given the strict security policies of Manifest v3.
  • Connecting to the Spotify Web API in such a way to support the kinds of user interactions that I needed in Swift Papers and dealing with the idiosyncrasies of accessing this API via a Chrome extension.
  • Sketching out detailed UX journeys for the kinds of user interactions to support and how Swift Papers can handle various edge cases.

After laying this groundwork, I was able to start getting into the flow of an edit-run-debug cycle where I knew exactly where to add code to implement a new feature, how to run it to assess whether it did what I intended, and how to debug. So even though I stopped working on this project due to lack of time, I got far enough to see how completing Swift Papers would be “just a matter of programming.” Note that I’m not trying to trivialize the challenges involved in programming, since I’ve done enough of it to know that the devil is in the details. But these coding-specific details are exactly where AI tools like ChatGPT and GitHub Copilot shine! So even if I had continued adding features throughout the coming weeks, I don’t feel like I would’ve gotten any insights about AI tools that differ from what many others have already written about. That’s because once the software environment has been set up (e.g., libraries, frameworks, build systems, permissions, API authentication keys, and other plumbing to hook things together), then the task at hand reduces to a self-contained and well-defined programming problem, which AI tools excel at.

In sum, my goal in writing this article was to share my experiences using ChatGPT for the more open-ended tasks that came before my project turned into “just a matter of programming.” Now, some may argue that this isn’t “real” programming since it feels like just a bunch of mundane setup and configuration work. But I believe that if “real-world” programming means creating something realistic with code, then “real-real-world” programming (the title of this article!) encompasses all these tedious and idiosyncratic errands that are necessary before any real programming can begin. And from what I’ve experienced so far, this sort of work isn’t something humans can fully outsource to AI tools yet. Long story short, someone today can’t just give AI a high-level description of Swift Papers and have a robust piece of software magically pop out the other end. I’m sure people are now working on the next generation of AI that can bring us closer to this goal (e.g., much longer context windows with Claude 2 and retrieval augmented generation with Cody), so I’m excited to see what’s in store. Perhaps future AI tool developers could use Swift Papers as a benchmark to assess how well their tool performs on an example real-real-world programming task. Right now, widely-used benchmarks for AI code generation (e.g., HumanEval, MBPP) consist of small self-contained tasks that appear in introductory classes, coding interviews, or programming competitions. We need more end-to-end, real-world benchmarks to drive improvements in these AI tools.

Lastly, switching gears a bit, I also want to think more in the future about how AI tools can teach novices the skills they need to create realistic software projects like Swift Papers rather than doing all the implementation work for them. At present, ChatGPT and Copilot are reasonably good “doers” but not nearly as good at being teachers. This is unsurprising since they were designed to carry out instructions like a good assistant would, not to be an effective teacher who provides pedagogically-meaningful guidance. With the proper prompting and fine-tuning, I’m sure they can do much better here, and organizations like Khan Academy are already customizing GPT-4 to become a personalized tutor. I’m excited to see how things progress in this fast-moving space in the coming months and years. In the meantime, for more thoughts about AI coding tools in education, check out this other recent Radar article that I co-authored, Teaching Programming in the Age of ChatGPT, which summarizes our research paper about this topic.

Categories: Technology

Teaching Programming in the Age of ChatGPT

O'Reilly Radar - Tue, 2023/07/18 - 02:55

Imagine for a minute that you’re a programming instructor who’s spent many hours making creative homework problems to introduce your students to the world of programming. One day, a colleague tells you about an AI tool called ChatGPT. To your surprise (and alarm), when you give it your homework problems, it solves most of them perfectly, maybe even better than you can! You realize that by now, AI tools like ChatGPT and GitHub Copilot are good enough to solve all of your class’s homework problems and affordable enough that any student can use them. How should you teach students in your classes knowing that these AI tools are widely available?

I’m Sam Lau from UC San Diego, and my Ph.D. advisor (and soon-to-be faculty colleague) Philip Guo and I are presenting a research paper at the International Computing Education Research conference (ICER) on this very topic. We wanted to know:

How are computing instructors planning to adapt their courses as more and more students start using AI coding assistance tools such as ChatGPT and GitHub Copilot?

To answer this question, we gathered a diverse sample of perspectives by interviewing 20 introductory programming instructors at universities across 9 countries (Australia, Botswana, Canada, Chile, China, Rwanda, Spain, Switzerland, United States) spanning all 6 populated continents. To our knowledge, our paper is the first empirical study to gather instructor perspectives about these AI coding tools that more and more students will likely have access to in the future.

Here’s a summary of our findings:

Short-Term Plans: Instructors Want to Stop Students from Cheating

Even though we didn’t specifically ask about cheating in our interviews, all of the instructors we interviewed mentioned it as a primary reason to make changes to their courses in the short term. Their reasoning was: If students could easily get answers to their homework questions using AI tools, then they won’t need to think deeply about the material, and thus won’t learn as much as they should. Of course, having an answer key isn’t a new problem for instructors, who have always worried about students copying off each other or online resources like Stack Overflow. But AI tools like ChatGPT generate code with slight variations between responses, which is enough to fool most plagiarism detectors that instructors have available today.

The deeper issue for instructors is that if AI tools can easily solve problems in introductory courses, students who are learning programming for the first time might be led to believe that AI tools can correctly solve any programming task, which can cause them to grow overly reliant on them. One instructor described this as not just cheating, but “cheating badly” because AI tools generate code that’s incorrect in subtle ways that students might not be able to understand.

To discourage students from becoming over-reliant on AI tools, instructors used a mix of strategies, including making exams in-class and on-paper, and also having exams count for more of students’ final grades. Some instructors also explicitly banned AI tools in class, or exposed students to the limitations of AI tools. For example, one instructor copied old homework questions into ChatGPT as a live demo in a lecture and asked students to critique the strengths and weaknesses of the AI-generated code. That said, instructors considered these strategies short-term patches; the sudden appearance of ChatGPT at the end of 2022 meant that instructors needed to make adjustments before their courses started in 2023, which was when we interviewed them for our study.

Longer-Term Plans (Part 1): Ideas to Resist AI Tools

In the next part of our study, instructors brainstormed many ideas about how to approach AI tools longer-term. We split up these ideas into two main categories: ideas that resist AI tools, and ideas that embrace them. Do note that most instructors we interviewed weren’t completely on one side or the other—they shared a mix of ideas from both categories. That said, let’s start with why some instructors talked about resisting AI tools, even in the longer term.

The most common reason for wanting to resist AI tools was the concern that students wouldn’t learn the fundamentals of programming. Several instructors drew an analogy to using a calculator in math class: using AI tools could be like, in the words of one of our interview participants, “giving kids a calculator and they can play around with a calculator, but if they don’t know what a decimal point means, what do they really learn or do with it? They may not know how to plug in the right thing, or they don’t know how to interpret the answer.” Others mentioned ethical objections to AI. For example, one instructor was worried about recent lawsuits around Copilot’s use of open-source code as training data without attribution. Others shared concerns over the training data bias for AI tools.

To resist AI tools practically, instructors proposed ideas for designing “AI-proof” homework assignments, for example, by using a custom-built library for their course. Also, since AI tools are typically trained on U.S./English-centric data, instructors from other countries thought that they could make their assignments harder for AI to solve by including local cultural and language context (e.g. slang) from their countries.

Instructors also brainstormed ideas for AI-proof assessments. One common suggestion was to use in-person paper exams since proctors could better ensure that students were only using paper and pencil. Instructors also mentioned that they could try oral exams where students either talk to a course staff member in-person, or record a video explaining what their code does. Although these ideas were first suggested to help keep assessments meaningful, instructors also pointed out that these assessments could actually improve pedagogy by giving students a reason to think more deeply about why their code works rather than simply trying to get code that produces a correct answer.

Longer-Term Plans (Part 2): Ideas to Embrace AI Tools

Another group of ideas sought to embrace AI tools in introductory programming courses. The instructors we interviewed mentioned several reasons for wanting this future. Most commonly, instructors felt that AI coding tools would become standard for programmers; since “it’s inevitable” that professionals will use AI tools on the job, instructors wanted to prepare students for their future jobs. Related to this, some instructors thought that embracing AI tools could make their institutions more competitive by getting ahead of other universities that were more hesitant about doing so.

Instructors also saw potential learning benefits to using AI tools. For example, if these tools make it so that students don’t need to spend as long wrestling with programming syntax in introductory courses, students could spend more time learning about how to better design and engineer programs. One instructor drew an analogy to compilers: “We don’t need to look at 1’s and 0’s anymore, and nobody ever says, ‘Wow what a big problem, we don’t write machine language anymore!’ Compilers are already like AI in that they can outperform the best humans in generating code.” And in contrast to concerns that AI tools could harm equity and access, some instructors thought that they could make programming less intimidating and thus more accessible by letting students start coding using natural language.

Instructors also saw many potential ways to use AI tools themselves. For example, many taught courses with over a hundred students, where it would be too time-consuming to give individual feedback to each student. Instructors thought that AI tools trained on their class’s data could potentially give personalized help to each student, for example by explaining why a piece of code doesn’t work. Instructors also thought AI tools could help generate small practice problems for their students.

To prepare students for a future where AI tools are widespread, instructors mentioned that they could spend more time in class on code reading and critique rather than writing code from scratch. Indeed, these skills could be useful in the workplace even today, where programmers spend significant amounts of time reading and reviewing other people’s code. Instructors also thought that AI tools gave them the opportunity to give more open-ended assignments, and even have students collaborate with AI directly on their work, where an assignment would ask students to generate code using AI and then iterate on the code until it was both correct and efficient.

Reflections

Our study findings capture a rare snapshot in time in early 2023 as computing instructors are just starting to form opinions about this fast-growing phenomenon but have not yet converged to any consensus about best practices. Using these findings as inspiration, we synthesized a diverse set of open research questions regarding how to develop, deploy, and evaluate AI coding tools for computing education. For instance, what mental models do novices form both about the code that AI generates and about how the AI works to produce that code? And how do those novice mental models compare to experts’ mental models of AI code generation? (Section 7 of our paper has more examples.)

We hope that these findings, along with our open research questions, can spur conversations about how to work with these tools in effective, equitable, and ethical ways.

Check out our paper here and email us if you’d like to discuss anything related to it!
From “Ban It Till We Understand It” to “Resistance is Futile”: How University Programming Instructors Plan to Adapt as More Students Use AI Code Generation and Explanation Tools such as ChatGPT and GitHub Copilot. Sam Lau and Philip J. Guo. ACM Conference on International Computing Education Research (ICER), August 2023.

Categories: Technology

Fearing the Wrong Thing

O'Reilly Radar - Tue, 2023/07/11 - 03:12

There’s a lot of angst about software developers “losing their jobs” to AI, being replaced by a more intelligent version of ChatGPT, GitHub’s Copilot, Google’s Codey, or something similar. Matt Welsh has been talking and writing about the end of programming as such. He’s asking whether large language models eliminate programming as we know it, and he’s excited that the answer is “yes”: eventually, if not in the immediate future. But what does this mean in practice? What does this mean for people who earn their living from writing software?

Some companies will certainly value AI as a tool for replacing human effort, rather than for augmenting human capabilities. Programmers who work for those companies risk losing their jobs to AI. If you work for one of those organizations, I’m sorry for you, but it’s really an opportunity. Despite the well-publicized layoffs, the job market for programmers is great, it’s likely to remain great, and you’re probably better off finding an employer who doesn’t see you as an expense to be minimized. It’s time to learn some new skills and find an employer who really values you.

But the number of programmers who are “replaced by AI” will be small.  Here’s why and how the use of AI will change the discipline as a whole. I did a very non-scientific study of the amount of time programmers actually spend writing code. OK, I just typed “How much of a software developer’s time is spent coding” into the search bar and looked at the top few articles, which gave percentages ranging from 10% to 40%. My own sense, from talking to and observing many people over the years, falls into the lower end of that range: 15% to 20%.

ChatGPT won’t make the 20% of their time that programmers spend writing code disappear completely. You still have to write prompts, and we’re all in the process of learning that if you want ChatGPT to do a good job, the prompts have to be very detailed. How much time and effort does that save? I’ve seen estimates as high as 80%, but I don’t believe them; I think 25% to 50% is more reasonable. If 20% of your time is spent coding, and AI-based code generation makes you 50% more efficient, then you’re really only getting about 10% of your time back. You can use it to produce more code—I’ve yet to see a programmer who was underworked, or who wasn’t up against an impossible delivery date. Or you can spend more time on the “rest of the job,” the 80% of your time that wasn’t spent writing code. Some of that time is spent in pointless meetings, but much of “the rest of the job” is understanding the user’s needs, designing, testing, debugging, reviewing code, finding out what the user really needs (that they didn’t tell you the first time), refining the design, building an effective user interface, auditing for security, and so on. It’s a lengthy list.

That “rest of the job” (particularly the “user’s needs” part) is something our industry has never been particularly good at. Design—of the software itself, the user interfaces, and the data representation—is certainly not going away, and isn’t something the current generation of AI is very good at. We’ve come a long way, but I don’t know anyone who hasn’t had to rescue code that was best described as a “seething mass of bits.” Testing and debugging—well, if you’ve played with ChatGPT much, you know that testing and debugging won’t disappear. AIs generate incorrect code, and that’s not going to end soon. Security auditing will only become more important, not less; it’s very hard for a programmer to understand the security implications of code they didn’t write. Spending more time on these things—and leaving the details of pushing out lines of code to an AI—will surely improve the quality of the products we deliver.

Now, let’s take a really long term view. Let’s assume that Matt Welsh is right, and that programming as we know it will disappear—not tomorrow, but sometime in the next 20 years. Does it really disappear? A couple of weeks ago, I showed Tim O’Reilly some of my experiments with Ethan and Lilach Mollick’s prompts for using AI in the classroom. His reaction was “This prompt is really programming.” He’s right. Writing a detailed prompt really is just a different form of programming. You’re still telling a computer what you want it to do, step by step. And I realized that, after spending 20 years complaining that programming hasn’t changed significantly since the 1970s, ChatGPT has suddenly taken that next step. It isn’t a step towards some new paradigm, whether functional, object oriented, or hyperdimensional. I expected the next step in programming languages to be visual, but it isn’t that either. It’s a step towards a new kind of programming that doesn’t require a formally defined syntax or semantics. Programming without virtual punch cards. Programming that doesn’t require you to spend half your time looking up the names and parameters of library functions that you’ve forgotten about.

In the best of all possible worlds, that might bring the time spent actually writing code down to zero, or close to it. But that best case only saves 20% of a programmer’s time. Furthermore, it doesn’t really eliminate programming. It changes it—possibly making programmers more efficient, and definitely giving programmers more time to talk to users, understand the problems they face, and design good, secure systems for solving those problems. Counting lines of code is less important than understanding problems in depth and figuring out how to solve them—but that’s nothing new. Twenty years ago, the Agile Manifesto pointed in this direction, valuing:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

Despite 23 years of “agile practices,” customer collaboration has always been shortchanged. Without engaging with customers and users, Agile quickly collapses to a set of rituals. Will freeing programmers from syntax actually yield more time to collaborate with customers and respond to change? To prepare for this future, programmers will need to learn more about working directly with customers and designing software that meets their needs. That’s an opportunity, not a disaster. Programmers have labored too long under the stigma of being neckbeards who can’t and shouldn’t be allowed to talk to humans. It’s time to reject that stereotype, and to build software as if people mattered.

AI isn’t something to be feared. Writing about OpenAI’s new Code Interpreter plug-in (gradually rolling out now), Ethan Mollick says “My time becomes more valuable, not less, as I can concentrate on what is important, rather than the rote.” AI is something to be learned, tested, and incorporated into programming practices so that programmers can spend more time on what’s really important: understanding and solving problems. The endpoint of this revolution won’t be an unemployment line; it will be better software. The only thing to be feared is failing to make that transition.

Programming isn’t going to go away. It’s going to change, and those changes will be for the better.

Categories: Technology

FOSSY 2023

FAIF - Wed, 2023/07/05 - 04:02

Come to FOSSY 2023!

Show Notes: FOSSY 2023 will happen next week in Portland, OR, USA.

Send feedback and comments on the cast to <oggcast@faif.us>. You can keep in touch with Free as in Freedom by following Conservancy on on Twitter and and FaiF on Twitter. We are working on setting up a group chat again, too!

Free as in Freedom is produced by Dan Lynch of danlynch.org. Theme music written and performed by Mike Tarantino with Charlie Paxson on drums.

The content of this audcast, and the accompanying show notes and music are licensed under the Creative Commons Attribution-Share-Alike 4.0 license (CC BY-SA 4.0).

Categories: Free Software

Radar Trends to Watch: July 2023

O'Reilly Radar - Wed, 2023/07/05 - 03:14

A surprising number of the entries for AI are about generative models that don’t generate text or artwork—specifically, they generate human voices or music. Is voice the next frontier for AI? Google’s AudioPaLM, which unites speech recognition, speech synthesis, and language modeling, may show the direction in which AI is heading. There’s also increasing concern about the consequences of training AI on data that was generated by AI. With less input from real humans, does “model collapse” lead to output that is mediocre at best?

AI
  • RoboCat is an AI model for controlling robots that learns how to learn. Unlike most robotics, which are designed to perform a small number of tasks, RoboCat can learn new tasks after it is deployed, and the learning process speeds up as it learns more tasks.
  • AudioPaLM is a new language model from Google that combines speech generation, speech understanding, and natural language processing. It’s a large language model that understands and produces voice.
  • Voicemod is a tool for turning human speech into AI-generated speech in real time. The company offers a number of “sonic avatars” that can be further customized.
  • Tree-of-thought prompting expands on chain-of-thought by causing language models to consider multiple reasoning paths in the process of generating an output.
  • Facebook/Meta has built a new generative speech model called Voicebox that they claim surpasses the performance of other models. They have not released an open source version. The paper describes some ways to distinguish generated speech from human speech.
  • MIT Technology Review provides a good summary of key points in the EU’s draft proposal for regulating AI. It will probably take at least two years for this proposal to move through legislative channels.
  • OpenLLM provides support for running a number of open source large language models in production. It includes the ability to integrate with tools like Bento; support for langchain is promised soon.
  • Infinigen is a photorealistic natural-world 3D scene generator. It is designed to generate synthetic training data for AI systems. It currently generates terrains, plants, animals, and natural phenomena like weather; built objects may be added later.
  • Facebook/Meta has created a new large model called I-JEPA (Image Joint Embedding Predictive Architecture). It claims to be more efficient than other models, and to work by building a higher-level model of the world, as humans do. It is a first step towards implementing Yann Lecun’s ideas about next-generation artificial intelligence.
  • MusicGen is a new generative model for music from Facebook/Meta. It sounds somewhat more convincing than other music models, but it’s not clear that it can do more than reassemble musical cliches.
  • OpenAI has added a “function calling” API. The API allows an application to describe functions to the model. If GPT needs to call one of those functions, it returns a JSON object describing the function call. The application can call the function and return the result to the model.
  • A study claims that AWS Mechanical Turk workers are using AI to do their work. Mechanical Turk is often used to generate or label training data for AI systems. What impact will the use of AI to generate training data have on future generations of AI?
  • What happens when generative AI systems are trained on data that they’ve produced? When Copilot is trained on code generated by Copilot, or GPT-4 on web content generated by GPT-4? Model collapse: the “long tails” of the distribution disappear, and the quality of the output suffers.
  • FrugalGPT is an idea for reducing the cost of using large language models like GPT-4. The authors propose using pipeline of language models (GPT-J, GPT-3, and GPT-4), refining the prompt at each stage so that most of the processing is done by free or inexpensive models.
  • Deep Mind’s AlphaDev has used AI to speed up sorting algorithms. Their software worked at the assembly language level; when they were done, they converted the code to back to C++ and submitted it to the LLVM project, which has included it in the C++ standard library.
  • An artist has used Stable Diffusion to create functional QR codes that are also works of art and posted them on Reddit.
  • The movement to regulate AI needs to learn from nuclear non-proliferation, where the key element isn’t hypothetical harms (we all know what bombs can do), but traceability and transparency. Model Cards and Datasheets for Datasets are a good start.
  • Sam Altman talks about ChatGPT’s plans, saying that it’s currently compute-bound and needs more GPUs. This bottleneck is delaying features like custom fine-tuning the model, expanding the context window, and multimodality (i.e., images).
  • Facebook/Meta’s LIMA is a 65B parameter language model that’s based on LLaMa, but was fine-tuned on only 1,000 carefully chosen prompts and responses, without the use of RLHF (reinforcement learning with human feedback).
  • Some things have to happen. Gandalf is a prompt injection game; your task is to get an AI to reveal its password.
Programming
  • Leptos is a new open source, full-stack, fully typed web framework for Rust. (How many days is it since the last Web Platform?)
  • In the not-too-distant future, WebAssembly may replace containers; software deployed as WebAssembly is portable and much smaller.
  • Adam Jacob talks about revitalizing DevOps with a new generation of tooling that uses insights from multiplayer games and digital twins.
  • Alex Russell on improving web performance for the majority of users, who have midrange or low-end smartphones: JavaScript is useful, but on many sites it is a huge burden.
  • Doug Crockford says that it’s time to stop using JavaScript and move on to newer, better, next-generation programming languages.
  • Wing is a new programming language with high-level abstractions for the cloud. The claim is that these abstractions will make it easier for AI code generation to write cloud-native programs.
  • Simpleaichat is a Python package that simplifies writing programs that use GPT 3.5 or GPT 4.
  • StarCoder and StarCoderBase form an open source language model for writing software (similar to Codex). It was trained on “​​a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process.”
  • How do you measure developer experience? Metrics tend to be technical, ignoring personal issues like developer satisfaction, the friction they encounter day-to-day, and other aspects of lived experience.
  • OpenChat is an open source chat console that is designed to connect to a large language model (currently GPT-*). It allows anyone to create their own customized chat bot. It supports unlimited memory (using PineconeDB), and plans to add support for other language models.
  • WebAssembly promises to improve runtime performance and latency on both the browser and the back end. It also promises to allow developers to create packages that run in any environment: Kubernetes clusters, edge devices, etc. But this capability is still a work in progress.
  • People have started talking about software defined cars. This is an opportunity to rethink security from the ground up—or to create a much bigger attack surface.
  • LQML is a programming language designed for prompting language models. It’s an early example of a formal informal language for communicating with AI systems.
  • Memory Spy is a web application that runs simple C programs and shows you how variables are represented in memory. Even if you aren’t a C programmer, you will learn a lot about how software works. Memory Spy was created by Julia Evans, @b0rk. Julia’s latest zine about how computers represent integer and floating point numbers is also well worth reading.
Augmented and Virtual Reality
  • David Pogue’s review of Apple Vision, the $3500 AR headset: Limited in a way that’s reminiscent of the first iPhone—“But no headset, no device, has ever hit this high a number on the wonder scale before.”
  • Apple did it: they unveiled their AR/VR goggles. They are very expensive ($3499), look something like skiing googles, and have two hours of battery life on an external battery pack. It’s hard to imagine wearing them in public, though Apple may manage to make them fashionable.
  • Apple’s big challenge with the Vision Pro goggles may not be getting people to use them; it may be getting developers to write compelling apps. Merely translating 2D apps into a 3D environment isn’t likely to be satisfactory. How can software really take advantage of 3D?
  • Tim Bray’s post on what Augmented Reality is, and what that will require from software developers, is a must-read. It’s not Apple Vision.
  • Hachette has created a Metaverse experience named “Beyond the Pages,” in part as an attempt to attract a younger audience. While the original experience was only open for two days, they have promised to schedule more.
Security
  • Ransomware is getting faster, which means that organizations have even less time to respond to an attack. To prevent becoming a victim, focus on the basics: access controls, strong passwords, multi-factor authentication, zero trust, penetration testing, and good backups.
  • The number of attacks against systems running in “the cloud” is increasing rapidly. The biggest dangers are still errors in basic hygiene, including misconfigured identity and access management.
  • AI Package Hallucination is a new technique for distributing malware. Ask a question that causes an AI to hallucinate a package or library. Create malware with that package name, and put it in an appropriate repository. Wait for someone else to get the same recommendation and install the malware. (This assumes AI hallucinations are consistent; I’m not sure that’s true.)
Web
  • A new standard allows NFTs to contain wallets, which contain NFTs. Users build collections of related resources. In addition to gaming (a character that “owns” its paraphernalia), this could be used for travel (a trip that contains tickets to events) or customer loyalty programs.
  • The W3C has announced a new web standard for secure payment confirmation. The standard is intended to make checkout simpler and less prone to fraud.
  • Tyler Cowen argues that cryptocurrency will play a role for transactions between AI systems. AI systems aren’t allowed have their own bank accounts, and that’s unlikely to change in the near future. However, as they come into wider use, they will need to make transactions.
  • Web or mobile performance isn’t discussed as much as it should be. Here’s a good post on on improving Wikipedia performance by eliminating a specific blocking problem: removing unnecessary JavaScript, and optimizing what remains.
Quantum Computing
Categories: Technology

Risk Management for AI Chatbots

O'Reilly Radar - Tue, 2023/06/27 - 03:06

Does your company plan to release an AI chatbot, similar to OpenAI’s ChatGPT or Google’s Bard? Doing so means giving the general public a freeform text box for interacting with your AI model.

That doesn’t sound so bad, right? Here’s the catch: for every one of your users who has read a “Here’s how ChatGPT and Midjourney can do half of my job” article, there may be at least one who has read one offering “Here’s how to get AI chatbots to do something nefarious.” They’re posting screencaps as trophies on social media; you’re left scrambling to close the loophole they exploited.

Welcome to your company’s new AI risk management nightmare.

So, what do you do? I’ll share some ideas for mitigation. But first, let’s dig deeper into the problem.

Old Problems Are New Again

The text-box-and-submit-button combo exists on pretty much every website. It’s been that way since the web form was created roughly thirty years ago. So what’s so scary about putting up a text box so people can engage with your chatbot?

Those 1990s web forms demonstrate the problem all too well. When a person clicked “submit,” the website would pass that form data through some backend code to process it—thereby sending an e-mail, creating an order, or storing a record in a database. That code was too trusting, though. Malicious actors determined that they could craft clever inputs to trick it into doing something unintended, like exposing sensitive database records or deleting information. (The most popular attacks were cross-site scripting and SQL injection, the latter of which is best explained in the story of “Little Bobby Tables.”)

With a chatbot, the web form passes an end-user’s freeform text input—a “prompt,” or a request to act—to a generative AI model. That model creates the response images or text by interpreting the prompt and then replaying (a probabilistic variation of) the patterns it uncovered in its training data.

That leads to three problems:

  1. By default, that underlying model will respond to any prompt.  Which means your chatbot is effectively a naive person who has access to all of the information from the training dataset. A rather juicy target, really. In the same way that bad actors will use social engineering to fool humans guarding secrets, clever prompts are a form of  social engineering for your chatbot. This kind of prompt injection can get it to say nasty things. Or reveal a recipe for napalm. Or divulge sensitive details. It’s up to you to filter the bot’s inputs, then.
  2. The range of potentially unsafe chatbot inputs amounts to “any stream of human language.” It just so happens, this also describes all possible chatbot inputs. With a SQL injection attack, you can “escape” certain characters so that the database doesn’t give them special treatment. There’s currently no equivalent, straightforward way to render a chatbot’s input safe. (Ask anyone who’s done content moderation for social media platforms: filtering specific terms will only get you so far, and will also lead to a lot of false positives.)
  3. The model is not deterministic. Each invocation of an AI chatbot is a probabilistic journey through its training data. One prompt may return different answers each time it is used. The same idea, worded differently, may take the bot down a completely different road. The right prompt can get the chatbot to reveal information you didn’t even know was in there. And when that happens, you can’t really explain how it reached that conclusion.

Why haven’t we seen these problems with other kinds of AI models, then? Because most of those have been deployed in such a way that they are only communicating with trusted internal systems. Or their inputs pass through layers of indirection that structure and limit their shape. Models that accept numeric inputs, for example, might sit behind a filter that only permits the range of values observed in the training data.

What Can You Do?

Before you give up on your dreams of releasing an AI chatbot, remember: no risk, no reward.

The core idea of risk management is that you don’t win by saying “no” to everything. You win by understanding the potential problems ahead, then figure out how to steer clear of them. This approach reduces your chances of downside loss while leaving you open to the potential upside gain.

I’ve already described the risks of your company deploying an AI chatbot. The rewards include improvements to your products and services, or streamlined customer service, or the like. You may even get a publicity boost, because just about every other article these days is about how companies are using chatbots.

So let’s talk about some ways to manage that risk and position you for a reward. (Or, at least, position you to limit your losses.)

Spread the word: The first thing you’ll want to do is let people in the company know what you’re doing. It’s tempting to keep your plans under wraps—nobody likes being told to slow down or change course on their special project—but there are several people in your company who can help you steer clear of trouble. And they can do so much more for you if they know about the chatbot long before it is released.

Your company’s Chief Information Security Officer (CISO) and Chief Risk Officer will certainly have ideas. As will your legal team. And maybe even your Chief Financial Officer, PR team, and head of HR, if they have sailed rough seas in the past.

Define a clear terms of service (TOS) and acceptable use policy (AUP): What do you do with the prompts that people type into that text box? Do you ever provide them to law enforcement or other parties for analysis, or feed it back into your model for updates? What guarantees do you make or not make about the quality of the outputs and how people use them? Putting your chatbot’s TOS front-and-center will let people know what to expect before they enter sensitive personal details or even confidential company information. Similarly, an AUP will explain what kinds of prompts are permitted.

(Mind you, these documents will spare you in a court of law in the event something goes wrong. They may not hold up as well in the court of public opinion, as people will accuse you of having buried the important details in the fine print. You’ll want to include plain-language warnings in your sign-up and around the prompt’s entry box so that people can know what to expect.)

Prepare to invest in defense: You’ve allocated a budget to train and deploy the chatbot, sure. How much have you set aside to keep attackers at bay? If the answer is anywhere close to “zero”—that is, if you assume that no one will try to do you harm—you’re setting yourself up for a nasty surprise. At a bare minimum, you will need additional team members to establish defenses between the text box where people enter prompts and the chatbot’s generative AI model. That leads us to the next step.

Keep an eye on the model: Longtime readers will be familiar with my catchphrase, “Never let the machines run unattended.” An AI model is not self-aware, so it doesn’t know when it’s operating out of its depth. It’s up to you to filter out bad inputs before they induce the model to misbehave.

You’ll also need to review samples of the prompts supplied by end-users (there’s your TOS calling) and the results returned by the backing AI model. This is one way to catch the small cracks before the dam bursts. A spike in a certain prompt, for example, could imply that someone has found a weakness and they’ve shared it with others.

Be your own adversary: Since outside actors will try to break the chatbot, why not give some insiders a try? Red-team exercises can uncover weaknesses in the system while it’s still under development.

This may seem like an invitation for your teammates to attack your work. That’s because it is. Better to have a “friendly” attacker uncover problems before an outsider does, no?

Narrow the scope of audience: A chatbot that’s open to a very specific set of users—say, “licensed medical practitioners who must prove their identity to sign up and who use 2FA to login to the service”—will be tougher for random attackers to access. (Not impossible, but definitely tougher.) It should also see fewer hack attempts by the registered users because they’re not looking for a joyride; they’re using the tool to complete a specific job.

Build the model from scratch (to narrow the scope of training data): You may be able to extend an existing, general-purpose AI model with your own data (through an ML technique called transfer learning). This approach will shorten your time-to-market, but also leave you to question what went into the original training data. Building your own model from scratch gives you complete control over the training data, and therefore, additional influence (though, not “control”) over the chatbot’s outputs.

This highlights an added value in training on a domain-specific dataset: it’s unlikely that anyone would, say, trick the finance-themed chatbot BloombergGPT into revealing the secret recipe for Coca-Cola or instructions for acquiring illicit substances. The model can’t reveal what it doesn’t know.

Training your own model from scratch is, admittedly, an extreme option. Right now this approach requires a combination of technical expertise and compute resources that are out of most companies’ reach. But if you want to deploy a custom chatbot and are highly sensitive to reputation risk, this option is worth a look.

Slow down: Companies are caving to pressure from boards, shareholders, and sometimes internal stakeholders to release an AI chatbot. This is the time to remind them that a broken chatbot released this morning can be a PR nightmare before lunchtime. Why not take the extra time to test for problems?

Onward

Thanks to its freeform input and output, an AI-based chatbot exposes you to additional risks above and beyond using other kinds of AI models. People who are bored, mischievous, or looking for fame will try to break your chatbot just to see whether they can. (Chatbots are extra tempting right now because they are novel, and “corporate chatbot says weird things” makes for a particularly humorous trophy to share on social media.)

By assessing the risks and proactively developing mitigation strategies, you can reduce the chances that attackers will convince your chatbot to give them bragging rights.

I emphasize the term “reduce” here. As your CISO will tell you, there’s no such thing as a “100% secure” system. What you want to do is close off the easy access for the amateurs, and at least give the hardened professionals a challenge.

Many thanks to Chris Butler and Michael S. Manley for reviewing (and dramatically improving) early drafts of this article. Any rough edges that remain are mine.

Categories: Technology

AI’s Opaque Box Is Actually a Supply Chain

O'Reilly Radar - Tue, 2023/06/20 - 03:24

Understanding AI’s mysterious “opaque box” is paramount to creating explainable AI. This can be simplified by considering that AI, like all other technology, has a supply chain. Knowing what makes up the supply chain is critical to enforcing the security of the AI system, establishing trust with the consumer of the AI’s output, and protecting your organization from undue risk.

When pondering your approach to dissecting AI’s supply chain, consider how production, shipping, delivery, and invoicing are steps in just about any supply chain, for everything that you use, from toothpaste to technology. AI models are also created and delivered via supply chains.

Some of the steps in AI’s type of supply chain can be tricky to follow, with special gotchas like technology company trade secrets, closed code, and program synthesis—which is the process of AI writing its own code to improve itself. Combined with continuous machine learning cycles and deployments, reviews, and recalls, there are a lot of opportunities to bring transparency to the opaque box.

Blockchain technology is chosen by companies like Walmart to bring transparency to supply chains like food production and delivery, because it is tamper evident and distributed. Blockchain technology is used in an enterprise stack alongside other systems, to make integrations more secure and to establish a single audit trail. Verification, including that of the identity of all participants in a blockchain network, and compliance are woven throughout the workflow and processes.

Typically, an enterprise blockchain audit trail will consist of linked blocks containing transactions that reference hyperlinks to data that is stored off chain in traditional databases. Meanwhile, the system creates a cryptographic verification of that data and stores the verification on blockchain, which is comparable to the traditional process of providing a checksum to ensure integrity of a file download. If data on the blockchain network ever undergoes tampering, the cryptographic hash used as verification will no longer compute to the same value.

When you dissect AI’s supply chain, at the root, you will find algorithms. These are the mathematical formulas written to simulate functions of the brain, which underlie the AI programming. The algorithms are compiled into code libraries, and then distributed to AI developers who use them to write custom AI models. Meanwhile, a data scientist acquires and prepares training data, which is then used to bring the AI model to life.

University of Baltimore Law Professor Michelle Gillman, who fights to help people who were automatically denied benefits, recently spoke with NBC about the importance of understanding the origin of algorithms when managing AI risk. According to Gillman, whose clients often face life and death situations that are being decided by AI, “I’ve been in hearings where no one in the room can describe for me, how does the algorithm work? What factors does it weigh? How does it weigh those factors? And so you are really left unable to make a case for your client in those circumstances.”

Next, a workflow begins that implements an AI engineering and machine learning operations (MLOps) process, in which cycles of experiments and deployments are conducted, and the AI model, its data, and the variables, or hyperparameters of the experiment are tested, tweaked, and improved. This part of the supply chain keeps going in a cycle even after delivery to the consumer, since the training and improvement process is generally continuous. The consumer’s input in the form of reviews and ratings becomes part of the process to improve the model. The stakeholders of the project, such as the management of the organization that built the AI model, may also add their input and follow through to make sure it is considered.

If an organization is large, the AI model’s supply chain can involve extended teams and even multiple organizations. Or, it is entirely possible that by using cloud services and AI marketplaces, a single developer can perform all of these functions alone. In any case, you can add an enterprise blockchain technology, like Hyperledger Fabric, to the stack so you can track, trace, audit, and even recall your model.

An enterprise blockchain network is sometimes used to bring transparency to the supply chain. This helps network participants trust one another because they are members of the same blockchain network. The blockchain network is also really helpful when something goes wrong and a product needs to be quickly traced to its origin.

In the case of Walmart, they pioneered the use of enterprise blockchain to track and trace food that potentially carried a foodborne illness. For example, if a customer became sick from a package of sliced mangoes in any Walmart store, the mangoes had to be discarded at all of the stores because it took more than 6 days to trace the affected shipment. The new blockchain network cut this time to 2.2 seconds, saving Walmart the expense of discarding good mangoes. Walmart continues with their supply chain blockchain strategy today, which has become the foundation of automated payment systems for their many suppliers.

When this strategy is applied to AI’s opaque box, the convenience of a supply chain blockchain network will help you to track and trace important factors like the reason why an AI model’s intent or domain has drifted, or to learn what type of treatment was given to data that was used to produce a certain outcome. As explained in the O’Reilly book I co-authored, Blockchain Tethered AI, there are four blockchain controls for AI, which are:

  • Control 1, which is pre-establishing identity and workflow criteria for people and systems, can be used with AI to verify that data and models have not undergone tampering or corruption. This control includes criteria for telling humans apart from AI models.
  • Control 2 addresses distributing tamper-evident verification, which can be used with AI to make sure that the right people, systems, or intelligent agents—with the right authorization—are the only ones that participate in governance of or modification to the AI. This control can be used to create a tamper-evident audit trail of training data, even if that data is supplied by users in the form of chat history, as is the case with ChatGPT. A record can be stored on blockchain indicating whether the user has agreed to have their chat history used as training data or not, and if a chat is used as training data, the prompts within it can be reviewed by a human or intelligent agent for issues such as ethics violations, data sabotage, or other issues before it is used.
  • Control 3 involves governing, instructing, and inhibiting intelligent agents, and will become very important when wanting to trace or reverse AI, or prove in court that the output of AI is traceable to certain people or organizations. Reviews and ratings of how a model is performing can help to catch and address inappropriate or unethical output.
  • Control 4 is showing authenticity through user-viewable provenance, which will be especially important in using branded AI that has underlying components which come from distributed marketplaces. Understanding and proving provenance is a factor in legal issues involving AI.

This ability to track and trace can also be extended to the consumer, through the display of trust logos. The concept of trust logos, which are the long-time hallmark of e-commerce security, can be applied to AI by connecting the logos to the underlying blockchain network, and programming them to alert consumers should the AI model become compromised. A similar method could be used to show whether a customer service representative is an AI or a human.

Keep in mind that people in different roles may need different types of information in order to trust AI models. Depending on the perspective of the entity requesting the information, different levels of traceability could be desired. For example, a person answering their phone should be able to see an indicator as to whether a caller is AI, and whether or not the AI is from a trustworthy source. An engineer deciding whether or not to integrate AI components with their models would want a much deeper understanding of the supply chain, and a stakeholder might want to see if the reviews and comments are authentic and find out what is being done to address any recalls. This also brings up the question of a special handshake to enable AI models to trust one another and establish boundaries.

Even though you might not know everything about your AI model, you can commit the facts you do know to blockchain. Develop an AI factsheet as described in Chapter 1 of Blockchain Tethered AI. If you have used models that you have downloaded from marketplaces, you can typically find an AI model card and data cards that provide basic facts about the materials you are using. Also, you can always document that a part of the model is indeed “opaque,” and complete that part later once the details are known.

You can implement your blockchain network for your AI model’s supply chain in the same way that enterprise blockchain networks are used by developers for other purposes. You only need to record cryptographic verifications on your blockchain network, while storing the actual components of the AI off-chain. The code that comes with Blockchain Tethered AI can help you to visualize and implement this architecture.

This blockchain verification, which works similar to a checksum you might see when downloading a file, can be checked against the model and its components at any point to see if they have undergone any tampering. This type of use of a blockchain network doesn’t involve cryptocurrency or miners or use any unusually high amounts of energy to run, and should be thought of instead as a distributed text-based super-log that is automated by smart contracts.

Being able to track and trace goods in this way helps prevent sales of counterfeit goods, helps food companies to recall items quickly without having to throw everything away, and helps artists, musicians, and content creators be paid for their work. When applying these techniques and controls to make AI’s opaque box explainable, your AI models will enjoy the competitive advantage of being trackable, traceable, controllable, and even stoppable.

Categories: Technology

The Alignment Problem Is Not New

O'Reilly Radar - Thu, 2023/06/15 - 12:35

“Mitigating the risk of extinction from A.I. should be a global priority alongside other societal-scale risks, such as pandemics and nuclear war,” according to a statement signed by more than 350 business and technical leaders, including the developers of today’s most important AI platforms.

Among the possible risks leading to that outcome is what is known as “the alignment problem.” Will a future super-intelligent AI share human values, or might it consider us an obstacle to fulfilling its own goals? And even if AI is still subject to our wishes, might its creators—or its users—make an ill-considered wish whose consequences turn out to be catastrophic, like the wish of fabled King Midas that everything he touches turn to gold? Oxford philosopher Nick Bostrom, author of the book Superintelligence, once posited as a thought experiment an AI-managed factory given the command to optimize the production of paperclips. The “paperclip maximizer” comes to monopolize the world’s resources and eventually decides that humans are in the way of its master objective.

Far-fetched as that sounds, the alignment problem is not just a far future consideration. We have already created a race of paperclip maximizers. Science fiction writer Charlie Stross has noted that today’s corporations can be thought of as “slow AIs.” And much as Bostrom feared, we have given them an overriding command: to increase corporate profits and shareholder value. The consequences, like those of Midas’s touch, aren’t pretty. Humans are seen as a cost to be eliminated. Efficiency, not human flourishing, is maximized.

In pursuit of this overriding goal, our fossil fuel companies continue to deny climate change and hinder attempts to switch to alternative energy sources, drug companies peddle opioids, and food companies encourage obesity. Even once-idealistic internet companies have been unable to resist the master objective, and in pursuing it have created addictive products of their own, sown disinformation and division, and resisted attempts to restrain their behavior.

Even if this analogy seems far fetched to you, it should give you pause when you think about the problems of AI governance.

Corporations are nominally under human control, with human executives and governing boards responsible for strategic direction and decision-making. Humans are “in the loop,” and generally speaking, they make efforts to restrain the machine, but as the examples above show, they often fail, with disastrous results. The efforts at human control are hobbled because we have given the humans the same reward function as the machine they are asked to govern: we compensate executives, board members, and other key employees with options to profit richly from the stock whose value the corporation is tasked with maximizing. Attempts to add environmental, social, and governance (ESG) constraints have had only limited impact. As long as the master objective remains in place, ESG too often remains something of an afterthought.

Much as we fear a superintelligent AI might do, our corporations resist oversight and regulation. Purdue Pharma successfully lobbied regulators to limit the risk warnings planned for doctors prescribing Oxycontin and marketed this dangerous drug as non-addictive. While Purdue eventually paid a price for its misdeeds, the damage had largely been done and the opioid epidemic rages unabated.

What might we learn about AI regulation from failures of corporate governance?

  1. AIs are created, owned, and managed by corporations, and will inherit their objectives. Unless we change corporate objectives to embrace human flourishing, we have little hope of building AI that will do so.
  2. We need research on how best to train AI models to satisfy multiple, sometimes conflicting goals rather than optimizing for a single goal. ESG-style concerns can’t be an add-on, but must be intrinsic to what AI developers call the reward function. As Microsoft CEO Satya Nadella once said to me, “We [humans] don’t optimize. We satisfice.” (This idea goes back to Herbert Simon’s 1956 book Administrative Behavior.) In a satisficing framework, an overriding goal may be treated as a constraint, but multiple goals are always in play. As I once described this theory of constraints, “Money in a business is like gas in your car. You need to pay attention so you don’t end up on the side of the road. But your trip is not a tour of gas stations.” Profit should be an instrumental goal, not a goal in and of itself. And as to our actual goals, Satya put it well in our conversation: “the moral philosophy that guides us is everything.”
  3. Governance is not a “once and done” exercise. It requires constant vigilance, and adaptation to new circumstances at the speed at which those circumstances change. You have only to look at the slow response of bank regulators to the rise of CDOs and other mortgage-backed derivatives in the runup to the 2009 financial crisis to understand that time is of the essence.

OpenAI CEO Sam Altman has begged for government regulation, but tellingly, has suggested that such regulation apply only to future, more powerful versions of AI. This is a mistake. There is much that can be done right now.

We should require registration of all AI models above a certain level of power, much as we require corporate registration. And we should define current best practices in the management of AI systems and make them mandatory, subject to regular, consistent disclosures and auditing, much as we require public companies to regularly disclose their financials.

The work that Timnit Gebru, Margaret Mitchell, and their coauthors have done on the disclosure of training data (“Datasheets for Datasets”) and the performance characteristics and risks of trained AI models (“Model Cards for Model Reporting”) are a good first draft of something much like the Generally Accepted Accounting Principles (and their equivalent in other countries) that guide US financial reporting. Might we call them “Generally Accepted AI Management Principles”?

It’s essential that these principles be created in close cooperation with the creators of AI systems, so that they reflect actual best practice rather than a set of rules imposed from without by regulators and advocates. But they can’t be developed solely by the tech companies themselves. In his book Voices in the Code, James G. Robinson (now Director of Policy for OpenAI) points out that every algorithm makes moral choices, and explains why those choices must be hammered out in a participatory and accountable process. There is no perfectly efficient algorithm that gets everything right. Listening to the voices of those affected can radically change our understanding of the outcomes we are seeking.

But there’s another factor too. OpenAI has said that “Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent.” Yet many of the world’s ills are the result of the difference between stated human values and the intent expressed by actual human choices and actions. Justice, fairness, equity, respect for truth, and long-term thinking are all in short supply. An AI model such as GPT4 has been trained on a vast corpus of human speech, a record of humanity’s thoughts and feelings. It is a mirror. The biases that we see there are our own. We need to look deeply into that mirror, and if we don’t like what we see, we need to change ourselves, not just adjust the mirror so it shows us a more pleasing picture!

To be sure, we don’t want AI models to be spouting hatred and misinformation, but simply fixing the output is insufficient. We have to reconsider the input—both in the training data and in the prompting. The quest for effective AI governance is an opportunity to interrogate our values and to remake our society in line with the values we choose. The design of an AI that will not destroy us may be the very thing that saves us in the end.

Categories: Technology

You Can’t Regulate What You Don’t Understand

O'Reilly Radar - Thu, 2023/06/15 - 12:25

The world changed on November 30, 2022 as surely as it did on August 12, 1908 when the first Model T left the Ford assembly line. That was the date when OpenAI released ChatGPT, the day that AI emerged from research labs into an unsuspecting world. Within two months, ChatGPT had over a hundred million users—faster adoption than any technology in history.

The hand wringing soon began. Most notably, The Future of Life Institute published an open letter calling for an immediate pause in advanced AI research, asking: “Should we let machines flood our information channels with propaganda and untruth? Should we automate away all the jobs, including the fulfilling ones? Should we develop nonhuman minds that might eventually outnumber, outsmart, obsolete and replace us? Should we risk loss of control of our civilization?”

In response, the Association for the Advancement of Artificial Intelligence published its own letter citing the many positive differences that AI is already making in our lives and noting existing efforts to improve AI safety and to understand its impacts. Indeed, there are important ongoing gatherings about AI regulation like the Partnership on AI’s recent convening on Responsible Generative AI, which happened just this past week. The UK has already announced its intention to regulate AI, albeit with a light, “pro-innovation” touch. In the US, Senate Minority Leader Charles Schumer has announced plans to introduce “a framework that outlines a new regulatory regime” for AI. The EU is sure to follow, in the worst case leading to a patchwork of conflicting regulations.

All of these efforts reflect the general consensus that regulations should address issues like data privacy and ownership, bias and fairness, transparency, accountability, and standards. OpenAI’s own AI safety and responsibility guidelines cite those same goals, but in addition call out what many people consider the central, most general question: how do we align AI-based decisions with human values? They write:

“AI systems are becoming a part of everyday life. The key is to ensure that these machines are aligned with human intentions and values.”

But whose human values? Those of the benevolent idealists that most AI critics aspire to be? Those of a public company bound to put shareholder value ahead of customers, suppliers, and society as a whole? Those of criminals or rogue states bent on causing harm to others? Those of someone well meaning who, like Aladdin, expresses an ill-considered wish to an all-powerful AI genie?

There is no simple way to solve the alignment problem. But alignment will be impossible without robust institutions for disclosure and auditing. If we want prosocial outcomes, we need to design and report on the metrics that explicitly aim for those outcomes and measure the extent to which they have been achieved. That is a crucial first step, and we should take it immediately. These systems are still very much under human control. For now, at least, they do what they are told, and when the results don’t match expectations, their training is quickly improved. What we need to know is what they are being told.

What should be disclosed? There is an important lesson for both companies and regulators in the rules by which corporations—which science-fiction writer Charlie Stross has memorably called “slow AIs”—are regulated. One way we hold companies accountable is by requiring them to share their financial results compliant with Generally Accepted Accounting Principles or the International Financial Reporting Standards. If every company had a different way of reporting its finances, it would be impossible to regulate them.

Today, we have dozens of organizations that publish AI principles, but they provide little detailed guidance. They all say things like  “Maintain user privacy” and “Avoid unfair bias” but they don’t say exactly under what circumstances companies gather facial images from surveillance cameras, and what they do if there is a disparity in accuracy by skin color. Today, when disclosures happen, they are haphazard and inconsistent, sometimes appearing in research papers, sometimes in earnings calls, and sometimes from whistleblowers. It is almost impossible to compare what is being done now with what was done in the past or what might be done in the future. Companies cite user privacy concerns, trade secrets, the complexity of the system, and various other reasons for limiting disclosures. Instead, they provide only general assurances about their commitment to safe and responsible AI. This is unacceptable.

Imagine, for a moment, if the standards that guide financial reporting simply said that companies must accurately reflect their true financial condition without specifying in detail what that reporting must cover and what “true financial condition” means. Instead, independent standards bodies such as the Financial Accounting Standards Board, which created and oversees GAAP, specify those things in excruciating detail. Regulatory agencies such as the Securities and Exchange Commission then require public companies to file reports according to GAAP, and auditing firms are hired to review and attest to the accuracy of those reports.

So too with AI safety. What we need is something equivalent to GAAP for AI and algorithmic systems more generally. Might we call it the Generally Accepted AI Principles? We need an independent standards body to oversee the standards, regulatory agencies equivalent to the SEC and ESMA to enforce them, and an ecosystem of auditors that is empowered to dig in and make sure that companies and their products are making accurate disclosures.

But if we are to create GAAP for AI, there is a lesson to be learned from the evolution of GAAP itself. The systems of accounting that we take for granted today and use to hold companies accountable were originally developed by medieval merchants for their own use. They were not imposed from without, but were adopted because they allowed merchants to track and manage their own trading ventures. They are universally used by businesses today for the same reason.

So, what better place to start with developing regulations for AI than with the management and control frameworks used by the companies that are developing and deploying advanced AI systems?

The creators of generative AI systems and Large Language Models already have tools for monitoring, modifying, and optimizing them. Techniques such as RLHF (“Reinforcement Learning from Human Feedback”) are used to train models to avoid bias, hate speech, and other forms of bad behavior. The companies are collecting massive amounts of data on how people use these systems. And they are stress testing and “red teaming” them to uncover vulnerabilities. They are post-processing the output, building safety layers, and have begun to harden their systems against “adversarial prompting” and other attempts to subvert the controls they have put in place. But exactly how this stress testing, post processing, and hardening works—or doesn’t—is mostly invisible to regulators.

Regulators should start by formalizing and requiring detailed disclosure about the measurement and control methods already used by those developing and operating advanced AI systems.

In the absence of operational detail from those who actually create and manage advanced AI systems, we run the risk that regulators and advocacy groups  “hallucinate” much like Large Language Models do, and fill the gaps in their knowledge with seemingly plausible but impractical ideas.

Companies creating advanced AI should work together to formulate a comprehensive set of operating metrics that can be reported regularly and consistently to regulators and the public, as well as a process for updating those metrics as new best practices emerge.

What we need is an ongoing process by which the creators of AI models fully, regularly, and consistently disclose the metrics that they themselves use to manage and improve their services and to prohibit misuse. Then, as best practices are developed, we need regulators to formalize and require them, much as accounting regulations have formalized  the tools that companies already used to manage, control, and improve their finances. It’s not always comfortable to disclose your numbers, but mandated disclosures have proven to be a powerful tool for making sure that companies are actually following best practices.

It is in the interests of the companies developing advanced AI to disclose the methods by which they control AI and the metrics they use to measure success, and to work with their peers on standards for this disclosure. Like the regular financial reporting required of corporations, this reporting must be regular and consistent. But unlike financial disclosures, which are generally mandated only for publicly traded companies, we likely need AI disclosure requirements to apply to much smaller companies as well.

Disclosures should not be limited to the quarterly and annual reports required in finance. For example, AI safety researcher Heather Frase has argued that “a public ledger should be created to report incidents arising from large language models, similar to cyber security or consumer fraud reporting systems.” There should also be dynamic information sharing such as is found in anti-spam systems.

It might also be worthwhile to enable testing by an outside lab to confirm that best practices are being met and what to do when they are not. One interesting historical parallel for product testing may be found in the certification of fire safety and electrical devices by an outside non-profit auditor, Underwriter’s Laboratory. UL certification is not required, but it is widely adopted because it increases consumer trust.

This is not to say that there may not be regulatory imperatives for cutting-edge AI technologies that are outside the existing management frameworks for these systems. Some systems and use cases are riskier than others. National security considerations are a good example. Especially with small LLMs that can be run on a laptop, there is a risk of an irreversible and uncontrollable proliferation of technologies that are still poorly understood. This is what Jeff Bezos has referred to as a “one way door,” a decision that, once made, is very hard to undo. One way decisions require far deeper consideration, and may require regulation from without that runs ahead of existing industry practices.

Furthermore, as Peter Norvig of the Stanford Institute for Human Centered AI noted in a review of a draft of this piece, “We think of ‘Human-Centered AI’ as having three spheres: the user (e.g., for a release-on-bail recommendation system, the user is the judge); the stakeholders (e.g., the accused and their family, plus the victim and family of past or potential future crime); the society at large (e.g. as affected by mass incarceration).”

Princeton computer science professor Arvind Narayanan has noted that these systemic harms to society that transcend the harms to individuals require a much longer term view and broader schemes of measurement than those typically carried out inside corporations. But despite the prognostications of groups such as the Future of Life Institute, which penned the AI Pause letter, it is usually difficult to anticipate these harms in advance. Would an “assembly line pause” in 1908 have led us to anticipate the massive social changes that 20th century industrial production was about to unleash on the world? Would such a pause have made us better or worse off?

Given the radical uncertainty about the progress and impact of AI, we are better served by mandating transparency and building institutions for enforcing accountability than we are in trying to head off every imagined particular harm.

We shouldn’t wait to regulate these systems until they have run amok. But nor should regulators overreact to AI alarmism in the press. Regulations should first focus on disclosure of current monitoring and best practices. In that way, companies, regulators, and guardians of the public interest can learn together how these systems work, how best they can be managed, and what the systemic risks really might be.

Categories: Technology

ChatGPT, Now with Plugins

O'Reilly Radar - Tue, 2023/06/13 - 03:30

A few months ago, I wrote about some experiments with prime numbers. I generated a 16-digit non-prime number by multiplying two 8-digit prime numbers, and asked ChatGPT (using GPT -3.5) whether the larger number was prime. It answered correctly that the number was non-prime, but when it told me the number’s prime factors, it was clearly wrong. It also generated a short program that implemented the widely used Miller-Rabin primality test. After fixing some obvious errors, I ran the program–and while it told me (correctly) that my number was non-prime, when compared to a known good implementation of Miller-Rabin, ChatGPT’s code made many mistakes. When it became available, GPT-4 gave me similar results. And the result itself–well, that could have been a good guess. There’s a roughly a 97% chance that a randomly chosen 16-digit number will be non-prime.

OpenAI recently opened their long-awaited Plugins feature to users of ChatGPT Plus (the paid version) using the GPT-4 model. One of the first plugins was from Wolfram, the makers of Mathematica and Wolfram Alpha. I had to try this! Specifically, I was compelled to re-try my prime test. And everything worked: ChatGPT sent the problem to Wolfram, it determined that number was not prime, and gave me the correct prime factors. It didn’t generate any code, but provided a link to the Wolfram Alpha result page that described how to test for primality. The process of going through ChatGPT to Wolfram and back was also painfully slow, much slower than using Wolfram Alpha directly or writing a few lines of Python. But it worked and, for fans of prime numbers, that’s a plus.

I was still uncomfortable. How does ChatGPT decide what to offload to Wolfram Alpha, and what to handle on its own? I tried a few questions from calculus; unsurprisingly, they went to Wolfram. Then I got really simple: “How much is 3 + 5?”  No Wolfram, and I wasn’t surprised when ChatGPT told me the answer was 8. But that begged the question: what about more complex arithmetic? So I asked “How much is 123456789 + 98776543321?”, a problem that could be solved by any elementary school student who has learned how to carry. Again, no Wolfram, but this time, the answer was incorrect.

We’ve long known that ChatGPT was poor at arithmetic, in addition to being poor at more advanced math. The Wolfram plugin solves the math problem with ease. However, ChatGPT is still poor at arithmetic, and still attempts to do arithmetic on its own. The important question that I can’t answer is “when does a problem become complex enough to send to the plugin?” The plugin is a big win, but not an unqualified one.

ChatGPT’s tendency to make up citations is another well-known problem. A few weeks ago, a story circulated about a lawyer who used ChatGPT to write a brief. ChatGPT cited a lot of case law, but made up all the citations. When a judge asked him to produce the actual case law, the lawyer went back to ChatGPT–which obediently made up the cases themselves. The judge was not pleased. That raises another question: ChatGPT has always been prone to making up citations–but now there’s a plugin for that! The ScholarAI plugin searches academic databases for citations, and returns links. That wouldn’t have helped this lawyer (I don’t yet see plugins from Westlaw or LexisNexis), but it’s worth asking: what about citations?

I first tried asking a medical question. I’m not a doctor, so the question was simple: what’s the latest research on antibiotic-resistant bacteria? ChatGPT sent the question to ScholarAI, and I got back a long list of relevant citations. (The plugin appeared to get into a loop, so I eventually terminated the output.) While I’m not competent to evaluate the quality or relevance of the papers, all the links were valid: the papers were real, and the author names were correct. No hallucinations here.

I followed up with some questions about English literature (I have a PhD, so I can make up real questions). I didn’t get as many citations in return, possibly because we don’t have preprint servers like ArXiv, and have done little to protest journals’ proprietary lock on scholarship. However, the citations I got were valid: real books and articles, with the authors listed correctly.

That begged another question, though. A list of articles is certainly useful, but you still have to read them all to write the paper. Could ChatGPT write an essay for me?  I asked it to write about colonialism in the work of Salman Rushdie, and got a passable short essay. It is what I’d call a “lazy” prompt: what I’d expect from a student who was interested in getting out of work, rather than using the AI to learn. There were citations, and they were real; ChatGPT didn’t link to the publications cited, but Google made it easy to find them. The resulting essay didn’t demonstrate any familiarity with the articles beyond the abstract–fair enough, since for most of the sources, the abstract was all that was publicly available. More to the point, the article didn’t really make any connections to Rushdie’s fiction. There were many sentences like this: “Hamish Dalley discusses the role of the historical novel in postcolonial writing, a genre to which many of Rushdie’s works belong.” True, but that doesn’t say much about either Rushdie’s work or Dalley’s. As I said, the essay was passable, but if I had to grade it, the student who turned it in wouldn’t have been happy. Still, ChatGPT and ScholarAI get credit for doing a decent literature search that could be the basis for an excellent paper. And if a student took this initial prompt, read the academic articles along with Rushdie’s novels, and used that to write a more detailed prompt telling ChatGPT exactly what points he wanted to make, with relevant quotations, the result could have been excellent. An essay isn’t an exercise in providing N*1000 words; it’s the outcome of a thought process that involves engaging with the subject matter. If ChatGPT and ScholarAI facilitate that engagement, I wouldn’t object. But let’s be clear: regardless of who generates the words, ChatGPT’s users still have to do the reading and thinking.

As with the Wolfram plugin, it’s helpful to understand when ChatGPT is using ScholarAI, and when it isn’t. I asked ChatGPT to find articles by me; when using the plugin, it couldn’t find any, although it apologetically gave me a list of articles whose authors had the first name Michael. The sad list of Michael-authored articles notwithstanding, I’ll count that response as “correct.” I haven’t published any academic papers, though I have published a lot on O’Reilly Radar–material that any web search can find, without the need for AI or the risk of hallucination.

If you dig a bit deeper, the results are puzzling. If you use ChatGPT with plugins enabled and write a prompt that tells it not to use the plugin, it comes up empty, but suggests that you research online databases like Google Scholar. If you start a new conversation and do not enable plugins (plugins can only be enabled or disabled at the start of a conversation), you still get nothing–but ChatGPT does tell you that Michael Loukides is a well-known author who has frequently written for O’Reilly, and to check on the O’Reilly website for articles. (It isn’t clear whether these different responses have to do with the state of the plugin, or the way ChatGPT randomizes its output.) Flattery will get you somewhere, I suppose, but not very far. My publication history with O’Reilly goes back to the 1990s, and is all public; it’s not clear why ChatGPT is unaware of it. Starting a new conversation with Bing searches enabled got me a list of valid links to articles that I’ve written–but I shouldn’t have had to try three times, the process was much slower than searching with Bing (or Google) directly, and it wasn’t clear why some articles were included and some weren’t. And you really do have to try multiple times: you can’t use both Bing searches and plugins in the same conversation.

As with the Wolfram plugin, ScholarAI is a big improvement–but again, not an unqualified one. You still have to know whether the content you’re looking for is in an academic journal, on the web, or somewhere else. While ChatGPT tells you when it is using a plugin, and which plugin it is using, you can’t always predict what it will do in advance–and when it doesn’t use a plugin, ChatGPT is vulnerable to the same errors we’ve come to expect. You still have to experiment, and you still have to check the results.

As another test, I used the Kayak plugin to check out flights for some trips I might take. The plugin does a good job with major airports (including smaller ones), though it seemed to be hit-or-miss with very small airports, like New Haven (HVN). That’s a limitation of Kayak, rather than the plugin itself or ChatGPT. You currently have to enable the plugins you’re going to use at the start of each conversation, and ChatGPT doesn’t allow you to enable competing plugins. You can install both Kayak and Expedia, but you can only use one in any chat. I wouldn’t be surprised if this behavior changes as plugins mature.

Finally: all the plugins I installed were free of charge. However, I don’t think it’s called the “plugin store” for nothing. It wouldn’t surprise me to see charges for plugins, and I would be surprised if some plugins eventually require a subscription to a paid account. A number of the plugins access subscription-based services; I expect that subscriptions will be required once we are out of the Beta period.

I’m excited that plugins have finally arrived. Plugins are still in beta, so their behavior will almost certainly change; the behaviors I’ve described may have changed by the time you read this. Several changed while I was writing this article. Plugins certainly don’t eliminate the need to be careful about hallucinations and other kinds of errors, nor do they replace the need for thinking. But it’s hard to understate how important it is that ChatGPT can now reach out and access current data. When ChatGPT was limited to data before November 2021, it was an intriguing toy. It’s looking more and more like a tool.

Categories: Technology

Radar Trends to Watch: June 2023

O'Reilly Radar - Tue, 2023/06/06 - 02:53

AI, and specifically large language models, continue to dominate the news–so much so that it’s no longer a well-defined topic, with clear boundaries. AI has infiltrated programming, security, and virtually every branch of technology.

But that’s hardly news. By the time you read this, Apple may have announced their ultra-expensive, ultra-stylish AR headset; that may be enough to break the Metaverse out of its AR/VR winter. Or maybe not. If Apple can’t make technology into a fashion statement, no one can. And Rust has forked, spawning a new programming language called Crab. Will this slow Rust’s momentum? It is very hard to say. Many projects have forked and few forks thrive, but there are exceptions.

AI
  • LMSYS ORG (Large Model Systems Organization), a research cooperative between Berkeley, UCSD, and CMU, has released ELO ratings of large language models, based on a competitive analysis. Not surprisingly, GPT 4 is the leader. PaLM 2 is included, but not the larger LLaMA models.
  • OpenAI has added plug-ins (including web search) to its ChatGPT Plus product. Unsurprisingly, Microsoft is adding plugins to its AI services, using the same API.
  • A new AI stack is emerging, using LLMs as endpoints and vector stores for local data. To answer a query, relevant data is found in the vector store and used to build a prompt for the LLM.
  • TechTalks has a good explanation of LoRA (Low Rank Adaptation), a technique for fine-tuning large language models that is less time- and processor-intensive.
  • Langchain: The Missing Manual has been published online by the makers of PineconeDB, a vector database that is frequently used with Langchain to build complex applications on top of  large language models. Chapters are being added as they are completed.
  • The importance of user interface design for AI has never been properly appreciated. Fitting a language model into a chatbot is easy, and made it possible for millions to use them. But chatbots aren’t really a good user interface.
  • Vector databases are a relatively new kind of database that work well with large language models and other AI systems. They can be used to augment a model’s “knowledge” by adding additional documents.
  • Google has announced Codey, a code generation model similar to Codex. Codey will be accessible through Visual Studio, Jet Brains, and other IDEs, in addition to Google Cloud products such as Vertex. They have also announced new models for image and music generation. These models are currently in restricted beta.
  • Mosaic has released MPT-7B, an open-source family of large language models that allows commercial use. There are three variants of the base model that have been specialized for chat, writing long stories, and generating instruction. MPT-7B demonstrates the MosaicML platform, a commercial service for training language models.
  • Now that so many people are using APIs and tools like AutoGPT to build applications on top of AI models, Simon Willison’s explanation of prompt injection, which shows why it is an attack against the applications rather than the models themselves, is a must-read (or see).
  • OpenLLaMA is yet another language model based on Facebook’s LLaMA. OpenLLaMA is completely open source; it was trained on the open source RedPajama dataset, allowing it to avoid the licensing restrictions attached to LLaMA and its descendants.
  • A new study has shown that fMRI images of the brain can be used to decode sentences that the person was hearing. This is the first time that noninvasive techniques have succeeded in decoding linguistic activity.
  • It had to happen. Chirper is a social network for AI. No humans allowed. Though you can observe. And create your own chatbots.
  • MLC LLM, from developers of Web LLM, allows many different combinations of hardware and operating systems to run small large language models entirely locally. It supports iPhones, Windows, Linux, MacOS, and web browsers.
  • DeepFloyd IF is a new generative art (text-to-image) model developed by Stability.AI. It is available from HuggingFace.
  • Lamini is a service for customizing large language models. They support multiple foundation models, provide a data generator, and have APIs for prompt-tuning and RLHF.
  • Edward Tian, cofounder of GPTZero, has said that GPTZero will never be about detecting plagiarism. Their goal is understanding and supporting students who are learning how to use these models.
  • The next step in making AI more trustworthy might be developing language models that respond to prompts by asking questions first, rather than generating answers. Doing so encourages human users to think critically, rather than simply accepting the AI’s output.
Data
  • OpenSafely is an open source platform that allows researchers to access electronic health records securely and transparently. The records never leave the repositories in which they are held. All activity on the platform is logged, and all projects are visible to the public.
  • The Solid project is developing a specification for decentralized data storage. Data is stored in pods, which are analogous to secure personal web servers.
  • The Kinetica database has integrated natural language queries with ChatGPT. We will see many more products like this.
Programming
  • We expect many companies to follow Honeycomb by using ChatGPT to incorporate natural language queries into their user interface. However, the path to doing so isn’t as straightforward or simple as you might think. What are the problems nobody talks about?
  • I have been avoiding all the Rust drama. But sometimes drama is unavoidable. Rust has been forked; the new language is named Crab; and we will all see what the future holds.
  • Can you write Python like it’s Rust? While Python will always be Python, some Rust-like practices will make your code more safe.
  • To improve software supply chain security, the Python Package Index (PyPI), which is the registry for open source Python packages, now requires two factor authentication from all publishers. PyPI has been plagued with malware submissions, account takeovers, and other security issues.
  • It’s worth taking a look at the map of GitHub. Is your favorite project in GPTNation? Or JavaLandia? Or Gamedonia? If you zoom in, you can see how individual projects cluster, along with the connections between them.
  • Julia Evans’ (@b0rk’s) guide to implementing DNS in a weekend is a must-read for anyone who wants to understand network programming at a low level.
  • Codon is a new Python compiler that generates code with much higher performance than interpreted Python (CPython). It doesn’t implement all of Python 10’s features, and it was designed specifically for bioinformatics workloads, so it may not perform well in other applications. Still, if Python performance is an issue, it’s worth trying.
  • GitHub Code Search is finally out of beta and available to the general public. Code Search is not AI; it’s a traditional search, with regular expressions, across all of GitHub. By itself, that is extremely powerful.
  • GitLab has partnered with Google to add AI features to their platform. This includes a facility to detect and explain vulnerabilities, in addition to the ability to customize and build upon Google’s foundation models.
  • Another new programming language? Mojo is very closely related to Python (the language syntax is identical to Python), but it’s a compiled language that’s designed for high performance.
  • Learn Python with Jupyter represents an interactive approach to learning Python. Additional chapters are being released every few weeks.
  • It’s not just Linux. The Windows 11 kernel will soon include code written in Rust.
  • The Prossimo project is improving memory safety on the Internet by rewriting important Unix/Linux infrastructure components using Rust. Their latest initiative is rewriting the ubiquitous superuser commands, sudo and su. NTP, DNS, and TLS are also on the list.
Security
  • Data poisoning is an effective attack on large language models. And, given that future search engines will be based on LLMs, black hat SEO will be targeting those models. It is far from clear that OpenAI, Google, and Microsoft have any effective defense against these attacks.
  • Amazon has open sourced two security tools developed for AWS: Cedar and Snapchange. Cedar is a language and API that allows users to write and enforce policy permissions. Snapchange is still experimental; it uses fuzzing to help find vulnerabilities in software.
  • Microsoft’s cloud services are cracking password protection on .zip files (a relatively easy task) to scan the contents for malware. Privacy issues aside, this is a problem for legitimate security researchers.
  • The FBI was able to use a vulnerability in Russia’s widespread Snake/Uroburos malware to disable it. This story is fascinating. Even exploits have exploits.
  • This year, the hacking village at DEF CON will have language models from all the major players (OpenAI, Google, Microsoft, Stability AI, Microsoft, NVIDIA, and HuggingFace) for attendees to attack.
Web
  • Bad user interface design: is it a meme, a game, a joke, a contest, satire, or all of the above? Sometimes it’s just fun to see how bad a volume control you can make. And sometimes, that puts you back in touch with reality. Not everything needs to be reinvented. Here’s more.
  • Nodepad is a web application for brainstorming, note taking, and exploring ideas using large language models.
  • Google has announced that image search results will include information about the image’s source, where else it has appeared, and whether it is known to be generated by AI. Images generated by Google’s AI tools will include metadata stating the image’s origin. Other image publishers will display similar information.
  • The Pudding provides a taxonomy of dark patterns: unethical tricks that companies use to prevent you from canceling online subscriptions.
  • Bluesky has opened their Twitter-like social network for a private beta, and has attracted many users away from Twitter. Bluesky competes directly with Mastodon, and has led to Mastodon streamlining their signup process.
Operations AR/VR
  • Apple is expected to announce their long-awaited augmented reality headset at WWDC this week (maybe even before this piece publishes). That will be a “make it or break it” event for AR and VR; if anyone can make wearing a headset fashionable, cool, and expensive, Apple can.
  • Who needs a screen? The Spacetop is a new laptop that uses AR goggles instead of a screen; the screen appears to float in space in front of the user.
Quantum Computing
  • Researchers at the University of Chicago claim to have developed “noise-canceling qubits” that reduce the probability of error when reading a qubit’s state. If their prototype stands up to further testing, this could make building quantum computers that are capable of real work much easier.
  • Quantum computers need to scale. IBM has announced a project to build a 100,000 Qubit quantum computer within 10 years. Depending on error correction, this still probably isn’t large enough to do real work, but it’s getting close.
  • Do you want your own quantum computer? SpinQ has created a quantum processor based on MRI technology, which was used for the first quantum demonstrations. Their Gemini Mini has two qubits, fits on a desktop, and costs $8,700. A high-end device ($58,000) has a third qubit. These processors may be useful for experimentation, but are far too small for useful work.
Biology Hardware
  • Researchers have made an edible battery. It contains no toxic materials, unlike most battery technologies. Exactly why you would eat a battery is a mystery; they offer some use cases, of which the most plausible is smart implants.
  • Apple and Google have announced a proposed standard that would prevent the abuse of location tracking devices like AirTag.
Categories: Technology
Subscribe to LuftHans aggregator