The AI Tidal Wave has Arrived (2022)

08 Sept 2022

City Plaza at Dusk, generated by 'Stable' AI on NightCafe

I think of the emergence of AI a bit like a tidal wave: in the beginning, it just seems like the landscape is changing, and where there used to be water, now you can walk and marvel at everything that's being uncovered as the water recedes.

The question of where the water went inevitably comes up, and is answered before folks are really prepared, with their homes and businesses destroyed by the huge amount of energy that the wave has unlocked.

In the case of AI, there have been several smaller waves that seem to be building momentum:

  • Prose: OpenAI's GPT-3 produces human-like prose from a simple (one-sentence) prompt. It's trained on a huge corpus of text from the internet.
  • Voice: Baidu's Deep Voice can reproduce a voice with incredible fidelity using ~5 minutes of audio (3 seconds per clip, 100 clips, as per Vice's article documenting their findings).
  • Code: GitHub's Copilot integrates with Microsoft's Visual Studio Code editor (MS owns GitHub, so this is not surprising) to automatically insert code that does what the AI thinks the programmer is trying to program. It picks up on context like comments and the function name to complete the entire function. This is not a particularly reliable way to generate correct code, but it gives the programmer a starting point that can accelerate development.
  • Images: Google's Imagen, OpenAI's DALL·E 2, and Stability.ai's Stable Diffusion are all diffusion-based models that generate images from a text prompt, using the same sort of interface that GPT-3 uses, except to generate images instead of prose.

For now, these seem like novelties, much the way water receding seems like a neat curiosity at first, before the wave. But the wave is coming. It's going to have a big impact across multiple dimensions. Here are my guesses.

Dramatically Lowered Costs for Creative Work

Economies grow around order-of-magnitude costs for generating necessary goods. When a good steadily decreases in price over time, it creates a huge engine for new economies to be built on top of it. The biggest examples over my lifetime are the development of the personal computer and internet access. Both were luxuries when I was born, and both are not only commonplace now, but expected in many areas of the world.

Utopian Cityscape, created in 10 minutes or so

The cost of creating images has been steadily decreasing for hundreds of years. Years of training and excellent materials and assistance could yield a realistic picture of someone 500 years ago. Later, art supplies became less expensive, and could be taken up as a hobby, leading to more artwork. Then the camera came along, which took prices down an order of magnitude. A century or so later, digital cameras and photo editing took it down another order of magnitude. Today, 20 years after that, we can generate images by the dozen using only text prompts and an AI model, leading to another order of magnitude decrease in cost. A skilled artist could produce the image above in several hours, but it took my son and me about 10 minutes using Stable Diffusion. In the same way photography and CGI changed how we think about creating images, so will models like Stable Diffusion.

It's pretty awesome that this enables me to print artwork that I generate myself using the AI. AI-generated art is currently lower cost and less IP-encumbered than the human-created equivalent, and is likely to remain so. This lowers costs in industries that require art on-demand and opens the door to more artwork in places where it would previously be excluded due to budget (online publishing is a good example). But these shifts are a small part of the impact this technology will have, I think. The real feature of this technology is reach, making artistic tools available to more people than ever before, which in turn creates new business opportunities.

For example, the images I've included in this blog post were not generated using models on a computer I own, but rather using a service to render prompts with various models, allowing me to chain a text-based image generation model with an img2img model that adds new elements. I then send that result to an ESRGAN-based model that upscales the most promising image (after I choose it). This service is a new kind of business that empowers regular folks to produce professional quality artwork for less than the cost of a Starbucks coffee. This is just the beginning of an incoming wave of businesses that will spring up around the ability to generate, manipulate, and enhance artwork on-demand, with results delivered in times measured in seconds. Open source equivalents exist, like neonsecret's stable-diffusion-webui, which supports similar workflows, though I haven't yet used it myself.
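
To make the chaining idea concrete, here is a minimal Python sketch of that workflow: a text-to-image step, an img2img step, and an upscaling step run one after another. Everything in it is an assumption for illustration. The Image class, the stage functions, and run_pipeline are hypothetical stand-ins that only record what would have happened; none of them call a real model or reflect how NightCafe works internally.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence


@dataclass(frozen=True)
class Image:
    """Hypothetical stand-in for pixel data: a size plus a log of steps."""
    width: int
    height: int
    history: tuple = ()


# A stage is any function that takes an image and returns a new one.
Stage = Callable[[Image], Image]


def text_to_image(prompt: str, size: int = 512) -> Image:
    # Placeholder for a text-to-image model call (assumed, not a real API).
    return Image(size, size, (f"txt2img: {prompt}",))


def img2img(prompt: str) -> Stage:
    # Placeholder for an img2img model that adds new elements to an image.
    def stage(image: Image) -> Image:
        return replace(image, history=image.history + (f"img2img: {prompt}",))
    return stage


def upscale(factor: int) -> Stage:
    # Placeholder for an upscaler; only the recorded dimensions change here.
    def stage(image: Image) -> Image:
        return Image(
            image.width * factor,
            image.height * factor,
            image.history + (f"upscale x{factor}",),
        )
    return stage


def run_pipeline(image: Image, stages: Sequence[Stage]) -> Image:
    # Feed the output of each stage into the next one, in order.
    for stage in stages:
        image = stage(image)
    return image


draft = text_to_image("utopian cityscape at dusk")
final = run_pipeline(draft, [img2img("add airships"), upscale(4)])
print(final.width, final.height)
for step in final.history:
    print(step)
```

Because every stage has the same shape (image in, image out), swapping one model for another or adding a step is a one-line change to the list passed to run_pipeline.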

The proliferation of art will likely drive a boost to printing services, since one significant use of artwork is printing it and displaying it. Not all artwork will be printed, though, and it would make sense to see digital greeting cards start to incorporate "on-demand" art for cards as well, giving notes a personal, unique touch they don't have today. In both cases, commissioning art seems like a luxury, but generating it in a semi-automated fashion seems much less so.

The models generating the art change over time, which suggests there may be a business opportunity in model management. Models today are "baked" when you get them, meaning that unlike the human mind, models don't learn as they are used. There will be a need to retrain on updated data to generate updated models, as well as provide archives of old versions of models. This in turn could create an industry around AI archiving. Niche sites could pop up that compare the results of various queries across AI models over time, perhaps providing insight into how changing content on the internet reflected societal change. This service will continue to be of use as models evolve to learn more incrementally, I think, as "snapshots" of the model at various intervals in time. It could be that certain versions are best at a specific niche, making services like these more than curiosities.

I think there is significant potential for social networks that incorporate art generation to spring up around particular themes. One example is RPG gamers using AI-generated art in campaigns. It could be used to generate portraits, settings, dungeon tiles, and myriad miscellaneous objects. Correctness isn't of paramount importance in many of these applications, since the value comes from the theme of the artwork.

Changing Our Perception of Art

It could be argued that art is fundamentally a communication between the artist and the audience. In the case of art generated by an AI, who is the artist? What was the artist trying to communicate? Having an AI generate the art changes our relationship with it, and maybe challenges some to question whether something generated by a computer can be art.

I think these are interesting questions to explore, but in practice, it's clear that most folks like art even when there isn't a clear message from the artist...sometimes an image just looks cool. I think this is the area where AI-art shines, even though it will probably lead to a lot of hand-wringing in the art community.

This will lead to a shift in how we see art, though. We'll think less in terms of dictating exactly what things will look like, and instead start with a rough idea, iterate towards a "good enough" prompt, generate a few dozen images, and select the best. This is a much more opportunistic approach to art than a commissioned painting or photograph. For many applications, I suspect this approach will provide great results at a tiny cost, even though it removes a huge amount of control over the output.
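
The generate-many-and-pick-one workflow can be sketched in a few lines of Python. This is only an illustration under loud assumptions: generate_candidate is a hypothetical stand-in for one model run, and the random "quality" number it returns stands in for a person's judgment, which in practice is what does the selecting.

```python
import random
from typing import Callable, Tuple

# A candidate is (seed, quality); the quality score is a made-up stand-in
# for a human looking at the image and deciding how much they like it.
Candidate = Tuple[int, float]


def generate_candidate(prompt: str, seed: int) -> Candidate:
    # Hypothetical model run: the same seed always yields the same result.
    rng = random.Random(seed)
    return seed, rng.random()


def best_of(
    prompt: str,
    n: int,
    generate: Callable[[str, int], Candidate] = generate_candidate,
) -> Candidate:
    # Generate n candidates from one prompt and keep the highest-rated one.
    if n < 1:
        raise ValueError("need at least one candidate")
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda candidate: candidate[1])


seed, quality = best_of("a futuristic city at night", 36)
print(f"kept the image from seed {seed}")
```

Keeping the seed of the winner matters in this style of work: it is what lets you regenerate the same image later, for example to upscale it.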

The End of an Era for Copyright

A Futuristic City at Night, generated using the 'Stable' model on NightCafe

I've studied copyright for most of my life as a hobby, and while I'm still far from having any expertise, it's been clear to me for some time that copyright is in some sense a deal with the devil: we want to enable people to live by creating artistic work, but at the same time we seek to violate a law of nature by trying to restrict the expression of ideas. This is ultimately doomed to fail to some degree, and the balance is inevitably struck in the day-to-day spats between those claiming copyright rights over works that others put online. Whether something is "transformative" enough to warrant the claim becomes a question for the courts.

While the legal landscape is a scarred battlefield of disparate skirmishes, the day-to-day reality is much more stark: all of the AIs I've mentioned are trained on vast data sets scraped from the internet wholesale, regardless of copyright status. This includes everything from GitHub's Copilot to Stable Diffusion. When we input a prompt to an instance of Stable Diffusion, or ask Copilot to suggest code, the neural network we're running our prompt through has been shaped by the billions of images or billions of lines of code in the dataset it was trained on. Attributing AI output to particular training data would be akin to paying teachers a portion of the salary their students end up making due to that particular teacher's instruction. Over-monetizing is like over-fitting an ML model: it leads to poor outcomes.

Realistically, if we vastly lower the cost of producing something, those whose salaries contributed to the high costs will suffer a loss of income. This can seem unjust, but it lies precisely at the center of what makes the technology so powerful. Since this takes away jobs for humans, there will always be a temptation to fight against the tide of automation. The Financial Times (archived link due to paywalls) recently reported on increased concern among voice actors about voice AIs that can replicate their voice cheaply. The UK is attempting to address this by changing copyright law so that AI recreations of an actor's voice will generate income for the actor being emulated by the AI.

This is likely a mistake, as copyright is about the rights of the creator, but an actor ceases to be the creator if the voice performance is generated by an AI without the actor's involvement. The result is certainly based on the actor's work, in the same way a software system written by a programmer continues to generate revenue for their employer after they stop working on the software, and even after they have left the company. Software engineers are paid a salary for their time working, not on the basis of how much revenue is generated from a particular piece of code they wrote. This works for software engineers because there is always more work to do in software, so retaining engineers on staff is common. A voice actor, by comparison, can contribute their voice to model training, but may not have continued voice work to do that justifies a salary (I'm making this up, actually...I can imagine a business that hires voice actors full time and generates voice models for a wide variety of uses, offers old versions, and updates continually to improve their quality).

The common way to address this is to simply jump to royalties, but that replaces the problem of how to pay artists with another problem: we now need to define what kind of intellectual property should cover the likeness of someone's voice. What about cases where the voice can be synthesized without any work on the part of the artist? Are they entitled to money earned by others that replicated it? These questions are certainly not covered by trademark, copyright, or patents. Given the amount of effort that went into creating the existing systems (literal decades of work went into the revisions to copyright in the 1970s), it's a fair bet that this won't get resolved soon.

Which provides a nice segue to an interesting question: what is the copyright status of an image generated by one of these AIs? While the internet has been antithetical to copyright since the beginning, the emergence of AIs that are trained on all available data (regardless of copyright status) has significantly altered the landscape in ways that violate the assumptions that underpin copyright. Specifically, copyright law has little notion of artistic works that are both transformative and produced mechanically. The only case law I'm aware of is Authors Guild, Inc. v. Google, Inc. (2013). While there were a lot of twists, in the end it was quite clear that Google scanning 20M books and OCR'ing them for public use in search constituted a benefit for all of society. I can see similar arguments being made in support of Stable Diffusion being trained on copyrighted data.

But there are still challenges. Since the training set can contain data under any copyright status, and because AIs can regurgitate data from their training, this leads to uncomfortable situations where the AI appears to simply plagiarize from the training data. While the model is specifically trained to avoid producing output from the training set, it's still fundamentally statistical. There will be exceptions. This suggests that we have not yet converged on an agreement about how intellectual property intersects artificial intelligence, either in terms of training or output.

Pausing to think about it, it's not particularly surprising that, with the development of increasingly sophisticated artificial brains, questions start to arise about how we choose to monetize creative work. In particular, the assumption that creative works are generated by people seems to be eroding. Combined with the internet's strength at copying data instantly for almost-free, the foundations copyright was built on are crumbling. We need to rethink the system.

It's Just the Beginning

We've barely gotten started. These models have all kinds of weaknesses, many of which seem completely surmountable.

Orcish Monuments in the Countryside, generated by 'Stable' AI on NightCafe

  • Persistent characters: it's not hard to imagine an AI trained on a vast data set, and then augmented with domain-specific understanding of a specific character or setting. For example, an AI specifically trained on every Spider-Man comic ever drawn that's able to replicate the look of that character at various times across the decades.
  • Motion: img2img algorithms exist to enhance images, and some models (like coherence on NightCafe) can do motion, but most models still generate one-off images. The ability to retain state between frames to generate smooth video will unlock lots of new possibilities, like a version of TikTok that doesn't just choose what content to show based on your viewing, but generates content on the fly based on what you've spent time watching.
  • Easy Orchestration: the explosion of models specializing in various areas will lead to a need to orchestrate them into a pipeline for art creation. One can imagine having one AI to generate a setting, another to enhance with characters, another to generate keyframes, another to interpolate frames between them, and another to upscale everything.
  • Improved prompts: today, getting realistic results is a bit of a dark art, requiring strange phrases like "trending on artstation, 8 k, dramatic lighting, ultra detailed". These are due to the quirks of how the model was trained, and will likely be considered an artifact from a more primitive time in the future.
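
As a toy illustration of the keyframe step in the orchestration bullet above, here is a Python sketch that fills in frames between keyframes. It uses plain linear interpolation on lists of numbers, which is an assumption chosen for simplicity and is not how a learned interpolation model would work. The Frame type and both function names are hypothetical.

```python
from typing import List

# A frame reduced to a flat list of pixel intensities (hypothetical format).
Frame = List[float]


def interpolate(start: Frame, end: Frame, steps: int) -> List[Frame]:
    """Return `steps` in-between frames, excluding both keyframes."""
    if len(start) != len(end):
        raise ValueError("keyframes must have the same size")
    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)  # fraction of the way from start to end
        frames.append([a + (b - a) * t for a, b in zip(start, end)])
    return frames


def expand(keyframes: List[Frame], steps: int) -> List[Frame]:
    """Insert `steps` interpolated frames between each pair of keyframes."""
    if not keyframes:
        return []
    video = [keyframes[0]]
    for previous, following in zip(keyframes, keyframes[1:]):
        video.extend(interpolate(previous, following, steps))
        video.append(following)
    return video


keyframes = [[0.0, 0.0], [1.0, 2.0], [1.0, 0.0]]
video = expand(keyframes, steps=3)
print(len(video), "frames from", len(keyframes), "keyframes")
```

A real pipeline would swap interpolate for a model that understands motion, but the surrounding structure (generate a few expensive keyframes, then fill in the cheap frames between them) would look much the same.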

I don't think any of these improvements are even years away. Persistent characters seem the most challenging, but even that could emerge by late 2023, I think. Things are moving quickly.