You’ve almost certainly seen a torrent of high-quality text-to-image results lately. Now one of the most impressive generative tools has published its code and opened up to academic researchers. It’s called Stable Diffusion – and you’ll probably be hearing a lot more about it soon.

The launch

Emad Mostaque from Stability AI has the news – no doubt timed to this month’s SIGGRAPH event:

Stable Diffusion launch announcement

And it’s now live on GitHub (though you’ll need to submit academic institutional credentials to gain access to the checkpoints, if you aren’t already in the now-closed beta):

https://github.com/CompVis/stable-diffusion

Those of you who have already been following this one probably jumped straight through to the blog and GitHub repository, but here’s the short version for everyone else:

Recipe for Stable Diffusion

First, start with some of the leading-edge research into high-resolution image synthesis using latent diffusion models. The basic notion: someone else takes care of the machine learning “training” and the dataset – and the massive GPU power consumption that involves – and you just type in text and get whole images back, quickly and without much local computation.
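If you want to see what “just type in text” looks like in practice, here’s a minimal sketch using the Hugging Face diffusers wrapper rather than the repo’s own scripts/txt2img.py – the model ID, prompt, and settings below are my assumptions, so treat this as an illustration of the workflow, not canonical usage:

```python
# Minimal text-to-image sketch using the diffusers StableDiffusionPipeline.
# Assumes you have access to the released weights; the model ID is an assumption.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,  # half precision keeps VRAM use modest
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)  # fixed seed, reproducible output
image = pipe(
    "a lighthouse on a cliff at dusk, oil painting",
    num_inference_steps=30,  # denoising steps: more is slower, often cleaner
    guidance_scale=7.5,      # how strongly the sampler follows the text prompt
    generator=generator,
).images[0]
image.save("lighthouse.png")
```

The heavy lifting – training on billions of image-text pairs – already happened on someone else’s cluster; the call above only runs the comparatively cheap sampling loop.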

This particular version of the model is creating images that are stunningly coherent – and that in turn has come out of a lot of human collaboration and a lot of data science (both in terms of the research and the scale of the data being used). As they explain:

The model itself builds upon the work of the team at CompVis and Runway in their widely used latent diffusion model combined with insights from the conditional diffusion models by our lead generative AI developer Katherine Crowson, Dall-E 2 by Open AI, Imagen by Google Brain and many others.

There’s also big data used to weight the big data itself – human ratings shaping what the model learns, much as social media currently works. So a CLIP-based model “filtered LAION-5B based on how ‘beautiful’ an image was, building on ratings from the alpha testers of stable diffusion.”
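For a rough sense of how that kind of aesthetic filtering works mechanically, here’s a hypothetical sketch: embed each image with CLIP, score the embedding with a small head trained on human ratings, and keep only what clears a threshold. The head weights, file names, and threshold are invented for illustration – the actual LAION-Aesthetics pipeline differs in its details:

```python
# Hypothetical sketch of aesthetic filtering: CLIP image embeddings scored by a
# small learned head trained on human "beauty" ratings. The head weights
# ("aesthetic_head.pt"), image paths, and threshold are placeholders.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

# Linear head mapping the 768-dim CLIP image embedding to a single rating score.
aesthetic_head = torch.nn.Linear(768, 1).to(device)
aesthetic_head.load_state_dict(torch.load("aesthetic_head.pt", map_location=device))

def aesthetic_score(path: str) -> float:
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        emb = model.encode_image(image).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize the embedding
        return aesthetic_head(emb).item()

candidate_paths = ["cat.jpg", "landscape.jpg"]  # stand-ins for a scraped dataset
keep = [p for p in candidate_paths if aesthetic_score(p) >= 5.0]  # threshold is arbitrary
```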

Humans, machines, big data, and openness

That’s a big deal. If these look really good, it’s partly because humans are choosing what looks good. Note that this could also shift as tastes change over time – any old-timers remember when Flickr was littered with HDR images? Or check this article from 2006. Part of what we’re seeing in machine learning is an extension of what we’re seeing in social media – large groups of humans deciding collectively what is and isn’t valuable, for better and for worse.

How big is that task? Well:

The model was trained on our 4,000 A100 Ezra-1 AI ultracluster over the last month as the first of a series of models exploring this and other approaches.

We have been testing the model at scale with over 10,000 beta testers that are creating 1.7 million images a day.

To me, limitations and dependencies like these – human taste baked into the data, massive compute behind the curtain – are fascinating, and I think they can deepen appreciation for what you’re looking at and how it might be used. So this isn’t meant as a criticism, though it could invite some.

The flipside of centralizing the training computation is that your local computer doesn’t have to do much work at generation time – a consumer GPU with less than 10 GB of VRAM can make a 512×512 image in seconds, the creators say.
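If you want to check that claim on your own hardware, the usual knobs are half precision and sliced attention; here’s a sketch (again assuming the diffusers wrapper and the same assumed model ID) with a peak-memory readout at the end:

```python
# Sketch of keeping 512x512 generation inside a consumer GPU's memory budget.
# The model ID and exact savings are assumptions; measure on your own card.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,  # half-precision weights roughly halve memory
).to("cuda")
pipe.enable_attention_slicing()  # chunked attention: lower peak VRAM, slightly slower

torch.cuda.reset_peak_memory_stats()
image = pipe("a foggy harbor at dawn", height=512, width=512).images[0]
image.save("harbor.png")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```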

Unlike some things that call themselves “open,” the LAION project on which this is based is a non-profit with a mix of members, funded entirely by donations and public research grants. They describe themselves as “aiming to make large-scale machine learning models, datasets and related code available to the general public.”

But what are these images?

It’s important to understand that the AI is not “creating” images, despite the “generative” label. It’s more accurate to say that what you’re looking at, convincing as it is, is emergent from patterns in the data, guided by your text prompt. That might sound like the same thing, but it isn’t. What you see says a lot about the contents of the dataset, and about the particular way these diffusion techniques navigate it based on your text.

It’s also important not to retroactively apply the way machine learning works to your brain. So no, absorbing images and then spitting them out like this is probably not quite how your brain works. That kind of projection has happened a lot in tech history, which is why past centuries imagined brains as clocks (no, not at all) or computers (still no). To really delve into that, and into how the brain actually works with images, we should get a conversation going with neurologists and psychologists and maybe an anthropologist. I’m of less use as a musicologist and media art historian or whatever I am, but … well, I could probably get them talking.

I do expect we’ll see a lot of these images, and that they will disrupt businesses around stock imagery and illustration. That disruption could create some opportunities – and some pain. I’d have to be really technologically optimistic to imagine it’s all good and all “progress.” We could also see so much of this sort of imagery that eventually there’s a counter-reaction and… I’m getting ahead of myself. But this will be one to discuss – including, as it happens, on the panel I’m moderating at MUTEK later this month, if you’re around in Montreal.

As impressive as the generated images are, it’s also likely that these techniques will be applied in other contexts – image repair, upscaling, 3D applications, and some really wild possibilities for editing and beyond.

And yes, I’ve tagged this under ‘motion’ partly because motion artists are deeply involved in exploring machine learning, but also because no doubt we’ll see moving-image applications of this technique soon, too.

I’m sure we’ll be talking about this, so feel free to argue with me – and I’m curious how folks are using these tools artistically (or not), what research you’re doing, and how you’re thinking about them.

All images published on GitHub using Stable Diffusion.