I am really really loving Laufey right now; I’m just in the perfect head space for everything she’s doing.
Just another photo of riding the LRT
Good Information
I done goofed, and found this fantastic Edmonton-based jazz band “Good Information” immediately after they played the Edmonton Jazz Festival, rather than before. So alas, I missed my opportunity to see them. Hopefully just for now.
Give them a listen:
Very excited to hear they’re putting together more music with an expected album this summer.
Shot and Chaser, New and Old
Legends and Lattes
Legends and Lattes was exactly the book I needed at the time I needed it.
A fantasy book where the stakes are far from the end of the world, with a loving centre and an enjoyable ride.
I burned through it in about two days; now I wish there were a half dozen more like it.
Crossing an AI Rubicon: Image Generation
(This is the second in a post series that starts here)
My story with Image Generation starts with DALL-E, and so I will start there. I then cover Stable Diffusion and Midjourney before heading into some thoughts — It’s hard to call what I have a conclusion, since I feel so utterly inconclusive about this technology. (Note: Many of the galleries below have captions and commentary)
DALL-E 2
A painting of single poplar tree in fall with leaves falling, lit just before golden hour, that evokes feelings of nostalgia and warmth. This was the prompt that gave me my first result from DALL-E that made me go "oh shit."
It's not a perfect painting; there's certainly some oddities… but this looked way better than it had any right to be.
How did I get here?
I was creating slides for my CMPUT 229 class, and I was discussing one of my favourite assembly mnemonics, eieio, which always puts the song "Old Macdonald" in my head. The slide was a bit barren, so I thought, it would be nice to have some art for this. I'd just been reading a bit about DALL-E, and so I signed up for an account, and after a bit of trial and error had an image I could use for my class.
“A fine art painting of a farmer holding a microprocessor”
The experience of playing with DALL-E was interesting. The prompts they display on the front page are often very simple, yet produce surprisingly coherent results. In reality, excellent results seem to take a bit more effort than the simple prompts they propose — or perhaps it’s a question of luck, and access to many, many generations of the same prompt.
DALL-E intrigued me heavily, so I played with it, up to the limit provided by their free credits. If you’re even remotely interested in this stuff, I’d encourage you to play with this as well. Even if you find the whole idea viscerally upsetting, it’s worth playing to figure out the strengths and weaknesses — and to be sure, there are weaknesses.
Of course, I opened this post being impressed: There certainly were a few results I found impressive. Even in failure, DALL-E often produced images that were nevertheless aesthetically pleasing (for example, I quite like the failed John Constable painting above).
Unfortunately, the limited credits that came for free with DALL-E limited my ability to explore these systems. I sought out other choices, and the obvious next thing to explore was…
Stable Diffusion
Stable Diffusion is an image generation system whose model has been released publicly; this has led to a number of implementations of the algorithms, and apps that wrap everything up to make local generation possible.
My experience with Stable Diffusion has largely been that the results are not quite up to par with what DALL-E can provide. Partially this is because the model is optimized for producing 512x512 images, where DALL-E does 1024x1024. But more generally I’ve found that prompts that produce lovely results in DALL-E don’t produce results of nearly the same quality with Stable Diffusion.
Having said that, the ability to iterate has been interesting. I’ve played with two wrappers around Stable Diffusion: DiffusionBee and Draw Things AI (very powerful, but I’m not going to lie, the interface is baffling), as well as a Python library (the one that powers DiffusionBee, I think?).
Perhaps the most interesting thing I’ve found with these tools is the ability to play with parameters. For example, you can hold the random generation seed fixed but vary your prompt, to interesting effect:
Notice how the composition mostly stays the same; this is a side effect of the same starting seed. Using a command-line version of Stable Diffusion, I have done a bit of larger-scale experimentation with changing the prompt while holding the seed still, producing some interesting effects:
“Still life of hydrangeas, artist born around X”, for X in [1400, 2025] in 25 year increments…
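For anyone curious, here’s a rough sketch of how that sweep might look using the Hugging Face diffusers library — not the exact tooling I used, and the model name, seed, and file naming are illustrative:

```python
def hydrangea_prompts(start=1400, end=2025, step=25):
    """One prompt per 25-year cohort of artist birth years."""
    return [f"Still life of hydrangeas, artist born around {year}"
            for year in range(start, end + 1, step)]

def render_series(prompts, seed=1234, model="runwayml/stable-diffusion-v1-5"):
    """Render each prompt with the SAME starting seed, so the composition
    stays roughly stable while the style drifts with the prompt."""
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model)
    for prompt in prompts:
        # Re-seeding before every image is what holds the composition still.
        generator = torch.Generator().manual_seed(seed)
        yield pipe(prompt, generator=generator).images[0]

# Example: save the whole sweep to disk.
# for i, image in enumerate(render_series(hydrangea_prompts())):
#     image.save(f"hydrangeas_{i:02d}.png")
```

The key design point is resetting the generator to the same seed before every image; without that, each image would start from different noise and the compositions would diverge.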
Another interesting parameter exposed by these tools is the “guidance” parameter, which, as I understand it, controls how strongly the model tries to take your prompt into account. Using 0 (“don’t care about my prompt”) has produced some wild images:
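A sketch of a guidance sweep, again assuming the diffusers library (where the parameter is exposed as `guidance_scale`); the model name and seed are illustrative:

```python
def guidance_sweep(low=0.0, high=15.0, steps=6):
    """Evenly spaced guidance values, starting at 0 ('ignore my prompt')."""
    return [low + (high - low) * i / (steps - 1) for i in range(steps)]

def render_at_guidance(prompt, guidance, seed=42,
                       model="runwayml/stable-diffusion-v1-5"):
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model)
    # Fixed seed, so the only thing changing between images is guidance.
    generator = torch.Generator().manual_seed(seed)
    # guidance_scale near 0 essentially ignores the prompt; ~7.5 is the
    # library's usual default.
    return pipe(prompt, guidance_scale=guidance,
                generator=generator).images[0]
```

Holding the seed fixed while sweeping guidance makes for a nice side-by-side: the same starting noise, steered toward the prompt with increasing force.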
Midjourney
Midjourney is hard for me to write about, because I don’t understand it. It’s extremely clear they’re doing something clever, as Midjourney can often produce the most remarkable images from the simplest of prompts. Take a brief look through the Midjourney showcase, or look at these (deservedly!) New York Times Feature Article worthy images. Yet I have no idea how or why it works the way it does. I also find it painful to explore, as the interface (at least for free users) is a very noisy set of hundreds of channels on Discord; nothing like experimenting in public.
Despite the discomfort of working in public, it’s interesting to see what others produce. Some prompts are simple, some are complex, but I’m almost uniformly impressed by the results produced by Midjourney.
If I were an artist, Midjourney would be what scared me most — it’s clearly pulling modern styles from artists and reproducing them, sometimes with upsetting fidelity. When I showed Andrea the gallery, she said “it reminds me of my Instagram feed”.
Someone described AI art as “discovery”, which does feel at least a bit apt; having said that, Midjourney has clearly torqued itself to hit certain aesthetics with minimalist prompts.
Conclusions
It seems pretty clear that the ability to generate “good enough” art is going to have some very wide-ranging impacts. As I said in my first post, the discussion of this is extremely challenging to separate from Capitalism. Some people are going to lose their jobs; more as these models get better. Will new jobs be created as a result? It seems to me that this is yet another automation that eliminates a class of jobs, leaving a smaller number of more valuable positions; another brick on the pedal of inequality.
I haven’t even touched on the questions of art and artistry here: Are the products of these systems art? Are prompt writers artists? Perhaps another post for another day…
Assorted Observations & Notes
My understanding of Stable Diffusion is that the model was trained on a data set released by LAION. There are a couple of tools to explore the data set used to train Stable Diffusion. I’ve played with this one, described here (note, there is NSFW content). Something that truly surprised me was the low quality of the captions. I had really expected that, to produce good results, the models would need excellent, structured captions, yet that’s clearly not the case.
All these models thrive on the constraints provided by giving them an artist to ape. Look at galleries of AI-generated art, like the Midjourney Showcase, and you’ll see a good number of the prompts including artists by name, sometimes many of them. For some reason “by Van Gogh” doesn’t nauseate me nearly the way “by Greg Rutkowski” does; this may just be the question of Capitalism again. There are already horrifying stories of models trained on single artists.
In a sense, my feelings about these programs are not directly affected by how they’re implemented; yet I find myself compelled to figure more out. I have only a rough understanding at the moment of how these systems are trained and deployed.
This blog series by Lior Sinai, though I’m only through part one, seems very promising. It’s pushing my math skills though.
These models are far from the end of this work; Google has Imagen, Imagen Video, and Imagen Editor baking. Impressive results. The section on “Limitations and Societal Impact” is a worthwhile read: “There are several ethical challenges facing text-to-image research broadly. We offer a more detailed exploration of these challenges in our paper and offer a summarized version here. First, downstream applications of text-to-image models are varied and may impact society in complex ways. The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo.”
Crossing an AI Rubicon
“A painting of a new disruptive artificial intelligence technology, large models, that will change some things forever, but probably not everything”
I remember being in my CMPUT 466 Machine Learning class in Fall of 2011, when the prof started explaining deep learning. For a brief shining moment it felt like I had understood how deep learning worked… and then the math and the understanding largely abandoned me. Despite getting an A- in that course, I never felt confident in the area.
My interests of course were drawn elsewhere, but I had many opportunities to explore machine learning in various forms. Fairly consistently, though, when an opportunity involved machine learning, I turned it down. For whatever reason, deep learning and its applications never spoke to me, and never really attracted me. Much of it felt like smoke and mirrors; part of this was watching so many projects consume large amounts of resources, only to fail to find deployment. The places where it seemed to work never stood out to me. I am absolutely certain that to practitioners deep learning models felt revolutionary, but I didn't see it myself, and so didn't feel compelled to pay attention.
ChatGPT, DALL-E and Midjourney have forced me to acknowledge: We have crossed some sort of Rubicon with these large model technologies. I no longer have the option of ignoring them.
Yet, despite knowing that I have to pay attention… I have struggled mightily to form coherent thoughts here. Since I haven't paid attention, I feel a bit like Rip van Winkle, awaking after twenty years into a future I barely understand. There are so many dimensions here that it's hard to figure out what to think about any of them; certainly there are many ways in which the dimensions cross.
I want to write down some thoughts about all of this (like every other nerd on the internet), so expect a few blog posts on this subject over a little while. Expect me to alternate hugely between wonder and loathing. Talking this over with a friend, one thing stood out hugely in our conversation: you can draw wildly different conclusions about this technology depending on whether or not you start from the presumption of capitalism. This is probably true of all automation technology, but it's pretty clear already that image generation technology is going to put some artists out of work, and in a world where these artists need to make art to eat, that's an upsetting outcome.
I have much to wrestle with, and it's challenging to sort through my thoughts on this. I think the best way for me to organize myself on this is to divide this initial thinking into two pieces: First, I will cover image generation using tools like Stable Diffusion, DALL-E and Midjourney. Next time I will write about ChatGPT.
CMPUT 229 Haikus
I wanted to give students an opportunity to demonstrate a bit of creativity on the final, and give me a bit of a bright spot while marking, and so the final question on my final exam was the following:
I had many excellent examples from my students, some of which made me laugh out loud. To avoid this becoming a whole poetry collection, I limited myself to five students, and asked their permission to post their haikus here.
Without further ado, in a random order:
With Tamed Lightning,
A world solemnly marches,
Keeping rhythmic time
☆
Winter, sun shines not,
Though not halting cosmic rays
A bit flips, blue screen
☆
The Stack Canary
Sings a song to keep away
Those who come to play
☆
Pipelines make it fast
But cause debilitating
headaches for students
☆
Never Code in C,
Not Good for Security,
Segmentation Fault
☆
This was a nice little exercise in fun. Next final, I’ll ask for permission to use them on the final itself, just to avoid having to bother students later!