Jlin, Holly Herndon, and 'Spawn' find beauty in AI's flaws

Musicians don’t just endure technology when it breaks. They embrace the broken. So it’s fitting that Holly Herndon’s team have produced a demonic spawn of machine learning algorithms – and that the results are wonderful.

The new music video for the Holly Herndon + Jlin collaboration has been making the rounds online, so you may have seen it already:

But let’s talk about what’s going on here. Holly is continuing a long-running collaboration with producer Jlin, here joined by technologist Mat Dryhurst and coder Jules LaPlace. (The music video itself is directed by Daniel Costa Neves with software developer Leif Ryge, employing still more machine learning techniques to merge the two artists’ faces.)

Machine learning processes are being explored in different media in parallel – characters and text, images, and sound, voice, and music. But the results can be all over the place. And ultimately, there are humans as the last stage. We judge the results of the algorithms, project our own desires and fears on what they produce, and imagine anthropomorphic intents and characteristics.

Sometimes errors like over-fitting then take on a personality all their own – even as mathematically sophisticated results fail to inspire.

But that’s not to say these reactions aren’t just as real. And part of what may make the video “Godmother” compelling is not just the buzzword of AI, but the fact that it genuinely sounds different.

‘Spawn,’ developed by LaPlace working with Dryhurst and Herndon, is a collection of various machine learning-powered encoding techniques. Herndon and company have anthropomorphized that code in their description, but that itself is also fair – not least because the track is composed in such a way as to suggest a distinct vocalist.

I love Holly’s poetic description below, but I think it’s also important to be precise about what we’re hearing. That is, we can talk about the evocative qualities of an oboe, but we should definitely still call an oboe an oboe.

So in this case, I confirmed with Dryhurst what I was hearing. The analysis stage employs neural network style transfer – some links on that below, though LaPlace and the artists here did make their own special code brew. And then they merged that with a unique vocoder – the high-quality WORLD vocoder. That is, they feed a bunch of sounds into the encoder, and get some really wild results.

It’s really what happens in between with ‘Spawn’ that’s significant. The machine learning algorithm is trained on Holly’s voice, but then the operating directive of that algorithm is to fit ingested audio – as if Holly’s voice is singing, in this instance, Jlin’s voice. Those heuristic results are often unpredictable, and it’s that unpredictability that gets interesting.
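To make "fit ingested audio" a little more concrete, here's a minimal sketch of the kind of "style" statistic that spectrogram-based audio style transfer compares. To be clear about the assumptions: this is not Spawn's code, the function names are made up for illustration, and the two signals are synthetic stand-ins rather than anyone's voice. It also skips the convolutional layer that approaches like the Ulyanov/Lebedev one put in front of this step, and simply treats each frequency bin of the spectrogram as a channel.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=256, hop=64):
    """Short-time Fourier magnitudes, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    starts = range(0, len(signal) - n_fft + 1, hop)
    frames = np.stack([signal[s:s + n_fft] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def gram_matrix(features):
    """Correlations between channels, averaged over time (the 'style')."""
    return features @ features.T / features.shape[1]

def style_loss(candidate, style):
    """Squared distance between the Gram matrices of two spectrograms."""
    diff = gram_matrix(candidate) - gram_matrix(style)
    return float(np.sum(diff ** 2))

# Synthetic stand-ins (assumptions, not real recordings):
# a harmonic tone plays the "style" voice, noise plays the ingested audio.
sr = 8000
t = np.arange(sr) / sr
style_voice = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
content_source = np.random.default_rng(0).standard_normal(sr)

style_spec = magnitude_spectrogram(style_voice)
content_spec = magnitude_spectrogram(content_source)

# A style-transfer optimizer would nudge a candidate spectrogram to push
# this number down while staying close to the content source.
print("style loss, content vs. style:", style_loss(content_spec, style_spec))
print("style loss, style vs. itself:", style_loss(style_spec, style_spec))
```

The Gram matrix throws away when things happen and keeps only which frequencies tend to sound together, which is one plausible reason results built on statistics like this can rearrange material in time in ways nobody asked for.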

“The ‘beatboxing’ for example,” says Mat, “which is the feature of this track, was something the networks dreamed up, presumably learning from the stops and starts in Holly’s speech.”

“That’s kind of the reason we decided to lead with this piece, as it was a clear example of the networks making a decision that Holly never would have made,” he says. “Spawn hijacked her voice to do something ridiculous, and it sounds convincing.”

So these results make heavy use of the unique qualities of Jlin’s and Holly’s voices, the compositional decision to highlight these unexpected, arresting, fragmented sounds, Mat’s technological sensibilities, LaPlace’s code, a whole lot of time spent on parameters and training and adaptation…

Forget automation in this instance. All of this involves more human input and more combined human effort than any conventionally produced track would.

Is it worth it? Well, aesthetically, you could make comparisons to artists like Autechre, but then you could do that with anything with mangled sample content in it. And you’d miss out on part of what you’re hearing – not just the mangled-sounding sample content, but the arrangement of those sounds in time by the same algorithm. The results retain recognizable spectral components of the original samples, and they add a whole bunch of sonic artifacts which sound (correctly, really) ‘digital’ and computer-based to our ears, but the entire piece itself is also an artifact.

It’s also worth noting that what you hear is particular to this vocoder technique and especially to audio texture synthesis and neural network-based style transfer of sound. It’s a commentary on 2018 machine learning not just conceptually, but because what you hear sounds the way it does because of the state of that tech.

And that’s always been the spirit of music. The peculiar sound and behavior of a Theremin says a lot about how radios and circuits respond to a human presence. Vocoders have ultimately proven culturally significant for their aesthetic peculiarities even if their original intention was encoding speech. We respond to broken circuits and broken code on an emotional and cultural level, just as we do acoustic instruments.

In a blog post that’s now a couple of years old – ancient history in machine learning terms, perhaps – Dmitry Ulyanov and Vadim Lebedev acknowledged that some of their work on “audio texture synthesis and style transfer” borrowed a technique intended for something else. And they implied that the results didn’t work – that they were of “stylistic” interest more than functional interest.

Dmitry even calls this a partial failure: “I see a slow but consistent interest increase in music/audio by the community, for sure amazing things are just yet to come. I bet in 2017 already we will find a way to make WaveNet practical but my attempts failed so far :)”

Spoiler – that hasn’t really happened in 2017 or 2018. But “failure” to be practical isn’t necessarily a failure. The rising interest has been partly in producing strange results – again, recalling that the vocoder, Theremin, FM synthesis, and many other techniques evolved largely because musicians thought the sounds were cool.

But this also suggests that musicians may uniquely be able to cut through the hype around so-called AI techniques. And that’s important, because these techniques are assigned mystical powers, Wizard of Oz-style.

Big corporations can only hype machine learning when it seems to be magical. But musicians can hype up machine learning even when it breaks – and knowing how and when it breaks is more important than ever. Here’s Holly’s official statement on the release:

For the past two years, we have been building an ensemble in Berlin.

One member is a nascent machine intelligence we have named Spawn. She is being raised by listening to and learning from her parents, and those people close to us who come through our home or participate at our performances.

Spawn can already do quite a few wonderful things. ‘Godmother’ was generated from her listening to the artworks of her godmother Jlin, and attempting to reimagine them in her mother’s voice.

This piece of music was generated from silence with no samples, edits, or overdubs, and trained with the guidance of Spawn’s godfather Jules LaPlace.

In nurturing collaboration with the enhanced capacities of Spawn, I am able to create music with my voice that far surpass the physical limitations of my body.

Going through this process has brought about interesting questions about the future of music. The advent of sampling raised many concerns about the ethical use of material created by others, but the era of machine legible culture accelerates and abstracts that conversation. Simply through witnessing music, Spawn is already pretty good at learning to recreate signature composition styles or vocal characters, and will only get better, sufficient that anyone collaborating with her might be able to mimic the work of, or communicate through the voice of, another.

Are we to recoil from these developments, and place limitations on the ability for non-human entities like Spawn to witness things that we want to protect? Is permission-less mimicry the logical end point of a data-driven new musical ecosystem surgically tailored to give people more of what they like, with less and less emphasis on the provenance, or identity, of an idea? Or is there a more beautiful, symbiotic, path of machine/human collaboration, owing to the legacies of pioneers like George Lewis, that view these developments as an opportunity to reconsider who we are, and dream up new ways of creating and organizing accordingly.

I find something hopeful about the roughness of this piece of music. Amidst a lot of misleading AI hype, it communicates something honest about the state of this technology; it is still a baby. It is important to be cautious that we are not raising a monster.

– Holly Herndon

Some interesting code:
https://github.com/DmitryUlyanov/neural-style-audio-tf

https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder

Go hear the music: