The Wire

In Writing

Album stream and interview: David Kant of Happy Valley Band talks about their ‘machine listening’ album

February 2017

New album ORGANVM PERCEPTVS is what happens when machines listen to pop songs and humans transcribe and play the results, says Happy Valley Band leader David Kant

The new Happy Valley Band album ORGANVM PERCEPTVS is a compilation of pop standards by the likes of Neil Young, Patsy Cline and Madonna interpreted as you’ve never heard them before – because they’re played as computers hear them. HVB’s founder David Kant began the project in 2009, as part of an exploration into auditory source separation and signal processing software. He takes pop recordings, runs them through source separation software to isolate the vocal track, then analyses the remaining waveforms to separate out the rest of a track’s instrumentation, and transcribes the results into hyper-specific sheet music interpretations for HVB to read and perform. Depending on how the original tracks were recorded, effects like studio panning, doubled instrument lines or subtle shifts of pitch are refracted and amplified through the software in often mystifying ways, resulting in warped interpretations that are unexpected, to say the least.

ORGANVM PERCEPTVS is released on 24 February by Indexical, complete with sleevenotes describing Happy Valley Band’s machine listening processes and analysis by David Kant, and an essay by Wire contributor Kurt Gottschalk. David Kant is interviewed by The Wire’s Deputy Editor Emily Bick.

Emily Bick: First I’d like to ask how you used software to separate the lines of instrumentation and melody. Because one of the things you do is isolate the vocals and they’re very clean.

David Kant: The first step of the process is to separate out all the instruments, and then the instrument parts that the instrumentalists in Happy Valley Band will play get run through a bunch of analysis algorithms to extract the pitch and the rhythms and other sorts of musical features that end up in the notation. But the vocals we leave intact, as clean as they come out, and then play along with them. The software I use is a set of custom tools built on techniques that are popular in machine learning and sound analysis research – machine learning tools for isolating and extracting instrument voices. The field is generally called source separation, or auditory scene analysis. It’s all software that I use to do it, it’s all stuff that I’ve been studying for a few years, but with my version of what to do with it.
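As a rough illustration of the kind of separation step Kant describes – not his actual custom tools – here is a minimal sketch using non-negative matrix factorisation on a magnitude spectrogram, assuming librosa and scikit-learn and a hypothetical input file:

```python
# Minimal sketch of spectrogram-domain source separation via NMF.
# Hypothetical filename and parameters; an illustration of the general
# technique, not the Happy Valley Band tool chain.
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("song.wav", sr=None, mono=True)
S = librosa.stft(y)
mag, phase = np.abs(S), np.angle(S)

# Factorise the spectrogram into spectral templates (W) and activations (H).
model = NMF(n_components=8, init="random", max_iter=400, random_state=0)
W = model.fit_transform(mag)   # (freq_bins, n_components)
H = model.components_          # (n_components, frames)

# Rebuild one "source" from a chosen subset of components via a soft mask.
voice_like = [0, 3]            # deciding which components are "voice" is a manual judgment
estimate = W[:, voice_like] @ H[voice_like, :]
mask = estimate / (W @ H + 1e-9)
separated = librosa.istft(mask * mag * np.exp(1j * phase))
```

Which learned components correspond to which instrument is itself a judgment call – one of the places where the built-in assumptions Kant talks about later enter the process.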

You talk about finding layers of detail that the human ear can’t perceive, but it’s stuff that machines can pick up on and differentiate. However, our listening just doesn’t have the ability to process signals at that kind of speed or in that kind of detail. So I wanted to ask why you chose to focus on that level of detail – transcribing dimensions of sound that we don’t usually put into notation?

For me, it’s not that I think we don’t hear it, I think we do hear it. I think our ears are sensitive to those minute fluctuations, and it’s when you notate it in conventional music notation that that stuff is left out. And so that’s what the project is playing with, the idea that these things are present in the signal. Our ears do hear these small fluctuations in pitch or timbre, we might just hear them as envelopes of sound rather than as individual pitches, and it’s maybe being a little facetious about, well then, what stuff do you put into notation?

There’s one thing in your explanatory essay (included with the sleevenotes to ORGANVM PERCEPTVS) where you talk about how you set the pace to the recorded voice, with the syncopation and so on – instead of having some software that will suck all this music in, like through a MIDI file or something, and set it to an arbitrary pulse.

Right, yeah.

What was the reason behind that? Because it seems like a really particular aesthetic choice.

One of the reasons is practicality – just a matter of getting the band to be able to play this music. If it was transcribed and notated to an arbitrary pulse, I think it would be more difficult. By synchronising it to the actual pulse of the original song, and then playing the vocals, which give the pulse, it keeps the band together. And sure, tons of people use an arbitrary grid, but I thought it would be more interesting to pull the musical information out of what the pulse might have been, so I take the pulse from the original song and then overlay the transcription on that.
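A minimal sketch of what extracting a pulse from the recording itself might look like, assuming librosa’s beat tracker and a hypothetical filename; Kant’s actual method isn’t specified beyond the description above:

```python
# Sketch: estimate the pulse from the original recording rather than
# imposing an arbitrary fixed grid. Assumes librosa; hypothetical filename.
import bisect
import librosa

y, sr = librosa.load("original_song.wav", sr=None)
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = list(librosa.frames_to_time(beat_frames, sr=sr))

def position_in_pulse(onset_time, beats):
    """Express an extracted event relative to the song's own (possibly drifting) beat."""
    i = max(bisect.bisect_right(beats, onset_time) - 1, 0)
    return i, onset_time - beats[i]   # beat index and offset within that beat
```

Because the grid comes from the recording, any drift or rubato in the original performance carries straight through into the notation instead of being quantised away.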

But setting the vocal up as a kind of alternative grid almost suggests that there is something that the song should sound like. So it does set your expectation as a listener to hear something. And it almost amplifies, I guess, what would be heard as an error in some ways – you talk about being ‘wrong’ versus ‘too right’ in your sleevenotes. Could you expand on that, what you want people to be listening to, or hearing in this?

Sure – we keep the voice because it gives some sort of reference. The music that the band plays, that the instrumentalists play, kind of goes in and out of focus. At times it can be really far from the original song, and at other times it can be really close. That’s a function of the two elements of the transcription coming into contact with each other – the original song and the transcription software – vying for what’s really determining the music coming out. Sometimes it’s closer to the original tune, and other times the built-in assumptions or mechanisms by which the analysis operates determine more of what’s coming out than the actual audio that’s going in, if that makes any sense. I think it’s interesting that we’ve decided to keep the voice for both aesthetic and practical reasons, the practical part being trying to get a group to actually play this music and stay synchronised, stay in time with one another. But as far as the errors and the artefacts go, ultimately I want the songs to be something other than the original songs: not a sort of error-laden version of the original tune, but something that takes on a different character, a different quality.

So you’re using them to amplify things that are characteristics of a particular song?

Yeah. The machine learning part of this works in the separation stage. You use a machine learning algorithm that learns to identify the individual instruments and separate them. The reason I use the particular algorithm I use is because it doesn’t require a lot of training. Neural networks and the things that have become a lot more popular in the last few years require a lot of training, and I wanted to work with something that didn’t require so much training, so that I could really use the training to guide what was happening. An example: I was working with the Neil Young tune that’s on the record, the piano and voice Neil Young song. The opening few bars are without voice, so it seemed like a pretty easy separation, because I just trained the classifier on those opening few bars without the voice, and then said: whatever doesn’t look like this is the voice.

So a sort of binary separation, to separate the voice from the piano – that was all there was to it. I ran it through, and by the end of the tune it was a mess. The piano and the voice were just together. I couldn’t separate them, and I was trying to figure out what the problem could be, because it seemed like it should work so well – and the issue was that the recording drifts. The pitch of the recording ends up, I think, about a quarter tone higher by the end. I don’t know why. And that was an issue: because I thought it would be so simple, I had restricted the machine learning algorithm to such a specific piece because I didn’t think there was anything else to it – and it just learned those initial piano chords and couldn’t incorporate the slight change in pitch. So sometimes it’s as specific as that, I just need a few chords. But in some cases it can really overfit – the training can determine what’s coming out more than you want it to.
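A sketch of the ‘learn the voice-free intro, call everything else voice’ idea Kant describes, assuming librosa and scikit-learn, a hypothetical filename and an assumed eight seconds of piano-only intro; it also shows why a fixed set of learned templates can’t follow a recording whose pitch drifts by a quarter tone:

```python
# Fit spectral templates on the voice-free opening bars, explain the full
# song with those templates, and treat the unexplained residual as the voice.
# An illustration, not Kant's actual classifier.
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("neil_young.wav", sr=None, mono=True)   # hypothetical file
S = librosa.stft(y)
mag = np.abs(S)

intro_frames = int(librosa.time_to_frames(8.0, sr=sr))       # assumed piano-only intro length
piano_model = NMF(n_components=12, max_iter=400, random_state=0)
piano_model.fit(mag[:, :intro_frames].T)                      # learn piano templates from the intro
H = piano_model.components_                                   # (components, freq_bins)

# Project the whole song onto the fixed piano templates.
W_full = piano_model.transform(mag.T)                         # (frames, components)
piano_est = (W_full @ H).T + 1e-9

# Whatever the piano templates cannot explain is assigned to the voice.
voice_mask = np.clip(1.0 - piano_est / (mag + 1e-9), 0.0, 1.0)
voice = librosa.istft(voice_mask * S)
```

Because the piano templates are learned once from the intro and never updated, any later detuning makes the piano look ‘unexplained’ and it bleeds into the estimated voice – the kind of artefact the transcriptions then preserve.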

And you can’t really predict when those things will happen either, because in that example it sounds like you thought it would be a very straightforward separation and it was the opposite of that.

The separations would work maybe 70 per cent of the time, but then 30 per cent of the time there were weird artefacts. Things bled together in interesting ways, and that’s what interests me, that’s what I’m trying to reflect through the music: the idea that this learning process might learn it almost, but not entirely. And then what is it about this other 30 per cent? Is it that the training just wasn’t good enough, or that the way the algorithm operates just isn’t quite right, doesn’t quite match what we actually do? Or, I like to think, it’s almost too good. Maybe those are the parts where, if you really extend what it’s doing to the limit, to its logical extension computationally, this is what it finds: hidden relationships between sounds that I wasn’t hearing before. Because I was so set on – there were other factors influencing what I was hearing. I knew how the voice works, and I knew how the guitar works, and I knew that these parts of the sound were separate, but the algorithm finds that, oh, there’s actually this part of the voice that looks a little more like the guitar?

That was what really started to impact how I heard stuff, and then I could start hearing that – you start hearing those relationships all over the place. The idea for me was that maybe our hearing isn’t so fixed, that there’s some malleability to it, and that’s what this project is about. It’s about, ‘OK, how can I reflect that, how can I turn this into music?’ And so what I did was transcribe it back into music. There are a few layers to it, from the separation through to transcription, but I really am trying, in some sense of the word, to do the best job I can when it comes to the transcription, because I’m not trying to introduce extra artefacts – I’m actually trying to reflect and translate into music the artefacts of the separation process. So when there are extra notes in a piano part, or extra snare hits on a drum track, it’s because things are bleeding into each other in the separation, and I’ve tuned the analysis of pitch and rhythm to be sensitive to those, and not filter them out.
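One way that transcription step could look, as a minimal sketch: track pitch and onsets on a separated stem and keep every detected event rather than smoothing the artefacts away. The pyin tracker, thresholds and filename here are assumptions, not the project’s actual analysis chain:

```python
# Sketch: turn a separated stem into raw note events, preserving bleed and
# micro-detuning instead of filtering them out. Hypothetical stem filename.
import numpy as np
import librosa

stem, sr = librosa.load("separated_piano.wav", sr=None)
f0, voiced_flag, _ = librosa.pyin(stem, fmin=librosa.note_to_hz("A0"),
                                  fmax=librosa.note_to_hz("C8"), sr=sr)
onsets = librosa.onset.onset_detect(y=stem, sr=sr, units="time", backtrack=True)
times = librosa.times_like(f0, sr=sr)

notes = []
for t in onsets:
    i = np.argmin(np.abs(times - t))
    if voiced_flag[i] and not np.isnan(f0[i]):
        # Keep the raw frequency rather than snapping hard to equal temperament.
        notes.append((float(t), float(f0[i]), librosa.hz_to_note(f0[i], cents=True)))
```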

You have a really nice line: ‘Machines hear in as many ways as we build them’. And one of the things I really like about your project is that you also seem to be interrogating this idea of algorithms that make so many decisions about what we see and what we consume – media in the larger sense, what we see online, the ads that are served to us. Did you come into this project with this in mind?

Absolutely. When I started, I wanted there to be an answer. I wanted there to be one way – I was, in a sense, very naive. I remember sitting down and trying to figure this stuff out, asking, how do computers hear? There must be an answer to this, and of course there’s no answer. There are so many different approaches, and each approach is laden with its own set of assumptions, and whether those assumptions are explicit or not depends on how it functions. Even something like the pitch tracker, which isn’t a learning technique in that sense, has an idea about what it means to hear a pitch: the whole thing is about measuring pitch in terms of a harmonic model of pitch. And even with learning algorithms, whether they require much training or not, there’s a part of how they operate, the mechanism itself, that determines what comes out. This all became super interesting the more I worked with it, seeing that there’s no objective answer: you can do it this way, you can do it that way, and they all produce different results. I’d hate to say that one’s better than another. Some approaches are more physiological, others more mathematical, or theoretical, and whether one is better than the other, I have no idea. And I was living in this world of computer music, and it occurred to me that this stuff was probably going on a lot outside of my creative practice, in automatic radio airplay, or search algorithms for audio, and it made me think, ‘Well, I wonder what those representations of sound are?’
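As an illustration of the kind of baked-in assumption Kant means, here is a toy harmonic-model pitch scorer: candidate fundamentals are rated by how much spectral energy sits at their integer multiples. It is a generic sketch of the idea, not any particular pitch tracker:

```python
# Toy harmonic-summation pitch model: the "assumption" is that a pitch is
# whatever fundamental best explains energy at integer-multiple harmonics.
import numpy as np

def harmonic_salience(spectrum, freqs, candidates, n_harmonics=8):
    """Score candidate fundamentals by summed energy at their harmonics."""
    scores = []
    for f0 in candidates:
        idx = [np.argmin(np.abs(freqs - k * f0)) for k in range(1, n_harmonics + 1)]
        scores.append(spectrum[idx].sum())
    return np.array(scores)

# Usage on one placeholder frame: pick the highest-scoring candidate.
# Sounds that violate the harmonic assumption (bells, noise, dense mixtures)
# are still forced onto whichever candidate fits best.
freqs = np.fft.rfftfreq(2048, d=1 / 44100)
frame_spectrum = np.abs(np.fft.rfft(np.random.randn(2048)))
candidates = np.arange(55.0, 880.0, 1.0)
best_f0 = candidates[np.argmax(harmonic_salience(frame_spectrum, freqs, candidates))]
```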

Only in the past few years has it come to fruition, at least to me, outside of music. Especially recently, the more we see the impact of learning algorithms in social media determining what content we access through our windows to the world, the more I’ve started to think, wow, this stuff is actually a lot more full blown than I could have even thought. And it made me feel really compelled to put this project out there. I mean, the project works on music, it’s limited to that domain, but I think the implications go beyond it. Whether it’s musical algorithms or algorithms for other outlets, for social media, for news, for whatever it is, the idea is that someone has designed this stuff, there’s something to its operation and someone’s making those decisions, and we should be aware of it – and more than be aware of it, we should continue to have more of an influence on it.

What kind of audience reactions have you had when you’re performing live?

Some people are ecstatic, and they love it, or they say they’ve never heard anything like it, that’s insane, what in the world are you doing – always lots of questions, what is going on – because there hasn’t been an in-depth explanation until I wrote those liner notes that are not even published yet. So yeah, we play, and people are just confused, but excited – those are the good ones, confused but excited. Some people are upset and not happy, and some people are – they want to tell us that we’re doing it wrong, which is fun too.
