Jaan Altosaar and Ethan Benjamin explain their one-stop, browser-based app that creates sample clusters from your MP3 or SoundCloud files. Interview by Emily Bick
MusicMappr is an app that splits any MP3 file or SoundCloud link into samples, runs spectrographic analysis on those samples, then clusters them into similar-sounding groups. The results are displayed in a playable interface that runs in a browser window.
To play a cluster, type the letter at its centre. You can also mouse over individual sample points to play them one at a time, or scroll over the bar graph to the right, which presents the track in a more linear view. The results are screwy and strangely addictive.
Its creators, Ethan Benjamin and Jaan Altosaar, developed the app, in part, to make composition with sampling techniques easier, and accessible to anyone in less than a minute. They posted a step-by-step how-to video on YouTube that goes into greater detail about how MusicMappr works, but it’s easy to start making sounds with a few quick clicks.
Ethan is an R&D engineer at Text IQ, where he develops machine learning and natural language processing algorithms; Jaan is a physics and machine learning grad student, working between Princeton and Columbia Universities. He also makes music as @lyfos (https://soundcloud.com/lyfos) and previously interned at Google DeepMind. They explained a bit about the project and their own interests in music production, data visualisation, and interface design.
Where did the idea for this project come from?
Jaan Altosaar: I’m obsessed with samples when I produce. But it’s not exciting to listen to hi-hats, kicks, snares, claps for hours, just to get to the fun of writing the actual parts of a track. When I noticed this, I was testing machine learning tools for studying Costa Rican bird sounds I had recorded on a rainforest field study. It felt natural to connect these ideas.
Just as I wanted to create a visualisation for different bird species in field recordings, I wanted to see how different parts of a track are related and save time on creating samples. I tried out the method and it was pretty delightful when applied to music rather than birdsong.
This sure removes some of the tedium of sourcing, uploading and converting files to sample – what made you think that this would be a good problem to solve?
JA: Thanks to Reggie Watts, I got an Electro-Harmonix 2880 Super Multi-Track Looper. In contrast to booting up my laptop, plugging in my audio interface, creating a new Ableton set, choosing VSTs, choosing samples and then playing the first notes, all I had to do was hit record, beatbox into the microphone and I had the first inklings of a track.
I was curious if Ethan and I could replicate the same feel in a web browser, and remove the many pains of sampling from music production. Rather than own a copy of Ableton and know how to chop and slice samples, a user just needs to have a web browser to use MusicMappr.
The goal was to go from raw audio to remix in 30 seconds in any web browser, regardless of whether a user had prior experience with sampling.
How did you come up with the interface design that lets users mouse over samples within a cluster, and play them one by one?
JA and EB: We tested a few different interfaces, and this was the simplest one we could think of that allowed enough flexibility for exploring a track’s samples.
The animated visualisation of the analysis looks like a swarm of points that settle into their final configurations, and the marked clusters look a bit like stained cells on slides for cytology analysis. How did you develop the design? Did you try any other visual displays before choosing this one?
JA and EB: The animation is a direct result of gradient descent, which is what we used to optimise the t-SNE cost function. t-SNE, or t-distributed stochastic neighbor embedding, is a visualisation technique that projects high-dimensional vectors (think: spectrograms of snippets of audio) into two dimensions, with the goal of placing similar points in similar parts of the plot.
The cost function for t-SNE measures how well the points in two dimensions represent the information contained in the points in higher dimensions. Gradient descent means we iteratively adjust the placement of the two-dimensional points to reduce that cost, until we reach a (locally) optimal layout.
The points settling into their final configuration means that a local optimum has been found: the two-dimensional positions sufficiently represent the higher-dimensional information in the spectrograms. Audio samples that sound the same will have similar high-dimensional spectrograms, and will end up in similar parts of the map.
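The settling behaviour described above can be sketched in a few lines. The toy below is not t-SNE itself: it minimises a simpler distance-matching (multidimensional-scaling-style) cost instead of the true t-SNE objective, and every name and number in it is illustrative rather than taken from the MusicMappr code. But it shows the same mechanism: 2-D points start scattered, and gradient descent nudges them until vectors that were similar in high dimensions sit near each other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for spectrogram feature vectors of 6 audio samples.
high_dim = rng.normal(size=(6, 20))
# Pairwise distances in the high-dimensional space: what we want to preserve.
target = np.linalg.norm(high_dim[:, None] - high_dim[None, :], axis=-1)

# Random initial 2-D positions: the "swarm" starts scattered.
pos = rng.normal(size=(6, 2))

def cost(p):
    """How badly the 2-D pairwise distances mismatch the target distances."""
    d = np.linalg.norm(p[:, None] - p[None, :], axis=-1)
    return ((d - target) ** 2).sum()

initial_cost = cost(pos)
lr, eps = 1e-3, 1e-5
for step in range(500):
    # Numerical gradient of the cost w.r.t. each 2-D coordinate.
    grad = np.zeros_like(pos)
    for i in range(pos.shape[0]):
        for j in range(2):
            bumped = pos.copy()
            bumped[i, j] += eps
            grad[i, j] = (cost(bumped) - cost(pos)) / eps
    pos -= lr * grad  # each step, the points drift toward a local optimum
```

When the updates stop changing the positions, a local optimum has been reached, which is exactly the moment the animated points appear to settle.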
Inspired by Patatap, we wanted to let the user play clusters when the optimisation was finished. We used a simple clustering algorithm (k-means) with 26 clusters, one for each letter key on the keyboard. For the interface, we also took heavy inspiration from Andrej Karpathy’s tsne.js implementation and visualisation.
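A minimal k-means (Lloyd’s algorithm) sketch in the spirit of that clustering step, with 26 clusters mapped to the letters a–z, might look like this. The data and function names are hypothetical stand-ins, not the MusicMappr implementation (which runs in the browser):

```python
import string
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-centre assignment
    and centre updates until the layout stabilises."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centre.
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=-1)
        labels = np.argmin(dists, axis=1)
        # Move each centre to the mean of its assigned points.
        for c in range(k):
            members = points[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers, labels

rng = np.random.default_rng(1)
positions = rng.normal(size=(100, 2))   # stand-in for the 2-D t-SNE output
centers, labels = kmeans(positions, k=26)

# Cluster index -> keyboard letter, so typing 'a' would play cluster 0, etc.
key_for_cluster = dict(enumerate(string.ascii_lowercase))
```

Typing a letter then just means looking up that cluster and triggering playback of its member samples.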
Is there a logic to how the clusters are distributed in the interface window? Does it mean anything if they are higher/lower, further to the left or right? And what about the distance between each dot and the centre of the cluster?
JA and EB: The cost function of the t-SNE algorithm we described is non-convex, which means it is very difficult to find the best placement of the points. Instead, each time the algorithm is run, we get a new set of sufficiently optimal two-dimensional positions for the samples. How to modify the t-SNE algorithm to yield consistent visualisations is an interesting research question.
What artists do you listen to who use a lot of sampling in their work? The Prosthetic Knowledge blog, where I first found out about your work, mentioned that MusicMappr could help users create samples in the style of J Dilla, and your research paper uses a song by Swedish House Mafia featuring Pharrell Williams in an example illustration of a song mapped into clusters. Did you have any artists, or production techniques in mind when you began this project?
JA: I love J Dilla, Knxwledge, Reggie Watts, Jai Paul, Nicolas Jaar, Ariel Pink, Lxury, George Clanton (Mirror Kisses, ESPRIT 空想), Bisaillon, motion_correct, MF Doom and Jay Z. Vaporwave, old school hiphop, live looping and the recent wave of melodic trap music also come to mind.
I tried this out with a number of tracks, and noticed more discontinuity within clusters where there was a vocal element. When producers choose samples without the help of a tool like this, they will often start and end the samples around complete vocal sounds, words or phrases, but MusicMappr’s samples often stop halfway through a syllable. Even if samples within a cluster are equally similar as far as spectral analysis goes, they sound more different to the ear because of these vocal cues.
Have you tested this with any users? Is this just a perception issue? Are there any other areas where clusters that contain mathematically similar samples seem more similar or more different to the listener?
JA and EB: This is a major issue and we’re glad you brought it up! It’s called the ‘semantic gap’: the mismatch between the information we get from processing audio (like computing spectrograms) and the perception of listeners (does this sound different to a human?).
Audio with similar spectrograms can sound wildly different to a listener and drive the app berserk, especially with vocal cues as you point out. We tried circumventing this a little bit by using mel-frequency spectrograms. The mel-scale aligns better with human perception than other frequency scales.
Other examples of mathematically similar samples that are perceived as different would be bass drums and bass notes. A cool extension would be to filter out the vocals before doing the spectrogram analysis and clustering.
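The mel scale mentioned above maps frequency in Hz to perceived pitch so that equal steps in mels feel roughly equally spaced to a listener. A common closed-form version of the conversion (the formula used by HTK and many audio libraries) is easy to sketch:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Frequency in Hz -> mels (O'Shaughnessy/HTK formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz) / 700.0)

def mel_to_hz(m):
    """Inverse mapping: mels -> frequency in Hz."""
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

# Equal mel steps correspond to ever-wider Hz steps: the gap between
# 100 Hz and 200 Hz matters far more perceptually than the gap between
# 10,000 Hz and 10,100 Hz.
print(hz_to_mel(1000.0))  # ~1000: 1000 Hz sits near 1000 mels by design
```

A mel spectrogram is then just an ordinary spectrogram whose frequency bins are pooled into bands spaced evenly on this scale, so more resolution is spent where human hearing is most discriminating.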
The duration of each sample sounds like exactly the time frame of a CD skip. Is this deliberate? And what do you think about music that deliberately builds on sampling at this timescale to play on nostalgia for the CD format?
(I am specifically thinking of genres like vaporwave, where long-duration repetition of glitches that sound like CD breakdown is part of an aesthetic that plays on a warped version of late 1980s/early 90s corporate, smooth mall jazz sounds.) Did you consider any of this when setting the sample length in MusicMappr?
JA and EB: We looove vaporwave! This was a practicality rather than a matter of taste. We tested beat detection methods to chop up songs on-beat but these weren’t robust. They messed up for classical music, jazz, and other songs where the location of the downbeat can be unclear even to well-trained musicians.
If a user found samples that they liked, how could they record them to use elsewhere?
JA and EB: We wanted to build in this feature but didn’t have time before the NIME conference deadline. But the code is open-source and the details are in our paper, so we’d love to see you develop this!
Could musicians use this in a live performance setting?
JA and EB: Definitely, but we wouldn’t trust it for robust performance. We’ve thought about building sample recording into MusicMappr: you could sing something or record yourself playing an instrument, then sample and remix what you just played in the app.
The code for MusicMappr is posted to GitHub, the paper is open access, and Jaan and Ethan are open to suggestions for new features. If you have any ideas or questions about the app or its implementation, they can be contacted by email at firstname.lastname@example.org or email@example.com