You can listen to samples from the project here: SoundCloud: insearchofconvergence
Title: In Search of Convergence
Categories: AI/ML, Performance, Music
Abstract:
This project explores how AI may be used as a tool for creating a live musical performance. Unlike existing systems, whose goal is often to reconstruct human music as closely as possible, this project considers how AI can be used to craft an entirely new type of performance by creating a feedback loop in which AI and humans listen to, and in turn perform based on, each other's generated audio.
Existing models are usually built with the goal of replicating existing audio, or at least existing styles of audio, as accurately as possible. These models, intentionally or not, are positioned, and often act, as replacements for human artists. Technology made to replace human artists is not a new concept in music.
For example, the Roland TB-303 is a synthesizer originally created to replace bassists in the studio. But creative and unintended use of this instrument gave birth to acid house and, arguably, to electronic music as a whole.
I propose using generative AI in a similar way: repurposing a tool meant to replace humans into a tool that complements human performance. Widening our perspective on how these tools can be used can lead to new kinds of music-making.
On the technical side, this project explores the problem of conditional stem generation. In the field of music information retrieval, conditional stem generation (generating a missing stem x_n given a song's remaining stems, i.e., all of its stems except x_n) remains unsolved. This project proposes an open-source, performance-oriented method of approaching this problem.
Thesis Statement:
How can AI be used as a tool for live musical performance? Specifically, how can we imagine using AI in a way that complements, rather than replaces, musicians? This project uses the technical problem of conditional stem generation to explore this question.
Technical Details:
This project required constructing a robust data pipeline. The pipeline consisted of several tasks; the main ones are described below:
Convert a song to a set of stems. A stem can be defined as a single instrument’s part of a song. For example, a rock song may have vocal, guitar, bass and drum stems. This alone is a difficult task for many instruments. This process must be done as accurately as possible - data quality is very important for the following steps to work. I use both AI-enabled tools (Spleeter, Demucs) and more traditional algorithmic processes for isolating the stems.
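As a rough illustration, here is a minimal sketch of the separation step using Spleeter's pretrained 4-stem model. The file paths and output layout are examples rather than the project's actual pipeline; Demucs offers a comparable command-line workflow.

```python
# Minimal sketch of the stem-separation step using Spleeter's 4-stem model.
# Paths are illustrative; the real pipeline processes many songs and checks output quality.
from spleeter.separator import Separator

def split_song_to_stems(song_path: str, out_dir: str) -> None:
    # "spleeter:4stems" separates a mix into vocals, drums, bass, and other.
    separator = Separator("spleeter:4stems")
    # Writes one WAV per stem into a subfolder of out_dir named after the input file.
    separator.separate_to_file(song_path, out_dir)

if __name__ == "__main__":
    split_song_to_stems("songs/demo_track.wav", "stems/")
```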
Conditional stem generation. If you would like the AI to generate a guitar part, for example, a model must predict a guitar melody based on the rest of the song (e.g., the vocal, bass, and drum parts). I use MusicGen, both base and fine-tuned versions, for this task. This requires significant experimentation with prompt engineering (both text and audio prompts), model training, fine-tuning, and hyperparameter tuning.
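Below is a hedged sketch of what this conditioning step can look like using MusicGen's melody-conditioned checkpoint via the audiocraft library. The prompt text, file names, and the choice to condition on a mixdown of the remaining stems are assumptions for illustration, not the exact configuration used in the project.

```python
# Sketch: generate a guitar stem conditioned on a mix of the other stems.
# Prompt text, paths, and duration are illustrative assumptions.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=15)  # seconds of audio to generate

# Mixdown of the remaining stems (e.g., vocals + bass + drums), prepared upstream.
backing, sr = torchaudio.load("stems/demo_track/backing_mix.wav")

# Condition on a text description plus the backing audio (used as melody/chroma conditioning).
wav = model.generate_with_chroma(
    descriptions=["clean electric guitar line that fits the backing track"],
    melody_wavs=backing[None],  # add a batch dimension: [1, channels, samples]
    melody_sample_rate=sr,
)

# Write the generated stem, loudness-normalized, to generated/guitar_stem.wav.
audio_write("generated/guitar_stem", wav[0].cpu(), model.sample_rate, strategy="loudness")
```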
Variation generation. An additional step is added to create variations of the generated stems. This is accomplished with VampNet, a model that is specifically trained to create melodic and rhythmic variation. Creating audio of sufficient quality again requires significant experimentation in audio-prompt engineering and hyperparameter tuning.
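The sketch below shows only the shape of this variation loop: several variations per generated stem at increasing "drift" levels. The `vampnet_variation` helper is hypothetical and stands in for the actual VampNet interface calls (checkpoint loading, token masking, and resynthesis), which depend on the VampNet version and checkpoints in use.

```python
# Shape of the variation step. `vampnet_variation` is a hypothetical placeholder
# standing in for VampNet's encode -> mask -> regenerate -> decode workflow.
from pathlib import Path

import numpy as np
import soundfile as sf

def vampnet_variation(stem: np.ndarray, sr: int, intensity: float) -> np.ndarray:
    """Placeholder: in practice this would mask a fraction of the stem's token
    sequence (more masking at higher `intensity`) and let VampNet regenerate it.
    Returns the input unchanged here so the sketch stays runnable."""
    return stem

stem, sr = sf.read("generated/guitar_stem.wav")
Path("variations").mkdir(exist_ok=True)
for i, intensity in enumerate([0.25, 0.5, 0.75]):
    variation = vampnet_variation(stem, sr, intensity)
    sf.write(f"variations/guitar_var_{i}.wav", variation, sr)
```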
A significant amount of other technical work is required: the data-intensive nature of this project calls for a robust pipeline for handling, analyzing, storing, and retrieving audio. Audio engineering work, done in various DAWs (Ableton, Logic, Reaper) and via various Python libraries (librosa, torchaudio), is also required.
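As one example of the library-level work, here is a small sketch of the analysis and bookkeeping side of the pipeline using librosa. The folder layout and metadata fields are illustrative assumptions, not the project's actual schema.

```python
# Sketch: compute simple per-stem metadata (tempo, beats, loudness) and store it as JSON.
import json
from pathlib import Path

import librosa

def analyze_stem(path: Path) -> dict:
    # Load at the file's native sample rate, downmixed to mono for analysis.
    y, sr = librosa.load(path, sr=None, mono=True)
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
    return {
        "file": path.name,
        "sample_rate": sr,
        "duration_s": round(len(y) / sr, 2),
        "tempo_bpm": float(tempo),
        "n_beats": int(len(beats)),
        "mean_rms": float(librosa.feature.rms(y=y).mean()),
    }

metadata = [analyze_stem(p) for p in sorted(Path("stems/").rglob("*.wav"))]
Path("stems/metadata.json").write_text(json.dumps(metadata, indent=2))
```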
Research / Context
This project involved both creative and technical research. Creatively, I was interested in experimental kinds of music performance, mostly those using AI. Most performances that use AI seem to do so in a very controlled way: model output is prerecorded and is, in some way, reviewed, edited, and approved before use in a live performance. I was curious about what happens when going a step further, removing those constraints and letting the AI perform without human intervention.
It was also important to me that the AI generated music was used as an input to a performance by human musicians. Generating music that sounds “good”, unconditionally, is no longer really a difficult task. But this music often sounds generic, and it feels trivial to arbitrarily generate music and incorporate it into a performance. Generating music conditional on text is a harder problem, and generating music conditional on music is even harder. I couldn’t really find any examples of people trying to do this, and I was curious to see what happened!
I also had become very interested in the technical problem of conditional stem generation. It is currently unsolved (at least, no open source solution exists). I felt, and still strongly feel, that even an approximate solution to this problem would create huge opportunities for interesting and exciting musical performance.
This project involved collaborating with audio engineers to explore alternative methods of composition using the AI-generated stems as a base. It also involved collaborating with musicians to record demo tracks in a studio. These experiments gave me valuable information for fine-tuning my process and preparing for a full performance.