MusicLM and AudioLM: Google’s Text-to-Music and Audio Tools


MusicLM and AudioLM are two next-generation generative models being developed at Google: one turns text into music, and the other generates audio. Google researchers announced MusicLM, a text-to-music generator that can produce music from text descriptions such as “a calming piano backed by a distorted violin.” It builds on the earlier AudioLM model. MusicLM can also transform a hummed melody into a different musical style and output music lasting several minutes.

Neither MusicLM nor AudioLM is currently available to the general public, but this article discusses both models and how they work.

What is MusicLM

Google researchers have built an AI model, MusicLM, that can generate minutes-long musical pieces from text prompts and can even transform a whistled or hummed melody into other instruments. It was trained on a dataset of over 280,000 hours of music. MusicLM answers a text-based query only in the form of music. What’s even more interesting is that it can take the text description of a painting and create music that fits the picture.

It can create music in many genres, much as an experienced music producer could. However, unlike a human producer, who might be familiar with only a few instruments and musical forms, Google’s MusicLM can create short, medium, and long-form music in almost any genre. Examples include relaxing jazz, melodic techno, “Bella Ciao” in hummed, whistled, and a cappella chorus forms, and music generated from the description of an artwork.

MusicLM covers a wide range of genres and styles, including 8-bit, big beat, British indie rock, folk, reggae, hip hop, motivational music, electronic songs, music for sports, high-fidelity music, pop songs, and Peruvian punk.

Google has shared sample clips generated by MusicLM in all of these genres, including soundtracks in the style of arcade games. It can create music like a beginner music producer, and it can also create coherent songs like a professional. All you have to do is state your requirements in the text description, such as the instruments and the experience level you want the performance to reflect, to help MusicLM produce the style of music or tune you are looking for. It can also produce several variations, giving the user a range of options.

The examples are impressive. There are 30-second snippets of what sound like actual songs, created from paragraph-long descriptions that prescribe a genre, vibe, and even specific instruments, as well as five-minute-long pieces generated from one or two words like “melodic techno.” MusicLM can even simulate human vocals, and while it seems to get the tone and overall sound of voices right, there’s a quality to them that’s definitely off: they sound grainy and off-tone. A lot of the time the lyrics are nonsense, but in a way that you may not catch if you’re not paying attention.

Intuitively, AI tools like MusicLM that lower the barrier to creating music should mean a bigger payday for music platforms. Easier music creation would mean more music creators, and more music bringing in more listeners should then translate into more revenue. This logic is plausible. However, it could also turn out to be flawed.

The growth of text-to-music AI tools could give rise to “generative recommender algorithms”. Think of music streaming services powered by algorithms that generate music on the fly and recommend it to you based on your interests, much as if TikTok automatically generated new videos for you instead of only recommending existing ones.

This could create one direct problem: less reliance on the traditional music streaming model. Music streaming services would then have to adapt or become less relevant. Much as stock image sites are currently responding to the rise of AI art, music streaming platforms would be better protected if they took the initiative and hosted these generative recommender algorithms themselves.

What is AudioLM

Google’s research group has launched AudioLM, a framework for producing high-quality audio that maintains consistency across time. To do this, it begins with a recording that is just a few seconds long and is capable of extending it naturally and logically. Generating realistic audio requires modeling information represented at different scales. For example, just as music builds complex musical phrases from individual notes, speech combines temporally local structures, such as phonemes or syllables, into words and sentences.

Creating well-structured and coherent audio sequences at all these scales is a challenge that has previously been addressed by coupling audio with transcriptions that guide the generative process. These can be text transcripts for text-to-speech or MIDI files for music. The key intuition behind AudioLM is to leverage advances in language modeling to generate audio without being trained on annotated data.
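
To make the language-modeling idea concrete, here is a deliberately tiny stand-in. It is not AudioLM: it assumes uniform quantization and a bigram count model, where the real system relies on learned audio tokenizers and much larger neural models. All function names are hypothetical. What it does show is the general pattern of turning audio into discrete tokens, learning which token tends to follow which, and extending a short prompt one token at a time, with no transcripts involved.

```python
import math
import random
from collections import Counter, defaultdict

def quantize(samples, levels=16):
    """Map samples in [-1, 1] to integer tokens in [0, levels - 1]."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))
        tokens.append(min(levels - 1, int((x + 1.0) / 2.0 * levels)))
    return tokens

def fit_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def continue_sequence(counts, prompt, steps, rng):
    """Extend the prompt autoregressively by sampling from the bigram counts."""
    out = list(prompt)
    for _ in range(steps):
        followers = counts.get(out[-1])
        if not followers:  # token never seen with a successor: repeat it
            out.append(out[-1])
            continue
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return out

# "Training audio": a sine wave standing in for real recordings.
wave = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
tokens = quantize(wave)
model = fit_bigram(tokens)

# Use the first 20 tokens as the prompt and generate 30 more.
rng = random.Random(0)
prompt = tokens[:20]
continuation = continue_sequence(model, prompt, steps=30, rng=rng)
print(continuation)
```

A bigram model only looks one token back, so it cannot capture the long-range structure that the article describes; that gap is what larger language models are meant to close.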

There are some challenges, though, when moving from text to audio. Two of them are listed below:

  • First, one must cope with the fact that the data rate for audio is significantly higher, which leads to much longer sequences. A written sentence can be represented by a few dozen characters, while its audio counterpart typically contains hundreds of thousands of values.
  • Second, there is a one-to-many relationship between text and audio. This means that the same sentence can be rendered by different speakers with different speaking styles, emotional content and recording conditions.
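
The first point can be checked with some quick arithmetic. The sketch below compares the two sequence lengths for one sentence; the speaking time and sample rate are illustrative assumptions, not figures from Google.

```python
# Rough comparison of sequence lengths for text vs. raw audio.
# The speaking time and sample rate below are illustrative assumptions.

sentence = "A calming piano backed by a distorted violin."
speaking_time_s = 3       # assumed time needed to say the sentence, in seconds
sample_rate_hz = 44_100   # assumed sample rate (CD-quality audio)

num_characters = len(sentence)
num_samples = speaking_time_s * sample_rate_hz
ratio = num_samples / num_characters

print(f"text length : {num_characters} characters")
print(f"audio length: {num_samples} samples")
print(f"audio is roughly {ratio:,.0f}x longer as a sequence")
```

Even at lower sample rates the gap stays at several orders of magnitude, which is why a model working on audio has to handle far longer sequences than one working on text.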

The most impressive aspect of AudioLM is that it generates audio without having been trained on transcripts or annotations, and yet the speech it creates is syntactically and semantically plausible. Furthermore, it preserves the speaker’s identity and prosody to the point that the listener is unable to determine which part of the audio is genuine and which was created by artificial intelligence.

The results are striking. AudioLM can not only mimic articulation, pitch, timbre, and intensity, but can also reproduce the sound of the speaker’s breath and form understandable phrases. If the prompt is not a studio recording but one with background noise, AudioLM mimics the noise to ensure continuity. You can listen to some audio samples on the AudioLM website.

MusicLM PyTorch

Even though MusicLM is not yet available to the public, that has not stopped some people from attempting to recreate it in PyTorch. PyTorch is a machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open-source software released under the modified BSD license.

The code for MusicLM has not been released as of now; however, code for AudioLM is available. So, in order to try to replicate MusicLM, the project’s authors are using a text-conditioned version of AudioLM together with a contrastively learned model called MuLan. MuLan was a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly associated, free-form text annotations.

Below is some code from the project showing MuLan being trained:

import torch
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer

# audio tower: a transformer over spectrograms of the raw waveform
audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

# text tower: a transformer over token ids
text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

# joint model that ties the two towers together
mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

# get a ton of <sound, text> pairs and train
# (random tensors stand in for real data here)

wavs = torch.randn(2, 1024)
texts = torch.randint(0, 20000, (2, 256))

loss = mulan(wavs, texts)
loss.backward()

# after much training, you can embed sounds and text into a joint embedding space
# for conditioning the audio LM

embeds = mulan.get_audio_latents(wavs)  # during training

embeds = mulan.get_text_latents(texts)  # during inference
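
To give a feel for what training a two-tower joint embedding model can involve, the following sketch computes a contrastive loss in plain Python. The article says only that MuLan is contrastively learned; the specific objective here (a symmetric softmax cross-entropy over cosine similarities), the temperature value, and all function names are assumptions for illustration, not the project's confirmed implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(audio_embeds, text_embeds, temperature=0.1):
    """Symmetric cross-entropy over the audio-text similarity matrix.

    Row i of audio_embeds and row i of text_embeds form a matching pair;
    every other combination in the batch is treated as a negative.
    """
    audio = [normalize(a) for a in audio_embeds]
    text = [normalize(t) for t in text_embeds]
    n = len(audio)
    sims = [[dot(a, t) / temperature for t in text] for a in audio]

    audio_to_text = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    text_to_audio = sum(
        cross_entropy_row([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return (audio_to_text + text_to_audio) / 2

# Hand-made embeddings: each audio vector points roughly the same way
# as its matching text vector.
audio_embeds = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
aligned_text = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 1.1]]
shuffled_text = [aligned_text[1], aligned_text[2], aligned_text[0]]

aligned_loss = contrastive_loss(audio_embeds, aligned_text)
shuffled_loss = contrastive_loss(audio_embeds, shuffled_text)
print(aligned_loss, shuffled_loss)
```

The loss is lower when matching audio and text embeddings point in the same direction than when the pairs are shuffled, which is the property that lets a text embedding stand in for an audio embedding at inference time.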

If you want to help build this MusicLM replication or see how far the project has come, visit its GitHub repository.

MusicLM and AudioLM Architecture 

Figure showing part of MusicLM’s process, which involves SoundStream, w2v-BERT, and MuLan.

A figure explaining the “hierarchical sequence-to-sequence modeling task” that the researchers use along with AudioLM, another Google project. Source – Google.


Google is being more cautious with MusicLM than some of its competitors may be with their own music generators, as it has been with its earlier work on this kind of AI. The researchers have stated that there are no plans to release the model at this point. You may wonder why, when tools such as AI art generators are already publicly available. The reason is the risk of misappropriation. One concern is that the model could produce music that infringes copyright. Another is that it could begin putting songwriters out of business, since it is good at coming up with creative content.

During an experiment, Google found that about 1% of the music the system generated was directly replicated from the training dataset. Google is apparently not yet satisfied with the model in this respect. Assuming MusicLM or a system like it is one day made available, it seems inevitable that major legal issues will come to the fore. For now, Google does not seem to want to deal with these issues and is keeping MusicLM out of the hands of the public.


MusicLM. Accessed 6 Feb. 2023.

Wiggers, Kyle. “Google Created an AI That Can Generate Music from Text Descriptions, but Won’t Release It • TechCrunch.” TechCrunch, 27 Jan. 2023, Accessed 6 Feb. 2023.