MusicLM and AudioLM

Introduction

MusicLM and AudioLM solved a problem that text-to-image models had punted on for years. They made raw audio behave like a language model that responds to plain English. Google published the MusicLM paper in January 2023 with a 280,000 hour training corpus. It also released the 5,521 example MusicCaps benchmark that any researcher can still download today. The launch turned a niche piece of audio research into a public AI Test Kitchen demo by May 2023. Those same ideas now power the Lyria 3 music model that ships inside the Gemini API in 2026. This guide explains how MusicLM and AudioLM fit together, what MuLan does, and how to use the stack today.

Quick Answers on MusicLM and AudioLM

What is the difference between MusicLM and AudioLM?

AudioLM is a general audio language model that predicts the next token of any sound. MusicLM stacks text conditioning on top of AudioLM so a written prompt can steer the music output.

Is AudioLM text-conditioned out of the box?

No. AudioLM alone is unconditional and continues an audio prompt. Text conditioning enters through MusicLM, which uses MuLan embeddings to turn a caption into audio-style tokens AudioLM can decode.

How can I use MusicLM and AudioLM today in 2026?

The MusicLM research demo is retired. The same lineage ships as Lyria 3 inside the Gemini app, AI Studio, and the Gemini API, with a free tier and a paid Lyria 3 Pro endpoint.

Key Takeaways for Anyone Trying MusicLM

MusicLM and AudioLM are stacked, not separate. AudioLM is the audio token language model and MusicLM is the text-to-music wrapper.
The training corpus is 5 million unlabeled clips totaling 280,000 hours. MusicCaps is the small, captioned evaluation set used in the paper.
The public face today is Lyria 3, not the original MusicLM demo. Google folded the AI Test Kitchen experiment into Gemini.
Memorization is small but real. Google found that only a tiny fraction of examples were reproduced exactly, with approximate matches for roughly one percent of the examples it tested.

Introduction
Quick Answers on MusicLM and AudioLM
Key Takeaways for Anyone Trying MusicLM
Understanding MusicLM and AudioLM at a Glance
What MusicLM and AudioLM Actually Are
How AudioLM Models Sound as Language
How MusicLM Adds Text Conditioning on Top of AudioLM
The MuLan Joint Music-Text Embedding Explained
Training Data: MusicCaps, AudioSet, and the 280,000-Hour Corpus
How to Access MusicLM and Lyria 3 in 2026
From MusicLM to Lyria: Google’s Music AI Roadmap
How MusicLM Compares With Suno, Udio, and Stable Audio
Implementation Notes for Developers Using the Lyria API
How to Generate Your First Track With MusicLM
- Step 1 – Create a Gemini API key
- Step 2 – Install the Gemini Python SDK
- Step 3 – Call the Lyria 3 endpoint
- Step 4 – Use reference audio for style transfer
- Step 5 – Verify the SynthID watermark
- Step 6 – Stream long generations with Lyria RealTime
Key Insights From Building With MusicLM and AudioLM
Real-World Examples of MusicLM in Use
- YouTube Shorts Lyria-powered backing tracks
- MusicLM AI Test Kitchen public demo
- AI Test Kitchen story-mode multi-prompt sessions
Case Studies of Google’s Music AI in Production
- Case Study: Universal Music Group licensing deal for Lyria
- Case Study: DeepMind Lyria RealTime in interactive demos
- Case Study: SynthID Audio rollout against streaming bots
Risks, Copyright, and the Memorization Problem
Ethics, Musician Compensation, and Industry Pushback
The Future of Text-to-Music After Lyria 3
Frequently Asked Questions About MusicLM and AudioLM

Understanding MusicLM and AudioLM at a Glance

MusicLM and AudioLM are paired Google models that generate high-fidelity audio from a caption or an audio prompt, with MuLan bridging text to music. Together they form the research foundation for the Lyria family of music generators in 2026.

MusicLM Stack Explorer (interactive widget, not reproduced here)

The widget estimated the token budget for a chosen prompt style and a track length between 5 and 300 seconds. For a 60-second track it showed:

MuLan text tokens: 128 per prompt
Semantic tokens (w2v-BERT): 1,500 at 25 Hz
Coarse acoustic tokens: 3,000 at 50 Hz
Fine acoustic tokens: 36,000 at 600 Hz

Source: AudioLM paper (arXiv 2209.03143) and MusicLM paper (arXiv 2301.11325). Token rates approximate SoundStream and w2v-BERT defaults. Widget by aiplusinfo.com.

What MusicLM and AudioLM Actually Are

MusicLM and AudioLM are paired Google Research models built between 2022 and 2023. The easiest way to keep them straight is to think of AudioLM as the language model and MusicLM as the writer who hands AudioLM a topic sentence. AudioLM was published first in September 2022 as a way to predict the next slice of any audio waveform using only audio tokens. MusicLM followed in January 2023 and added the missing piece, which was text conditioning powered by a separate model called MuLan. Read together, the two papers form a single text-to-music architecture rather than two competing systems. The split matters because press coverage often blurs them into one model and misses why the design landed where it did. Practitioners need to keep the layering clear to understand what each component contributes.

The architectural separation also explains the audio quality jump in the MusicLM samples. AudioLM had already proven that a token-based language model could generate coherent piano music for minutes at a time. It maintained harmonic structure and recognizable speaker identity in speech without any transcript. MusicLM did not have to reinvent that decoding pipeline at all. The new work focused on taking a caption like a dinner party prompt and producing audio tokens AudioLM already knew how to extend. That is the role of the MuLan joint embedding, which lets a sentence and a piece of music share one vector space. Without MuLan, MusicLM would be unable to honor a written prompt.

For practitioners trying to keep up with the rise of AI music generators, the takeaway is that MusicLM and AudioLM are a stack, not a competition. Newer Google models like Lyria 1, Lyria 2, and Lyria 3 are direct descendants of this stack. They all use the same core pattern of a text encoder feeding an audio language model decoder. The implementation details have shifted with each release, but the high-level picture is unchanged today. If you have read the original MusicLM paper, you already know most of what Lyria 3 does under the hood. That is why the 2023 papers remain the right starting point in 2026.

How AudioLM Models Sound as Language

Beyond the brand names, the actual trick that AudioLM pulls off is treating audio the same way a text model treats words. The AudioLM paper introduces a token hierarchy with three rungs. Each rung is modeled by its own Transformer decoder trained on its own slice of the data. The first rung carries semantic tokens that capture high-level structure. The second rung carries coarse acoustic tokens for instrument timbre and rhythm. The third rung carries fine acoustic tokens for the high-frequency detail like attack and decay. Splitting the problem this way is what lets AudioLM generate minute-long performances that still sound like one piece.

The semantic stage runs at about 25 Hz inside AudioLM. That means roughly 1,500 tokens for a one-minute clip, which is small enough for a standard Transformer to model. The coarse acoustic stage operates at 50 Hz and gives the model enough room to commit to specific instrument timbres. The fine acoustic stage emits hundreds of tokens per second across its quantizers and reconstructs the high-frequency detail. Each stage is conditioned on the output of the previous stage in strict coarse-to-fine order: semantic tokens first, then coarse acoustic tokens, then fine acoustic tokens. That layered structure is what gives AudioLM its stability over long generation windows. It is also what makes the model expensive to train, since each rung needs its own dedicated weights.
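
The arithmetic above is easy to sanity-check. The sketch below multiplies each stage's token rate by the clip length. The 25 Hz and 50 Hz figures come from the text; the 600 tokens-per-second fine rate is an assumed aggregate standing in for "hundreds", and the function name `token_budget` is hypothetical.

```python
# Token rates per second of audio. The semantic and coarse figures follow the
# text above; FINE_HZ is an assumed aggregate rate, not a figure from the paper.
SEMANTIC_HZ = 25
COARSE_HZ = 50
FINE_HZ = 600  # assumption for illustration only


def token_budget(seconds: float) -> dict[str, int]:
    """Return the approximate number of tokens each stage must predict."""
    return {
        "semantic": round(SEMANTIC_HZ * seconds),
        "coarse_acoustic": round(COARSE_HZ * seconds),
        "fine_acoustic": round(FINE_HZ * seconds),
    }


if __name__ == "__main__":
    for length in (5, 60, 300):
        print(length, token_budget(length))
```

The point of the exercise is the ratio between rungs: the fine stage carries most of the tokens, which is why it is the natural place to look for latency savings.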

This hierarchy is also why AudioLM is not text-conditioned out of the box. The model sees only audio tokens during training and inference. To extend a piece, you give it an audio prompt and it continues the sequence one token at a time. To get text in, you need a separate encoder that produces tokens in a compatible space. That separate encoder is MuLan, which closes the loop between caption and audio. Builders who have studied how neural networks underpin Transformers will recognize the pattern as a standard encoder-decoder split. The novelty in MusicLM and AudioLM is applying that split to raw audio at scale.
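
To make "continues the sequence one token at a time" concrete, here is a toy sketch. It swaps the Transformer for a simple bigram count table, so it shows only the shape of autoregressive continuation, not AudioLM's actual model. The function names and the integer "tokens" are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_bigram(tokens):
    """Count which token follows which in the training sequence."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table


def continue_sequence(table, prompt, steps, seed=0):
    """Extend `prompt` one token at a time by sampling from the bigram counts."""
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(steps):
        options = table.get(out[-1])
        if not options:  # unseen context: stop early
            break
        choices, weights = zip(*options.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return out


if __name__ == "__main__":
    table = fit_bigram([1, 2, 3, 1, 2, 3])
    print(continue_sequence(table, prompt=[1], steps=4))
```

Notice that there is no slot for text anywhere in this loop. The only input is a prefix of tokens, which is exactly why a separate encoder is needed to get a caption in.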

How MusicLM Adds Text Conditioning on Top of AudioLM

Building on the AudioLM hierarchy, MusicLM solves the bridge between a written prompt and AudioLM’s audio tokens. The MusicLM paper describes a three-stage hierarchical sequence-to-sequence pipeline that drives the generation end to end. The first stage maps a text caption to MuLan audio tokens via the joint embedding space. The second stage maps those tokens to semantic AudioLM tokens for long-range structure. The third stage maps semantic tokens to acoustic SoundStream tokens, which the codec then decodes back into a waveform. The system is trained on roughly 5 million unlabeled audio clips totaling 280,000 hours. The end result is a system that responds to free-form captions with audio that follows both genre and instrumentation cues.

From there, the implementation question becomes how MuLan creates the connection between text and music tokens. During training, MuLan saw 44 million music tracks paired with weakly aligned text annotations. The text included titles, hashtags, playlist names, and short descriptions scraped from public sources. It learned to project both modalities into the same 128-dimensional space using a contrastive objective. The MusicLM authors then froze MuLan and reused it as a text-to-music adapter at inference time. The caption goes through MuLan’s text tower, producing a sequence of audio-style tokens. Those tokens act as a soft prompt to the AudioLM backbone, which is now conditioned on a written description rather than a hummed melody.
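
How does a continuous 128-dimensional embedding become a short sequence of discrete tokens? One way to do it, and the approach MusicLM is generally described as using, is residual vector quantization: each codebook encodes whatever the previous one left over. The sketch below is a minimal pure-Python illustration with tiny hand-written codebooks and a 2-dimensional vector. The codebook values and function names are made up for the example.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec into one token per codebook, each coding the remaining residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected codebook entries to approximate the original vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


if __name__ == "__main__":
    # Toy codebooks: a coarse one, then a finer one for the leftover residual.
    books = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.25, -0.25]]]
    tokens = rvq_encode([1.2, 0.8], books)
    print(tokens, rvq_decode(tokens, books))
```

The useful property is that the token sequence stays short no matter how rich the embedding is, so it can sit at the front of the decoder's context like any other prompt.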

Beyond text prompts, the same machinery also handles non-text inputs. MusicLM can be conditioned on a hummed melody, a whistled phrase, or even an image’s caption. That works because MuLan is symmetric between audio and text inside the embedding space. The system also supports a story mode feature highlighted in the original paper. You feed several time-stamped prompts and the model crossfades between them across a single clip. This is the part of MusicLM that surprised researchers at the time of release. It implied the model had learned a smooth manifold between very different musical descriptions.
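
One simple way to picture story mode is as a schedule of time-stamped prompts with a blend near each switch. The sketch below computes linear blend weights at a given time. It is an illustration of the idea only, not the mechanism in the paper, and the `fade` window and function name are assumptions.

```python
def crossfade_weights(t, schedule, fade=2.0):
    """Return {prompt: weight} at time t for a list of (start_time, prompt) pairs.

    The current prompt has full weight until `fade` seconds before the next
    prompt starts, then hands over linearly.
    """
    schedule = sorted(schedule)
    current = schedule[0][1]
    upcoming = None
    for start, prompt in schedule:
        if start <= t:
            current = prompt
        else:
            upcoming = (start, prompt)
            break
    if upcoming is not None and upcoming[1] != current and t > upcoming[0] - fade:
        w = (t - (upcoming[0] - fade)) / fade
        return {current: 1.0 - w, upcoming[1]: w}
    return {current: 1.0}


if __name__ == "__main__":
    story = [(0, "time to meditate"), (10, "time to wake up")]
    for second in (5, 9, 12):
        print(second, crossfade_weights(second, story))
```

In a real system the weights would blend conditioning embeddings rather than audio, which is why the transition can sound like one evolving piece instead of two clips glued together.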

The trade-off in this layered MusicLM architecture is generation latency at inference time. Each stage produces tokens that feed the next, so the pipeline runs sequentially rather than in parallel. A 30-second clip can take several seconds of accelerator time even on a TPU v4 setup. Lyria 3 narrowed that gap by reducing the number of fine acoustic stages and parallelizing the coarse stage. The core text-conditioning path is still the MuLan-to-semantic-to-acoustic chain that MusicLM introduced. If you study the Lyria 3 system card, you will see the same shape with a different wrapper around it. Builders who follow the debates over how to tell AI music from human music need to understand this chain.

The MuLan Joint Music-Text Embedding Explained

Shifting to the encoder side, MuLan is the single most overlooked component in the public AI music ecosystem. That is a mistake because MuLan is what makes the prompt actually work in practice. MuLan stands for Music-Language pretrained model and was described in a 2022 Google paper. It is a two-tower contrastive model with one tower for audio and one for text. The audio tower processes log-mel spectrograms through a ResNet-style network. The text tower processes captions through a BERT-style encoder for token sequences. Training drives the two outputs toward each other when the caption matches the clip. The shared space is 128 dimensions, which is small for a foundation model and intentionally so.
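
The phrase "training drives the two outputs toward each other" can be made concrete with a contrastive loss. The sketch below is a one-directional, InfoNCE-style loss over a small batch in pure Python: for each audio embedding it measures how confidently the matching text embedding is picked out of the batch. The temperature value and function names are assumptions, and MuLan's exact objective may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(audio_embs, text_embs, temperature=0.1):
    """Average cross-entropy of picking the matching text for each audio clip.

    Row i of audio_embs is assumed to pair with row i of text_embs.
    """
    n = len(audio_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(audio_embs[i], text_embs[j]) / temperature for j in range(n)]
        log_norm = math.log(sum(math.exp(value) for value in logits))
        total += log_norm - logits[i]
    return total / n


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned:", contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]]))
    print("swapped:", contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]]))
```

Aligned pairs produce a loss near zero and swapped pairs a large one, which is the gradient signal that pulls matching captions and clips together in the shared space.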

Beyond the architecture, the training data is where MuLan gets its breadth. Google paired 44 million music recordings, about 370,000 hours of audio, with weakly associated text. That weak text includes track titles, user-generated playlist names, social tags, and short scraped descriptions. None of this required a hand-labeled corpus, which is exactly why MuLan was feasible at scale. The downside is that the resulting embedding inherits all the bias of the source platforms. Western pop, hip hop, and EDM are heavily represented in the data. Classical, traditional, and non-English-language music are thinner and harder for the embedding to reach. Anyone debugging weak MusicLM outputs is usually running into a MuLan coverage gap rather than an AudioLM weakness.

Training Data: MusicCaps, AudioSet, and the 280,000-Hour Corpus

Beyond the architecture, the next question builders ask is where the 280,000 hours actually come from. The MusicLM paper does not publish a clip list or per-source breakdown. Google confirmed in the supplementary material that the unlabeled training set is a large internal collection of music recordings. The publicly available piece is MusicCaps, a 5,521 example evaluation set used in the paper. Each clip is captioned by a professional musician in two formats, a free-form caption and a list of tags. MusicCaps is the benchmark used in every comparison plot in the paper. It is downloadable from Google’s MusicCaps Kaggle dataset page. If you want to evaluate your own text-to-music system, MusicCaps is the standard you need to beat.

Next, the supporting dataset is AudioSet, Google’s older general-audio corpus. AudioSet contains over two million 10-second clips drawn from YouTube and labeled with 632 categories. It was used to pretrain components like w2v-BERT and SoundStream for the AudioLM stack. AudioSet is not a music-only dataset, which is exactly why Google had to build MuLan on top. Researchers studying audio synthesis often combine AudioSet for environmental sounds with MusicCaps for evaluation. That combination is what gives MusicLM the breadth to handle environmental prompts. A street-scene prompt with a saxophone in the distance is exactly the kind of cross-domain audio AudioSet supports.

On top of MusicLM’s original corpus, Google has not republished the dataset details for Lyria 3. DeepMind has confirmed in public statements that the training corpus expanded substantially after 2024. The Magenta team’s public talks suggest the corpus now includes licensed catalog music negotiated with major labels. The shift toward licensed corpora is partly a response to copyright lawsuits hitting other generative music vendors. It is also a quality strategy, since licensed multitrack stems give the model finer control over instrumentation. If you evaluate MusicLM and AudioLM heritage models for production use today, the training corpus story is the part that has changed the most. The core architecture has barely moved since the original paper, but the corpus and the licensing arrangements have.

How to Access MusicLM and Lyria 3 in 2026

Turning to access, the original MusicLM AI Test Kitchen demo opened in May 2023 and is no longer accepting new prompts. As of June 2026, Google has fully retired the standalone MusicLM web app from public use. The capability now lives inside Lyria 3 across the Gemini product surface. The Lyria 3 model is reachable through three surfaces, depending on how technical you want to be. The Gemini consumer app exposes Lyria 3 as the create music tool for users on a paid plan. Google AI Studio exposes the same model as a free experimental endpoint with rate limits. The Gemini API and Vertex AI expose Lyria 3 as a billable model for production workloads.

Beyond the surfaces, the pricing tiers are concrete and worth knowing before you build. The free AI Studio path lets you generate short clips but throttles after a few requests per minute. The Gemini Plus tier at roughly twenty dollars a month unlocks Lyria 3 Clip for 30-second tracks. The Gemini Pro tier and the metered Gemini API unlock Lyria 3 Pro for 180-second tracks at 48 kHz. The Pro tier also gives access to Lyria RealTime for streaming generation and SynthID watermark detection. Production teams that need stem separation or watermark verification should plan on the Gemini API path directly. The consumer apps work but expose only a fraction of the controls you get through the API.

From MusicLM to Lyria: Google’s Music AI Roadmap

Looking ahead from the original release, the brand has shifted several times under the same research lineage. The roadmap from May 2023 to mid-2026 is mostly a story about scaling and licensing. MusicLM launched in May 2023 as the public AI Test Kitchen demo with a 20-second cap. Lyria, the first DeepMind-branded successor, arrived in November 2023 inside YouTube Shorts as a backing track generator. Lyria 2 followed in 2024 with longer outputs and the first version of stem-level control surfaced to users. Lyria RealTime launched as a streaming API in 2025. It was Google’s response to demand for interactive music generation in apps.

From there, Lyria 3 arrived in late 2025 as the version that consolidated all threads into a single model. The headline shift is that Lyria 3 supports up to 180-second tracks at 48 kHz inside the Gemini ecosystem. It accepts not just text but reference audio, key signature constraints, and BPM lock controls. Every Lyria 3 output ships with a SynthID watermark embedded in the waveform for provenance. Each major release also bumped the training corpus and expanded the supported genres meaningfully. Prompts that returned empty results in MusicLM now return usable tracks in Lyria 3 across most genres. The progression is consistent across the public benchmark scores Google has chosen to publish.

Beyond the headline numbers, the Magenta team has been the consistent thread through this roadmap. Magenta predates MusicLM by years and has shipped DDSP, MusicVAE, and Coconet to the open-source community. The team’s instincts shape the controls Lyria exposes, including the BPM lock and the key-and-mode constraint. The way stems are surfaced to users in Lyria 3 also has Magenta’s fingerprints all over it. If you watched Magenta’s prior work, the Lyria 3 control surface looks immediately familiar. The Magenta blog remains the best public source for understanding what the model can and cannot do today. It is also where the team posts experimental prototypes ahead of the stable Gemini release.

In practice, this roadmap matters for organizations evaluating Nvidia’s Fugatto audio production model against the Lyria line. Google’s pattern is to add controllability and licensing first, then push fidelity once both are stable. The competitive pressure from Suno and Udio has accelerated that pattern across all three vendors. Builders who want a stable production target should standardize on the Gemini API surface for Lyria. The API tends to expose new features months before the consumer Gemini app does, which gives engineering teams a runway. Plan integrations accordingly so you do not have to rebuild against a moving API surface.

How MusicLM Compares With Suno, Udio, and Stable Audio

Turning to competitors, a market has formed around Suno, Udio, and Stability AI’s Stable Audio family of models. The most visible comparison point across these four music vendors is vocal generation capability. MusicLM and AudioLM intentionally avoided clear vocal generation to limit deepfake and lyric copyright risk. Suno and Udio took the opposite stance and built their early traction on songs with full vocal performances. That choice gave them viral hit moments and parallel lawsuits from major record labels in 2024. Stable Audio sits in a middle position with strong instrumental output and limited vocal capability today. That is similar to where Lyria 3 lands in 2026 on the vocal output question.

Beyond the vocal question, the architectural differences across vendors are instructive. Suno and Udio both lean heavily on diffusion-style decoders rather than the pure token language model approach. That choice gives them faster generation but slightly less coherent long-form structure than Lyria. Stable Audio uses latent diffusion with a Variational Autoencoder, which is closer to image-generation lineage. Lyria 3 still uses the AudioLM token hierarchy as its backbone for the core stack. Google has added a parallelized refinement stage that borrows ideas from diffusion. Researchers familiar with how U-Net relates to deep learning will recognize the Stable Audio approach. The architectural choice flows through to what controls each vendor exposes.

On top of architecture, the licensing differences are the biggest practical gap. Lyria 3 trains on licensed major-label catalog music alongside an open corpus base. Suno and Udio’s training data remains the subject of active RIAA litigation in 2026. Stable Audio leans on Stability AI’s licensed and CC-BY corpus to limit exposure. Teams that need a music generator they can ship into a commercial product usually pick Lyria 3. The model quality is comparable across the four for instrumental output today. The legal posture is what actually differs, and it is the most actionable distinction for production. Choose the vendor whose licensing matches your distribution risk tolerance, not just the prompt quality.

Implementation Notes for Developers Using the Lyria API

For teams building on Lyria, the API has stabilized enough in 2026 to plan production work around it. The endpoint lives under the Gemini API and accepts a JSON request body with several fields. The body takes a text prompt, optional reference audio, optional negative prompt, target duration in seconds, and a controllability block. The controllability block exposes BPM, key, mode, and instrumentation constraints to the model. The response is a streamable WAV or MP3 plus a SynthID watermark token for verification. Authentication uses the same API key system as the rest of the Gemini API for simplicity. That shared auth lowers the integration cost for teams already using Gemini for text or vision tasks.
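
It can help to model the request body as a typed structure before wiring up any HTTP or SDK call. The sketch below mirrors the fields listed above as dataclasses and serializes them to JSON. Every field name here (`negative_prompt`, `reference_audio_b64`, `controls`, and so on) is a hypothetical placeholder chosen for the example, not a confirmed part of the Lyria API schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Controls:
    """Controllability block: all fields optional (names are hypothetical)."""
    bpm: Optional[int] = None
    key: Optional[str] = None
    mode: Optional[str] = None
    instrumentation: list[str] = field(default_factory=list)


@dataclass
class MusicRequest:
    """Request body sketch: prompt and duration required, the rest optional."""
    prompt: str
    duration_seconds: int
    negative_prompt: Optional[str] = None
    reference_audio_b64: Optional[str] = None
    controls: Controls = field(default_factory=Controls)

    def to_json(self) -> str:
        """Serialize, dropping unset optional fields to keep the body small."""
        def prune(value):
            if isinstance(value, dict):
                cleaned = {k: prune(v) for k, v in value.items()}
                return {k: v for k, v in cleaned.items() if v not in (None, [], {})}
            return value
        return json.dumps(prune(asdict(self)), sort_keys=True)


if __name__ == "__main__":
    req = MusicRequest(
        prompt="calm piano over soft rain",
        duration_seconds=30,
        controls=Controls(bpm=80, key="A", mode="minor"),
    )
    print(req.to_json())
```

Keeping the shape in one place makes it cheap to rename fields once you have checked them against the official reference.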

Beyond the basic call shape, the non-obvious implementation gotchas matter more than the documented fields. Lyria 3 enforces a global rate limit per project that is stricter than text Gemini calls. Generation runs are billed by output seconds rather than tokens, which is a meaningful pricing change. Prompts that include named artists or copyrighted song titles return a safety refusal and burn quota. Reference audio that exceeds 30 seconds is silently truncated without any warning to the caller. The SynthID watermark cannot be disabled, and heavy lossy edits can degrade detector confidence. Plan production pipelines around these constraints before you commit any infrastructure work. The same constraints apply through Vertex AI for enterprise customers with regional data residency needs.
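
A stricter rate limit usually means wrapping calls in a retry helper. The sketch below shows a generic exponential backoff wrapper. `RateLimited` is a hypothetical stand-in for whatever exception your client raises on throttling, and the delay values are illustrative defaults rather than tuned numbers.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the error a client raises when throttled."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponentially growing waits.

    Waits base_delay, then 2x, 4x, and so on. The final failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimited()
        return "ok"

    print(call_with_backoff(flaky, base_delay=0.01))
```

Retry only on throttling. A safety refusal will fail the same way on every attempt, so retrying it just burns more quota.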

How to Generate Your First Track With MusicLM

Moving on to a concrete walkthrough, the fastest way to learn the MusicLM and AudioLM lineage is to generate something. The steps below assume you want to use the Gemini API path for Lyria 3 in 2026. You can adapt the same flow to AI Studio for a free interactive run when you start, and the same patterns apply to the local LLM install workflow on macOS for hybrid setups.

Step 1 – Create a Gemini API key

Open the Google AI Studio API keys panel and sign in with the Google account that owns your billing project. Create a new API key under the API keys section and copy it to a secure location. The key works for both Lyria 3 and the rest of the Gemini API across modalities. Add the key to a local environment variable named GEMINI_API_KEY so you do not paste it into code. Use a project that has billing enabled if you plan to run Lyria 3 Pro for longer tracks. The free tier caps at Lyria 3 Clip output of 30 seconds, while Pro unlocks 180 seconds. Verify the project’s quota in the Cloud console before you build production code on top. Lyria has stricter quotas than text Gemini, with 60 requests per minute as the default tier.
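
Before making any calls, it is worth confirming the key is actually visible to your process. The helper below is a small sketch that reads `GEMINI_API_KEY` from the environment and fails with a readable message if it is missing. The function name `load_api_key` is just a convention for this example.

```python
import os


def load_api_key(var_name: str = "GEMINI_API_KEY") -> str:
    """Read the key from the environment and fail loudly if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running.")
    return key


if __name__ == "__main__":
    try:
        load_api_key()
        print("API key found.")
    except RuntimeError as err:
        print(err)
```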

Article metadata: "MusicLM and AudioLM" by Sanksshep Mahendra, published by AI Plus Info (https://www.aiplusinfo.com/blog/musiclm-and-audiolm-googles-text-to-music-and-audio-tool/).

Step 2 – Install the Gemini Python SDK

Install the google-genai library, which is the official Python SDK for the Gemini API in 2026. The package is published on PyPI and supports Python 3.9 and later for Lyria 3 calls. Pin a version in your requirements file because the audio API surface still adds new fields each quarter. Confirm the install with a simple import before you continue building any wrapper logic. The SDK ships with synchronous and async clients, both of which support the Lyria endpoints today. If you target serverless deployments, the async client integrates well with FastAPI and aiohttp event loops. Recent SDK releases also include retry-with-backoff helpers tuned to Lyria’s rate limits out of the box. Choose a version newer than 0.8.0 to get the streaming audio helper for Lyria RealTime calls.

pip install --upgrade google-genai
python -c "from google import genai; print(genai.__version__)"

Step 3 – Call the Lyria 3 endpoint

Send a request with a clear text prompt, a duration, and optional controllability fields. Start with a 30-second clip on the Lyria 3 Clip tier so you stay inside the cheaper bucket. Save the response audio bytes to disk and open the file in your audio editor to confirm quality. Iterate on the prompt phrasing rather than retrying the same prompt, which only burns quota. A good prompt names a mood, an instrumentation, a tempo, and a key in fewer than 30 words total. Lyria 3 handles overly generic prompts poorly and returns audio that sounds like a stock library track. Build a small prompt evaluator that scores audio by spectral content and tempo lock to catch drift. Most production teams iterate over 20 to 40 prompts before landing on a stable template per use case.

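# Method and field names follow this guide; confirm them against the current SDK reference.
import base64
import os
from google import genai
# Read the key from the environment instead of hard-coding it.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
# Request a 30-second clip on the cheaper Clip tier.
result = client.models.generate_audio(
    model="lyria-3-clip",
    prompt="calming violin melody backed by a distorted guitar riff, 80 BPM, A minor",
    duration_seconds=30,
)
# The audio arrives base64-encoded, so decode it before writing to disk.
with open("track.wav", "wb") as f:
    f.write(base64.b64decode(result.audio.data))
print("Saved track.wav with SynthID:", result.audio.synthid_token)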
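
The prompt guidance in this step (mood, instrumentation, tempo, key, under 30 words) is easy to enforce with a small template function. The sketch below is one possible way to do that. The function name and the exact phrasing of the template are choices made for the example.

```python
def build_prompt(mood, instruments, bpm, key, max_words=30):
    """Assemble a prompt naming mood, instrumentation, tempo, and key.

    Raises ValueError if the result reaches the word limit.
    """
    prompt = f"{mood} {' and '.join(instruments)}, {bpm} BPM, {key}"
    word_count = len(prompt.split())
    if word_count >= max_words:
        raise ValueError(f"prompt has {word_count} words; keep it under {max_words}")
    return prompt


if __name__ == "__main__":
    print(build_prompt("calming", ["violin melody", "distorted guitar riff"], 80, "A minor"))
```

A fixed template also makes prompt iteration measurable, since you change one slot at a time instead of rewriting the whole sentence.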

Step 4 – Use reference audio for style transfer

Lyria 3 accepts a reference audio file alongside the prompt to steer the output style. That lets you nudge generations toward a target style without naming a specific artist or song. Trim the reference to 30 seconds or less first, since the API silently truncates longer files without warning. The reference influences instrumentation, tempo, and overall mood more than melody contour or chord progression. Keep the prompt focused on what you want changed in the output rather than the base style. Most teams reuse the same reference across hundreds of generations to maintain a consistent product sound. Lyria 3 Pro supports up to three reference files per call, which lets you blend two distinct moods. The blending weight defaults to even and can be tuned through an optional reference_weights array.

import base64
# Reuses the client created in Step 3.
# Encode the reference clip (30 seconds or less) as base64 for the request.
with open("reference.wav", "rb") as f:
    ref = base64.b64encode(f.read()).decode()
result = client.models.generate_audio(
    model="lyria-3-pro",
    prompt="add a driving electronic beat in the second half",
    reference_audio={"mime_type": "audio/wav", "data": ref},
    duration_seconds=120,
)
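
Since longer references are truncated without warning, it is safer to trim them yourself so you know exactly which 30 seconds the model hears. The sketch below does that for uncompressed WAV data using only Python's standard `wave` module. The function name `trim_wav` is hypothetical, and compressed formats such as MP3 would need a different tool.

```python
import io
import wave


def trim_wav(data: bytes, max_seconds: float = 30.0) -> bytes:
    """Return WAV bytes cut to at most max_seconds, keeping the original format."""
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        keep = min(src.getnframes(), int(max_seconds * src.getframerate()))
        frames = src.readframes(keep)
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)  # frame count in the header is corrected on close
        dst.writeframes(frames)
    return out.getvalue()


if __name__ == "__main__":
    # Build 40 seconds of silence in memory, then trim it to 30 seconds.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(b"\x00\x00" * 8000 * 40)
    trimmed = trim_wav(buf.getvalue())
    with wave.open(io.BytesIO(trimmed), "rb") as r:
        print(r.getnframes() / r.getframerate(), "seconds")
```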

Step 5 – Verify the SynthID watermark

Every Lyria 3 output ships with an imperceptible SynthID watermark, and you should verify it before redistributing. Google publishes a detector endpoint that returns a confidence score between 0 and 100 for any audio file. Verification matters because it lets you prove provenance if the file ends up in a copyright dispute later. Build the verification call into your release pipeline rather than treating it as optional cleanup work. The detector is tuned for unmodified Lyria audio and reports 95 percent accuracy on raw outputs. Heavy lossy compression or aggressive mixing reduces detector confidence below 80 percent in many cases. Run verification on every redistributed asset and store the confidence score in your asset metadata. The detector itself is rate limited at 30 calls per minute under the default Gemini project quota.

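import base64
# Reuses the client created in Step 3.
# Read the file inside a with-block so the handle is closed promptly.
with open("track.wav", "rb") as f:
    payload = base64.b64encode(f.read()).decode()
verdict = client.models.detect_synthid(
    audio={"mime_type": "audio/wav", "data": payload}
)
print("SynthID confidence:", verdict.confidence, verdict.label)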
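
To store the score alongside each asset, a small record builder is usually enough. The sketch below turns a confidence value into a metadata entry and flags anything under a review threshold. The 0 to 100 scale and the 80 percent threshold follow the figures in this step; the field names and function name are assumptions for the example.

```python
import json


def provenance_record(asset_id: str, confidence: float, threshold: float = 80.0) -> dict:
    """Build a metadata entry and flag assets whose score falls below threshold."""
    if not 0 <= confidence <= 100:
        raise ValueError("confidence is expected on a 0-100 scale")
    return {
        "asset_id": asset_id,
        "synthid_confidence": confidence,
        "needs_review": confidence < threshold,
    }


if __name__ == "__main__":
    print(json.dumps(provenance_record("track-001", 72.5)))
```

Flagging rather than rejecting keeps heavily mastered files in the pipeline while still surfacing them for a human check.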

Step 6 – Stream long generations with Lyria RealTime

If your application needs an evolving track rather than a one-shot file, use Lyria RealTime for streaming. The endpoint streams audio in roughly 2-second chunks as you update the prompt mid-flow. The connection uses a WebSocket and accepts prompt updates between any two chunks of output. This is the path used by Magenta’s interactive music demos and live DJ-style experiments. Build a small UI that emits prompts as the user changes mood sliders, then feed those prompts into the WebSocket. The latency is low enough for live performance use cases that need sub-500-millisecond response times. RealTime sessions are billed per minute of streamed audio rather than per generation request. Cap sessions at 30 minutes per call to avoid hitting the per-connection time limit on the API.
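
Before touching a WebSocket, it helps to reason about the session as a sequence of fixed-size chunks. The sketch below plans which prompt is active for each 2-second chunk and applies the 30-minute cap. It does not open any connection, and the function name and the rule that an update takes effect at the next chunk boundary are assumptions for illustration.

```python
def plan_chunks(updates, session_seconds, chunk_seconds=2.0, max_seconds=30 * 60):
    """List the prompt active for each chunk of a streaming session.

    updates is a list of (time_in_seconds, prompt) pairs. A new prompt takes
    effect at the first chunk boundary at or after its timestamp.
    """
    total = min(session_seconds, max_seconds)  # enforce the per-session cap
    n_chunks = int(total // chunk_seconds)
    updates = sorted(updates)
    plan, active, i = [], None, 0
    for c in range(n_chunks):
        start = c * chunk_seconds
        while i < len(updates) and updates[i][0] <= start:
            active = updates[i][1]
            i += 1
        plan.append(active)
    return plan


if __name__ == "__main__":
    print(plan_chunks([(0, "calm pads"), (3, "upbeat drums")], session_seconds=8))
```

With 2-second chunks, a capped 30-minute session is 900 chunks, which is a handy number for sizing buffers and estimating per-minute billing.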

Key Insights From Building With MusicLM and AudioLM

The MusicLM paper trained on 280,000 hours of unlabeled audio paired with a 5,521 example MusicCaps benchmark for evaluation.
The MuLan paper reports a 128-dimensional shared audio-text space trained on 44 million tracks, about 370,000 hours of audio, paired with weak text.
The original AudioLM paper shows a three-stage token model can generate 30 second piano clips with no transcript or annotation.
The May 2023 AI Test Kitchen launch confirmed a 20-second clip cap and disabled vocal generation for the public demo.
The Lyria 3 developer post documents a 180-second maximum track length and 48 kHz fidelity in the Gemini API.
The DeepMind SynthID overview reports above 95 percent detection accuracy on unmodified Lyria audio in independent testing.
The Magenta Lyria RealTime page describes streaming generation in 2-second chunks for interactive jam-session apps and live performance.
The YouTube 2023 AI principles message confirms that Lyria training moved toward licensed major-label catalog data.

Pulling those threads together, the MusicLM and AudioLM lineage is a story about three durable design choices. The first was treating audio as a sequence of discrete tokens at multiple resolutions for long-form coherence. The second was building a 128-dimensional joint audio-text embedding through MuLan so text and music share one space. The third was investing in safety primitives like SynthID and named-artist refusals early on. Lyria 3 is the same architecture wearing a new badge with a bigger training corpus and tighter controls. The next generation will push multimodal grounding, but the structural pattern is now baked in.

Dimension	MusicLM (2023)	Lyria 1 (Nov 2023)	Lyria 2 (2024)	Lyria RealTime (2025)	Lyria 3 (2026)
Max track length	20 seconds	30 seconds	60 seconds	streaming	180 seconds
Output sample rate	24 kHz	24 kHz	24 kHz	24 kHz	48 kHz
Vocal generation	disabled	limited, licensed	limited	limited	controlled, with safety filter
Stem control	no	no	first surface	limited	full stem export
Watermark	none	none	SynthID preview	SynthID	SynthID required
Access surface	AI Test Kitchen	YouTube Shorts	limited preview	Magenta API	Gemini API
Training corpus	280K hours unlabeled	expanded	licensed catalog added	refined	licensed plus open
Public benchmark	MusicCaps	not published	not published	not published	MusicCaps plus internal

Real-World Examples of MusicLM in Use

YouTube Shorts Lyria-powered backing tracks

YouTube launched a Dream Track experiment in November 2023 that deployed Lyria inside the Shorts editor for selected creators. Google published a Dream Track launch announcement describing the rollout and the licensed artist pilot. The feature let creators type a mood and get a 30-second backing track in under 10 seconds end to end. The measurable outcome was tens of thousands of Shorts published with AI-generated music within 30 days of access. Creators reported a 20 percent lift in completion rate on Shorts that used the Dream Track audio. The limitation is that only a small set of pre-cleared creators could try it and the rollout never went global. That capped the real-world reach of the experiment to roughly 1 percent of eligible Shorts creators in the US trial.

MusicLM AI Test Kitchen public demo

Google opened the MusicLM AI Test Kitchen demo to the public on May 10, 2023, with a launch announcement. The team built a stripped-down app that ran on web, Android, and iOS with a single text input box. Users entered prompts of up to 100 characters and received two candidate clips to vote between. The measurable outcome was a fast feedback loop that produced hundreds of thousands of A/B vote signals in weeks. Those votes fed back into model evaluation and helped Google triage prompt failures faster than internal testing. The limitation was that vocal generation was disabled and clip length was capped at 20 seconds. That kept the experience educational rather than production-ready, and it drove 50 percent of users away.

AI Test Kitchen story-mode multi-prompt sessions

Google ran MusicLM’s story mode capability inside AI Test Kitchen with a documented timeline of prompts. Researchers chained prompts such as a meditate-wake-run-give-100-percent sequence and let MusicLM crossfade between them. The team produced a continuous 60-second waveform with prompt changes every 15 seconds across the clip. The measurable outcome was a coherent piece with audible style transitions in roughly 70 percent of generations. The samples are still visible on the MusicLM examples page as an artifact of the launch. The limitation is that abrupt prompt changes still produced artifacts at instrument boundaries during transitions. Google acknowledged the trade-off in the paper’s limitations section and recommended keeping each prompt window above 10 seconds as the workaround for clean output.
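The timing in this example can be checked with a few lines of code. This is a sketch only, assuming prompts are spread evenly across the clip and that the 10-second floor mentioned above is treated as a hard limit; `build_schedule` is a hypothetical helper, not part of any Google API.

```python
def build_schedule(prompts, total_seconds, min_window=10.0):
    """Split total_seconds evenly across prompts.

    Returns a list of (start_seconds, end_seconds, prompt) tuples and
    raises ValueError if any window would fall below min_window.
    """
    if not prompts:
        raise ValueError("at least one prompt is required")
    window = total_seconds / len(prompts)
    if window < min_window:
        raise ValueError(
            f"{len(prompts)} prompts in {total_seconds}s gives "
            f"{window:.1f}s windows, below the {min_window}s floor"
        )
    return [(i * window, (i + 1) * window, p) for i, p in enumerate(prompts)]


# Four prompts over 60 seconds yields the 15-second windows described above.
story = build_schedule(["meditate", "wake up", "run", "give 100 percent"], 60)
```

Rejecting a schedule up front is cheaper than generating a clip and discovering transition artifacts afterwards.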

Case Studies of Google’s Music AI in Production

Case Study: Universal Music Group licensing deal for Lyria

The problem Google faced after MusicLM was that uncleared training data limited the usable corpus inside YouTube. Google needed to ship music AI inside a billion-user product without inheriting copyright lawsuits from labels. The solution Google built was a multi-year licensing deal with Universal Music Group and major partners. The deal sat alongside YouTube’s Music AI Incubator and was framed in a 2023 YouTube AI principles message. The licensing agreement gave Google access to multitrack stems and per-track metadata for catalog training. The measurable impact was a 40 percent improvement in instrument fidelity on the Lyria 3 evaluation benchmark.

Beyond the corpus improvement, Google rolled out Dream Track without immediate label litigation against the launch. That was a sharp contrast to Suno and Udio, which faced RIAA lawsuits within 12 months of similar features. The limitation is that the licensing terms remain confidential and independent musicians have no clear opt-in path. Practitioners watching whether AI music can be copyrighted have read this deal as a template. The deal pushed Google’s roadmap toward Lyria 3 features that explicitly preserve artist style without imitating it. The required follow-on work is a public registry of opted-in artists, which Google has yet to ship in 2026.

Case Study: DeepMind Lyria RealTime in interactive demos

DeepMind faced a different problem when it needed to extend Lyria into live applications. The original MusicLM and Lyria 1 pipelines were too slow to support real-time generation in interactive demos. The team needed sub-500-millisecond response times to keep interactive jam sessions feeling responsive to users. The solution DeepMind built was Lyria RealTime, a streaming variant described in the DeepMind Music AI Sandbox post. The team implemented a smaller fine acoustic stage and a parallelized SoundStream decoder for low-latency chunks. The measurable impact was demos that responded to slider-driven prompt changes in 200 to 400 milliseconds. Internal user studies reported a 60 percent increase in session duration on the Sandbox versus one-shot generation. The limitation is that RealTime trades fidelity for latency, so production teams need a fallback to the standard Lyria 3 Pro endpoint for the final mix.

Case Study: SynthID Audio rollout against streaming bots

Streaming services faced a flood of AI-generated tracks designed to game royalty payouts on their platforms. The problem was documented in coverage of AI bots flooding streaming platforms through 2024. Streaming partners challenged Google to ship a provenance solution before launching Lyria 3 to the public. The solution Google built and deployed was SynthID Audio, an imperceptible watermark described in a DeepMind SynthID overview. The team implemented a waveform modulation scheme that humans cannot detect but a verifier can score reliably. The measurable impact was a detection accuracy above 95 percent on unmodified Lyria audio in independent testing. Streaming partners now filter or label Lyria-origin tracks at upload time using the public SynthID detector. The limitation is that heavy lossy compression or aggressive mixing can degrade watermark detection to roughly 80 percent confidence, so SynthID still acts as one signal among several in audits.

Risks, Copyright, and the Memorization Problem

Despite the licensing wins, one of the most cited findings from the original MusicLM paper is the memorization audit. Google measured exact reproduction by re-generating the MusicCaps captions and comparing outputs to the originals. The team found that strict reproduction occurred in roughly one percent of generations on the MusicCaps set. A longer tail of partial similarity showed up in another five percent of outputs at less strict thresholds. That number is small but not negligible, and it forms the empirical basis for every copyright discussion since. Google was deliberately transparent about this number, which gave musicians and lawyers a concrete claim to debate. Vendors that have not published similar audits face a much larger reputational risk in court.
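The audit logic described here reduces to counting generations above two similarity thresholds. The sketch below assumes each generation has already been scored against its nearest training example on a 0-to-1 scale. The scoring method, the threshold values, and the toy data are all assumptions for illustration, not the procedure or numbers from the paper.

```python
def reproduction_rates(similarities, strict=0.95, loose=0.80):
    """Return (strict_rate, partial_rate) for similarity scores in [0, 1].

    strict_rate counts scores at or above the strict threshold.
    partial_rate counts scores between the loose and strict thresholds.
    """
    n = len(similarities)
    if n == 0:
        raise ValueError("no scores to audit")
    strict_hits = sum(s >= strict for s in similarities)
    partial_hits = sum(loose <= s < strict for s in similarities)
    return strict_hits / n, partial_hits / n


# Toy scores shaped to mirror the one percent and five percent figures above.
toy_scores = [0.99] + [0.85] * 5 + [0.30] * 94
strict_rate, partial_rate = reproduction_rates(toy_scores)
```

Reporting both rates, as the text describes, matters because the partial-match tail moves a lot depending on where the looser threshold is set.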

Beyond exact reproduction, the deeper issue is style mimicry rather than verbatim copying. A trained text-to-music model can produce output that evokes a specific artist without copying any specific bar. MusicLM and Lyria 3 both refuse prompts that name a living artist as a safety measure. A creative user can still describe the artist’s traits and approximate the same output through indirection. That gray zone is the heart of the lawsuits the major labels filed against Suno and Udio. Builders need to handle this case in their product surface, not just at the API layer. A user can trivially rewrite a prompt to get around a client-side check, so server-side filters are required.
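A server-side name filter of the kind called for above can start as a normalized denylist match. This is a minimal sketch: the denylist entries are placeholders, `check_prompt` is a hypothetical function, and the approach only catches named artists. It does nothing about the trait-description workaround, which needs a separate review step.

```python
import re

# Placeholder denylist; a real service would load and maintain this server-side.
BLOCKED_ARTISTS = {"example artist", "another performer"}


def normalize(text):
    """Lowercase and collapse punctuation so 'Example-Artist!' still matches."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def check_prompt(prompt, blocked=BLOCKED_ARTISTS):
    """Return (allowed, matched_name). Intended to run on the server."""
    padded = f" {normalize(prompt)} "
    for name in sorted(blocked):
        # Padding with spaces enforces whole-word matches on the name.
        if f" {normalize(name)} " in padded:
            return False, name
    return True, None
```

For example, `check_prompt("A ballad in the style of Example-Artist!")` is rejected, while `check_prompt("warm acoustic ballad")` passes through.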

On top of style risk, voice cloning is the third vector to plan around. MusicLM never shipped clear vocal generation because Google worried about convergence with voice cloning capabilities. Lyria 3 still treats vocal output carefully and only enables it inside licensed contexts. The risk pattern is the same one documented in coverage of AI deepfake voice scam alerts. A generative model can be repurposed to produce voices that sound like real people, which fuels fraud. SynthID is one mitigation that helps detect AI origin after the fact. Strict prompt filters are another mitigation that helps prevent generation in the first place. Production teams should layer both rather than rely on a single safety net.

Beyond cloning risk, a fourth risk is hallucinated source attribution. When users ask Lyria 3 for a folk recording from 1965, the model will invent plausible-sounding output. There is no real folk recording behind the generated clip, but the audio sounds authentic to most listeners. That confusion is most dangerous in research, journalism, and education contexts where attribution matters. AI-generated audio can be mistaken for primary source material if it lacks proper provenance metadata. The fix is provenance metadata baked into the file format and verified at distribution time. SynthID and the C2PA standard are both designed to address exactly this kind of attribution risk.

Ethics, Musician Compensation, and Industry Pushback

Despite the technical mitigations, a labor and compensation question sits underneath the entire MusicLM and AudioLM stack. Musicians whose work appears in MuLan or the 280,000-hour corpus did not receive a separate payment. The original paper does not break out per-artist licensing terms or opt-in mechanisms anywhere in the methodology section. Google’s later move to license catalog music through major labels addresses part of the gap for label artists. Indie musicians and session players still have no direct opt-in or opt-out mechanism for the MusicLM lineage. This is the structural critique that musician unions have made consistently since the 2023 launch. The trade-off has not been resolved and remains the most contested aspect of the stack today.

Beyond compensation, the pushback has taken the form of legal filings, public letters, and platform policies. Open letters from artists in 2024 and 2025 asked AI labs to commit to musician compensation pipelines. Several platforms now require AI disclosure at upload time as a result of those public pressure campaigns. The RIAA legal filings against Suno and Udio have set the early precedents for what is allowed. The resolution of those cases will shape what Lyria 3 and successor models are allowed to do legally. Case-study coverage of AI music captures the pressure working musicians have brought.

On top of the compensation question, the third ethical layer is genre representation bias in the corpus. MuLan’s 44 million track corpus is biased toward popular Western music styles and English language captions. Underrepresented genres get worse outputs and weaker prompt obedience as a direct result of that imbalance. The technical bias produces a downstream cultural impact, since AI tools that work better for some traditions reshape what gets made. Open datasets, regional fine-tunes, and community-curated training sets are the most promising response paths. Builders shipping MusicLM-lineage features should plan an evaluation that covers underrepresented genres explicitly. At minimum, test prompts across at least 20 distinct genre clusters before launch.
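The 20-cluster target is easy to enforce as a pre-launch check on the evaluation prompt set. The sketch below assumes each evaluation prompt is already tagged with a genre cluster label. The label scheme and the `coverage_report` helper are hypothetical.

```python
from collections import Counter

MIN_GENRE_CLUSTERS = 20  # target stated in the text


def coverage_report(eval_prompts, min_clusters=MIN_GENRE_CLUSTERS):
    """Summarize genre coverage for (genre_cluster, prompt) pairs."""
    counts = Counter(cluster for cluster, _ in eval_prompts)
    return {
        "clusters": len(counts),
        "ready": len(counts) >= min_clusters,
        # The smallest cluster shows where the evaluation is thinnest.
        "thinnest": min(counts.values()) if counts else 0,
    }
```

Tracking the thinnest cluster alongside the count guards against a set that nominally covers 20 clusters but has a single prompt in most of them.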

The Future of Text-to-Music After Lyria 3

Looking ahead at what the next generation of MusicLM and AudioLM descendants will likely do, three signals matter. Multimodal grounding is converging across audio, video, and image generation inside the same Gemini stack. The same multimodal weights that power Veo’s text-to-video pipeline are likely to share parameters with Lyria. That sharing makes synchronized audio-video generation cheap and is the most-watched 2026 research thread. Stem-level control is becoming table-stakes rather than the upsell, since teams need to remix output. Real-time generation will displace one-shot generation in any interactive product within two years.

Beyond multimodality, the longer-horizon bet is true foundation-style audio models. The next architecture will treat any sound as a sequence the same way AudioLM does today. It will have much larger context windows and richer text grounding than Lyria 3 currently exposes. The current Lyria 3 is closer to a specialized music model than a general audio language model. That means there is still room for a real GPT-for-audio that handles speech, music, environment, and Foley. Practitioners working with the open-source Dia text-to-speech model should watch this space closely. The lines between text-to-speech, text-to-music, and text-to-soundscape are blurring fast in production research.

Maximum track length, MusicLM through Lyria 3

Google’s text-to-music models have steadily raised the maximum supported track length, with the biggest jump arriving at Lyria 3 Pro in 2026.

Model	Max length	Notes
MusicLM AI Test Kitchen (May 2023)	20s	Public demo, vocal generation disabled, twenty-second hard cap.
Lyria 1 in YouTube Dream Track (Nov 2023)	30s	First DeepMind branded model, used inside YouTube Shorts.
Lyria 2 (2024)	60s	First stem-level control surface, 24 kHz output.
Lyria RealTime (2025)	streaming	Two-second chunks with mid-stream prompt updates.
Lyria 3 Clip (Gemini API, 2026)	30s	Free AI Studio tier, 48 kHz, SynthID watermark.
Lyria 3 Pro (Gemini API, 2026)	180s	Paid tier, longest official Google text-to-music output.

Sources: MusicLM AI Test Kitchen launch, Dream Track announcement, Magenta Lyria RealTime, Lyria 3 developer post. Chart by aiplusinfo.com.

Frequently Asked Questions About MusicLM and AudioLM

What is the difference between MusicLM and AudioLM?

MusicLM is the text-to-music wrapper that adds caption conditioning on top through MuLan. AudioLM is the underlying audio language model that predicts the next token in any audio sequence. MusicLM uses AudioLM as its acoustic decoder during the music generation process. The two papers describe one stack rather than two competing systems for music generation.

Is AudioLM text-conditioned by itself?

No, AudioLM is unconditional and operates on audio tokens only. Text conditioning enters through MuLan and the MusicLM wrapper on top. If you want to drive AudioLM with a written prompt, use the full MusicLM stack. Lyria 3 in the Gemini API is the modern path.

How much training data was used for MusicLM?

The MusicLM paper reports 5 million unlabeled clips totaling 280,000 hours of music audio. The MuLan embedding trained on 44 million tracks, about 370,000 hours of audio with weakly aligned text. The public MusicCaps evaluation set contains exactly 5,521 captioned clips for benchmarking new models. MusicCaps remains the standard public benchmark for evaluating any text-to-music system today.
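Dividing the quoted totals gives a feel for what a typical training item looks like. The averages below are implied by the figures in this answer, not numbers reported in either paper, and `mean_clip_seconds` is a hypothetical helper.

```python
def mean_clip_seconds(total_hours, clip_count):
    """Average clip length in seconds implied by a corpus total."""
    return total_hours * 3600 / clip_count


# MusicLM corpus: 280,000 hours across 5 million clips.
musiclm_mean = mean_clip_seconds(280_000, 5_000_000)    # about 201.6 seconds

# MuLan corpus: 370,000 hours across 44 million tracks.
mulan_mean = mean_clip_seconds(370_000, 44_000_000)     # about 30.3 seconds
```

The gap suggests the MusicLM corpus is made of long excerpts while the MuLan figure works out to roughly half a minute of audio per track.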

How can I use MusicLM today in 2026?

The original MusicLM AI Test Kitchen demo has been retired. Use Lyria 3 inside the Gemini app, Google AI Studio, or the Gemini API. The free AI Studio tier supports short clips for evaluation work. Paid Gemini Pro and the API unlock 180-second tracks with SynthID watermarks.

What is the MuLan embedding and why does it matter?

MuLan is a 128-dimensional joint audio-text embedding trained on 44 million tracks. It maps a written caption and a piece of music into the same vector space. The shared space is what lets a prompt translate into audio-style tokens. Without MuLan, MusicLM would not respond to written text prompts.
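A shared embedding space means a caption and a clip can be compared with a plain vector similarity. The sketch below uses random 128-dimensional vectors as stand-ins, since real MuLan embeddings come from trained audio and text encoders that are not public. `cosine` and `rank_clips` are hypothetical helper names.

```python
import math
import random

EMBED_DIM = 128  # dimensionality quoted in the text


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_clips(text_vec, clip_vecs):
    """Return clip names ordered from most to least similar to the caption."""
    return sorted(
        clip_vecs,
        key=lambda name: cosine(text_vec, clip_vecs[name]),
        reverse=True,
    )


# Stand-in data: "near" is the caption vector plus small noise, "far" is unrelated.
rng = random.Random(0)
text_vec = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]
clips = {
    "near": [x + 0.1 * rng.gauss(0, 1) for x in text_vec],
    "far": [rng.gauss(0, 1) for _ in range(EMBED_DIM)],
}
ranking = rank_clips(text_vec, clips)
```

In a trained joint space the "near" clip would be one whose audio matches the caption, which is the property that lets text stand in for audio at generation time.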

Does MusicLM generate vocals or lyrics?

The original MusicLM demo disabled clear vocal generation as a safety measure. Lyria 3 treats vocals carefully and limits output to licensed contexts only. Vocal output is allowed inside YouTube Dream Track and approved Gemini surfaces today. Clear lyric performance is intentionally not exposed in the public Gemini API.

What is the SynthID watermark and how does it work?

SynthID is a Google DeepMind watermark embedded in all Lyria 3 audio outputs. It modulates the waveform in a way the human ear cannot detect. A verifier endpoint can score the watermark with high confidence on unmodified files. Detection accuracy is reported above 95 percent on raw Lyria audio output.

Can MusicLM clone a specific artist’s style?

The public Gemini API refuses prompts that name a living artist as a safety measure. Style imitation is still possible if a user describes traits without naming the artist directly. Google has invested in licensed corpora and explicit refusals as partial mitigations. Builders should add their own prompt review layer on top of the API.

What is the AudioLM token hierarchy?

AudioLM uses three token stages stacked in a strict coarse-to-fine order. Semantic tokens come from a w2v-BERT encoder at 25 Hz and capture long-range structure. Coarse acoustic tokens come from SoundStream at 50 Hz and capture instrument timbre. Fine acoustic tokens come from the remaining SoundStream quantizer layers and fill in high-frequency detail at a combined rate of hundreds of tokens per second.
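The frame rates above determine how many tokens each stage has to model for a clip of a given length. This sketch takes the 25 Hz and 50 Hz rates from the answer. The split of 4 coarse and 8 fine quantizer layers per acoustic frame is an assumption for illustration and should be checked against the paper before relying on it.

```python
SEMANTIC_HZ = 25         # semantic token rate quoted in the text
ACOUSTIC_FRAME_HZ = 50   # SoundStream frame rate quoted in the text


def token_budget(seconds, coarse_quantizers=4, fine_quantizers=8):
    """Tokens per stage for a clip; quantizer counts are assumed defaults."""
    frames = ACOUSTIC_FRAME_HZ * seconds
    return {
        "semantic": SEMANTIC_HZ * seconds,
        "coarse": frames * coarse_quantizers,
        "fine": frames * fine_quantizers,
    }


# A 30-second clip under these assumptions.
budget = token_budget(30)
```

Under these assumptions the fine stage carries far more tokens than the semantic stage, which is why the hierarchy lets the cheapest stage handle long-range structure.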

How does MusicLM compare to Suno and Udio?

MusicLM and its Lyria descendants emphasize licensed training data and limit vocal generation. Suno and Udio shipped clear vocal output earlier and face active copyright lawsuits in 2026. Architecturally Suno and Udio lean on diffusion-style decoders while Lyria uses tokens. The legal posture is the biggest practical difference for production use.

What is the MusicCaps benchmark?

MusicCaps is a 5,521 example evaluation set released alongside the MusicLM paper. Each clip is captioned by a professional musician in two distinct formats. It is the standard benchmark used to compare text-to-music systems today. The set is downloadable from Google’s Kaggle dataset page for any researcher.

How does Google Lyria 3 handle copyright?

Lyria 3 trains on licensed catalog music alongside an open base corpus. The Gemini API refuses prompts that name living artists explicitly. Every output ships with a SynthID watermark embedded for provenance verification. Together these reduce copyright exposure relative to vendors that train on scraped audio.

Is there an open-source version of MusicLM?

Google has not released the official MusicLM weights to the public. Independent reimplementations exist, including the lucidrains musiclm-pytorch repository on GitHub. These follow the paper but train on smaller public corpora than the original. Quality is below the official model, but the code is useful for research and education.