Fugatto AI Model From NVIDIA Can Generate Music, Background Audio and Voice

NVIDIA has introduced Fugatto, a new generative AI model that could revolutionise how we interact with sound. Dubbed the “Swiss Army knife” of audio, Fugatto blends music, voices, and sound effects seamlessly, offering creators an unprecedented level of control over their audio projects. From composing music to altering voice accents and emotions, Fugatto is pushing the boundaries of what’s possible with AI in sound design.

What Sets Fugatto Apart?

While existing AI tools can generate songs or modify voices, Fugatto’s versatility is unmatched. This foundational generative audio transformer can mix text and audio prompts to create or transform soundscapes in ways previously thought impossible. Imagine crafting a symphony of barking trumpets or composing a melody that transitions into a dawn chorus of chirping birds—Fugatto makes it happen.

Multi-platinum producer and songwriter Ido Zmishlany, co-founder of One Take Audio, described the technology as a creative game-changer. “With AI, we’re writing the next chapter of music. It’s a new instrument, a new tool for making music—and that’s super exciting,” he said.

Breaking Down Fugatto’s Capabilities

At its core, Fugatto enables users to:

  • Compose Original Audio: Generate music, sound effects, or even singing voices based on text prompts.
  • Transform Existing Sounds: Add or remove instruments, change the mood of a song, or modify voice accents and emotions.
  • Create Dynamic Soundscapes: Build evolving audio, from thunderstorms fading into birdsong to custom voiceovers tailored to any scenario.

Rafael Valle, NVIDIA’s applied audio research manager and a co-creator of Fugatto, explained, “We wanted to create a model that understands and generates sound like humans do.”

One of Fugatto’s standout features is temporal interpolation, which allows sounds to evolve over time. Think rainstorms that crescendo with thunder before tapering off or a soundscape that changes as if it were alive.
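NVIDIA has not published the mechanics of temporal interpolation, but the idea of a sound evolving over time can be illustrated as a cross-fade between two prompt conditions across a sequence of time steps. The following sketch is purely illustrative; the function name, the toy two-dimensional "embeddings," and the linear schedule are all assumptions, not Fugatto's actual implementation.

```python
# Illustrative sketch only: blend two hypothetical prompt embeddings over
# time, so the conditioning drifts from one sound (rain) to another (birds).

def interpolate_conditions(cond_a, cond_b, num_steps):
    """Return per-step blends moving linearly from cond_a to cond_b."""
    steps = []
    for t in range(num_steps):
        alpha = t / (num_steps - 1)  # 0.0 at the first step, 1.0 at the last
        steps.append([(1 - alpha) * a + alpha * b
                      for a, b in zip(cond_a, cond_b)])
    return steps

rain = [1.0, 0.0]   # hypothetical embedding for "rainstorm"
birds = [0.0, 1.0]  # hypothetical embedding for "birdsong"
schedule = interpolate_conditions(rain, birds, 5)
print(schedule[0])   # -> [1.0, 0.0]  (pure rain at the start)
print(schedule[-1])  # -> [0.0, 1.0]  (pure birdsong at the end)
```

A real model would feed each blended condition into the generator for the corresponding stretch of audio, so the output "crescendos" from one texture into the next rather than cutting abruptly.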

How It Works

Fugatto is powered by a 2.5-billion-parameter generative transformer model, trained on NVIDIA DGX systems with 32 H100 Tensor Core GPUs. The training process relied on millions of diverse audio samples, which gave Fugatto its multilingual and multi-accent capabilities.

The model uses a technique called ComposableART to combine instructions from different prompts. For example, users can create text spoken with a French accent and a sad tone, tweaking the intensity of each attribute to suit their needs.
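The details of ComposableART are NVIDIA's own, but combining independently weighted attributes is commonly done in generative models by nudging an unconditional output toward several conditional ones, each scaled by its own intensity. The sketch below shows that general pattern with toy numbers; every name and vector here is hypothetical and is not Fugatto's API.

```python
# Illustrative sketch of weighted attribute composition: each prompt
# (e.g. "French accent", "sad tone") contributes a direction relative to
# the unconditional output, scaled by a user-chosen intensity weight.

def compose_guidance(uncond, conds, weights):
    """Blend several conditional outputs around an unconditional baseline."""
    composed = list(uncond)
    for cond, w in zip(conds, weights):
        for i, (c, u) in enumerate(zip(cond, uncond)):
            composed[i] += w * (c - u)  # push toward this attribute by w
    return composed

uncond = [0.0, 0.0]
accent = [1.0, 0.0]  # hypothetical direction for "French accent"
sad = [0.0, 1.0]     # hypothetical direction for "sad tone"

# Dial the accent up (0.8) and the sadness down (0.5) independently.
print(compose_guidance(uncond, [accent, sad], [0.8, 0.5]))  # -> [0.8, 0.5]
```

The appeal of this kind of composition is exactly what the article describes: each attribute gets its own knob, so the accent can be strengthened without also making the voice sadder.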

Creative Possibilities Across Industries

The use cases for Fugatto extend beyond music production.

  • Ad agencies can localize campaigns with voiceovers in different accents or emotions.
  • Education tools could offer personalized audio, such as courses spoken in the voice of a family member.
  • Game developers can generate dynamic audio assets that adapt to real-time gameplay.

Zmishlany draws parallels between Fugatto and transformative technologies of the past. “The electric guitar gave us rock and roll. The sampler birthed hip-hop. With Fugatto, AI is our next big instrument.”

From Concept to Creation

The project brought together researchers from across the globe, from India to Brazil. The team’s breakthrough moment came when Fugatto created music entirely from a text prompt. Another highlight? A demo of electronic beats synced with barking dogs that left the team in stitches.

“We’re proud of what we’ve built,” said Valle. “Fugatto isn’t just about sound—it’s about empowering creators to push their artistic boundaries.”
