Nvidia’s new AI model makes music from text and audio prompts


Nvidia has released a new generative audio AI model capable of creating myriad sounds, music, and even voices from simple text and audio prompts.

Dubbed Fugatto (short for Foundational Generative Audio Transformer Opus 1), the model can, for example, create jingles and song snippets from text prompts alone, add or remove instruments and vocals in existing tracks, modify both the accent and emotion of a voice, and “even let people produce sounds never heard before,” per Monday’s announcement post.

“We wanted to create a model that understands and generates sound like humans do,” said Rafael Valle, a manager of applied audio research at Nvidia. “Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale.”

The company notes that music producers could use the AI model to rapidly prototype and vet song ideas in various musical styles with varying arrangements, or add effects and additional layers to existing tracks. The model could also be leveraged to adapt and localize the music and voiceovers of an existing ad campaign, or adjust a video game’s music on the fly as a player moves through a level.

The model is even capable of generating previously unheard sounds like barking trumpets or meowing saxophones. In doing so, it uses a technique called ComposableART to combine the instructions it learned during training.

“I wanted to let users combine attributes in a subjective or artistic way, selecting how much emphasis they put on each one,” Nvidia AI researcher Rohan Badlani wrote in the announcement post. “In my tests, the results were often surprising and made me feel a little bit like an artist, even though I’m a computer scientist.”

The Fugatto model uses 2.5 billion parameters and was trained on 32 Nvidia H100 GPUs. Audio AIs like this are becoming increasingly common: Stability AI unveiled a similar system in April that can generate tracks up to three minutes long, while Google’s V2A model can generate “an unlimited number of soundtracks for any video input.”

YouTube recently released an AI music remixer that generates a 30-second sample based on the input song and the user’s text prompts. Even OpenAI is experimenting in this space, having released an AI tool in April that needs just 15 seconds of sample audio in order to fully clone a user’s voice and vocal patterns.

Andrew Tarantola
Former Digital Trends Contributor
Andrew Tarantola is a journalist with more than a decade reporting on emerging technologies ranging from robotics and machine…