The multimodal AI trade-off for communications leaders

Unprecedented creative power can drive breakthrough storytelling — or pull organizations into a new era of mediocrity.

Martin Waxman

Jan. 15, 2026

This story is brought to you by Ragan's Center for AI Strategy. Learn more by visiting ragan.com/center-for-ai-strategy

Martin Waxman is associate director, Future of Marketing Institute. He is also an advisor at the Center for AI Strategy.

Let’s face it. When communicators think about storytelling, the first thing that comes to mind is words. Yet text alone is only part of the equation. To reach your audience effectively, you need to incorporate other modes of media, too.

And that’s where multimodal AI comes in.

What is multimodal AI?

Multimodal AI is a system that can process more than one type of data. It combines natural language processing (how machines interpret words and the connections between them) and computer vision (how machines interpret what they see).

In the early days of chatbots (i.e. circa 2023), generative AI could ‘understand’ either text or visuals separately, but not both together. You had ChatGPT (text) and Dall-E (images) as two separate apps. Each operated in a different way.

Today, many AI systems, like ChatGPT, Gemini, Claude and Grok, can grasp the relationship between images and text. They can also make sense of video, spreadsheets, computer code, audio, design, music, languages …

In some ways, this is similar to the way you and I perceive the world. That’s because combining words and visuals in novel ways is how we learn, understand, create and communicate.

If machines can do that as well as or better than most people, where does that leave professionals like you and me?

Still at the starting gate

Right now, we’re in the early stages of multimodal AI. And when you first encounter AI ‘productions,’ they almost seem like magic.

Want to create a social media video? Try Sora, Invideo, Google Veo or RunwayML. Visually describe your idea when you enter your prompt or paste in your script. Then, sprinkle in some visuals and you’re on your way.

Need a song or some music for a video or podcast? Explain your concept and style to Suno or Udio and, in minutes, you have a fully orchestrated tune.

Want your voice translated into a language you don’t speak? ElevenLabs can help you out.

What about a landing page or website? Just share your goal and brief with WordPress or Wix and watch it appear before your eyes.

Want to conduct research, or read, summarize and spot patterns across the many tabs open in your browser? Dia, Perplexity’s Comet and ChatGPT Atlas browsers let you do that right now.

Multimodal AI will help you turn AI agents from text-based chatbots into embodied synthetic characters with expressive faces and voices that look at you, speak and collect data in real time. You can see an early example of what that might look like from the startup D-iD.

A creative department of one?

With multimodal AI, you can take any idea in your head and instantly bring it to life.

Yet the process is so simple and fast, it might inadvertently reduce the time you need to reflect on the substance of your idea or its viability.

And you’ll need to balance the trade-off between quality and speed and ensure your judgment pushes you to the high-quality side.

If you’re a digital communications or marketing professional, your talents just grew exponentially. You have access to all the tools you need to perform tasks that once required teams of people to complete.

And because most marketing and comms outputs tend to be “good enough,” you may feel multimodal AI tools are all you need to get the job done.

The risk with this approach is that you fall into the trough of formulaic output.

Remember the days of desktop publishing? Sure, anyone could design a page. But most of the results were barely competent, since the people using the tools had little or no graphic sense.

We had tools we could use to make an approximation of a great design. A somewhat reasonable facsimile. But the finished product lacked the nuance and sparkle of the real thing.

Multimodal AI tools can’t replace vision and talent

Standout content requires more than competence.

You need original ideas, artistry and craft, and the ability to put them all together in a surprising and memorable way.

Subject matter expertise doesn’t happen overnight. Aesthetics, experience and human discernment play a big role, too.

Multimodal tools become most powerful when leaders work shoulder to shoulder with visual and creative specialists. Together, you can use AI to explore options, test ideas, and accelerate production, while applying shared judgment every step of the way. In this approach, AI doesn’t simply replace craft or creative leadership; it strengthens the partnership between them.

Learn more about the Center for AI Strategy.