Rethinking Inference for Diffusion

Published on Jan 5, 2026

Written by Cagla Kaymaz

Myriad Founders

AI-generated images and videos are getting remarkably good and increasingly difficult to spot. As quality continues to improve, the bottleneck is shifting away from quality and toward the speed and cost of serving these models. Today, media generation is simply too slow, limited, and expensive for widespread adoption. For example, Google’s Veo-3 caps video generation at 8 seconds and costs $0.40 per second, while OpenAI’s Sora 2-Pro supports up to 25 seconds at $0.50 per second; both take minutes to generate just seconds of video.

At the moment, performance at the frontier is dominated by proprietary models. There are no open-source text-to-image or text-to-video models that match the quality of the leading closed-source models. In fact, the top ten text-to-image and text-to-video models on public leaderboards are all proprietary.

But this won’t last. 

Image and video diffusion models are approaching their own “DeepSeek moment”

In early 2025, DeepSeek-R1 marked a turning point for reasoning models. DeepSeek-R1 was the first open-weights model to publicly match OpenAI’s o1 performance, beating larger and better-funded efforts from Google and Meta to the milestone. Despite hyperscalers’ massive budgets, R1 was widely believed to have been trained for a fraction of the cost while achieving comparable performance with a much smaller model. What followed was a rapid shift: competition intensified, investment across both open and proprietary ecosystems accelerated, AI became more accessible, and real-world usage surged. As adoption increased, the flywheel took hold, and the LLM inference market exploded.

When, not if, open-source diffusion models that power image and video generation cross a similar quality threshold, there will be a reinforcing cycle of investment and adoption across both open and closed ecosystems. Capturing that rapid adoption curve will require a new inference stack, purpose-built to serve diffusion models efficiently. 

We need a media-focused inference stack

The inference infrastructure that powers today’s AI systems is largely designed for text. Autoregressive models used for text generation produce output one token at a time. Common optimization techniques like batching requests, KV caching (reusing past context), and optimized token scheduling work exceptionally well for text generation because LLMs continually reference earlier tokens in the sequence. Many of the abstractions and optimizations that make text generation performant do not translate well to media generation.
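
To make the contrast concrete, here is a minimal, runnable sketch of why KV caching pays off in autoregressive decoding. It is illustrative only: project_kv is a toy stand-in for a transformer layer’s key/value projections, not any real engine’s API.

```python
# Minimal sketch of KV caching in autoregressive decoding.
# Illustrative only: project_kv is a toy stand-in for a transformer
# layer's key/value projections, not a real inference engine's API.
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

def project_kv(token_embedding):
    # Stand-in for the K/V projections of one attention layer.
    return token_embedding * 0.5, token_embedding * 2.0

kv_cache = []  # grows by one entry per generated token, reused every step
tokens = [rng.normal(size=d_model) for _ in range(5)]  # toy "generated" tokens

for step, tok in enumerate(tokens):
    kv_cache.append(project_kv(tok))            # compute K/V once per token
    keys = np.stack([k for k, _ in kv_cache])   # attention re-reads the cache
    scores = keys @ tok                         # new query vs. all cached keys
    print(f"step {step}: attended over {scores.shape[0]} cached positions")
```

Because each new token re-reads everything generated so far, caching turns repeated recomputation of past keys and values into a single lookup per step, and the same sequential structure is what makes batching and token scheduling effective.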

Diffusion models powering image and video generation work fundamentally differently. Instead of producing output sequentially, they iteratively refine entire image or video tensors over many denoising steps. There is no long token sequence to cache and reuse. Computation operates over large, dense tensors, and performance is dominated by GPU memory movement (reading and writing these tensors) rather than token reuse. 
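
A stripped-down sampling loop makes the difference visible. This is a sketch, not a real sampler: predict_noise stands in for a large denoising network, and the update rule omits the scheduled coefficients a real method (DDPM, DDIM, Euler, etc.) would use. The point is the shape of the computation: every step reads and rewrites the entire latent tensor.

```python
# Stripped-down diffusion sampling loop (illustrative only): predict_noise
# stands in for a large denoising network, and the update rule omits the
# scheduled coefficients a real sampler (DDPM, DDIM, Euler, ...) would use.
import numpy as np

rng = np.random.default_rng(0)
num_steps = 50
latent = rng.normal(size=(4, 64, 64))  # the full image latent exists from step one

def predict_noise(x, t):
    # Stand-in for a full forward pass of a U-Net / DiT denoiser.
    return 0.1 * x

for t in reversed(range(num_steps)):
    eps = predict_noise(latent, t)  # read the entire tensor...
    latent = latent - eps           # ...and rewrite the entire tensor
# After the loop, `latent` would be decoded (e.g. by a VAE) into pixels.
```

Each of the fifty steps pushes the whole tensor through the network, so throughput is bounded by how fast those tensors move through GPU memory, not by how cleverly past outputs are reused.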

For video, models must process multiple frames together, multiplying memory requirements and demanding tight coordination across GPUs. Unlike LLMs, where GPU capacity can be efficiently shared across many concurrent requests, video generation jobs tend to monopolize the GPU resources they run on for the duration of the computation.
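
Some back-of-the-envelope arithmetic shows why. Every number below is a hypothetical assumption chosen for illustration, not any particular model’s real configuration.

```python
# Back-of-the-envelope memory math for video. All numbers are hypothetical
# assumptions for illustration, not a real model's configuration.
frames, channels, height, width = 121, 16, 90, 160  # ~5 s of latent video
bytes_per_elem = 2                                   # bf16 activations

latent_mb = frames * channels * height * width * bytes_per_elem / 1e6
print(f"one video latent: {latent_mb:.0f} MB")  # vs. ~0.5 MB for a single frame

# The harder cost is attention across all frames jointly: sequence length
# scales with frames * height * width (assuming 2x2 spatial patches here).
tokens = frames * (height // 2) * (width // 2)
print(f"tokens seen by joint attention: {tokens:,}")  # hundreds of thousands
```

Activations, attention buffers, and model weights sit on top of that, which is why a single video generation job tends to own its GPUs end to end.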

As a result, serving video models is significantly more resource-intensive and far harder to optimize than LLMs.

While a few inference startups focus on diffusion, most are optimized for images rather than video, and the underlying stack remains nascent. Open-source tooling for LLM inference, including widely adopted inference engines like vLLM and KV cache optimizations like LMCache, has matured significantly over time. Comparable infrastructure for diffusion is only beginning to emerge. Because the stack will continue to take shape over the next few years, early movers have limited durable technical advantage.

Taken together, the coming DeepSeek moment for diffusion models, the significantly higher compute demands of media generation, and the need for a fundamentally different inference stack than what exists for LLMs make now a compelling time to build a diffusion-focused inference platform. 

If you are a founder working on this, let’s chat.

