AI Image Model Benchmark 2026: Character Consistency Test

You generate an image for your article cover. Looks great. Next article, same character, same style, roughly the same prompt. Completely different face. Not a variation. A different person. The model just decided your character needed a new identity, like a witness protection program but for pixels.

On a single article, who cares. On a publication where the same characters show up across dozens of posts, it's a branding problem you don't notice until a reader asks "wait, is that supposed to be the same guy?" Pixar can't ship Woody with a different face in every scene. Neither can a comic book. Neither can your blog, if you actually want people to recognize your covers while they scroll past.

TLDR: I tested 13 AI image models for character consistency. Two survived. The cheapest one won. And the reason has nothing to do with prompting.

Image: two office workers comparing AI character generation results with different facial outputs. Caption: AI image models, where consistency goes to die, one render at a time.

The Problem Nobody Mentions When They Ship AI-Generated Visuals

Every model demo shows you one stunning image. Nobody shows you the same character generated ten times in a row. Because that's where it falls apart.

My setup: two recurring characters on every article cover, generated through an automated pipeline via fal.ai. An office worker with a tie and a permanently deadpan expression, and a caped hero. Same characters, same universe, same visual identity across every piece of content. First few articles, fine. Then the faces started drifting. Subtly at first (jawline slightly off, hair parting on the wrong side), then not subtly at all (completely different human being staring back at me).

I could keep adjusting prompts by hand every time, like some kind of prompt-whisperer hoping the next generation would land. Or I could just test every model available on the platform and figure out which ones actually hold.

So I tested all thirteen. One shared prompt template. Same two character references injected every time. Systematic, not vibes.

Without consistency, every image is a one-shot. With a pipeline, that's a leak you don't see until your readers do.

Character Consistency Isn't a Prompt Problem. It's an Endpoint Problem.

Most people troubleshoot character consistency by rewriting prompts. Adding more detail, tweaking the style description, playing with seeds. I did the same for three weeks. It was like tuning a guitar that has no strings.

The actual mechanism is simpler and more brutal: positional reference injection via /edit endpoints with multiple image_urls. Not a prompt trick. Not a parameter. An endpoint.

Your character reference images get sent as data URIs in the image_urls field. The model maps placeholders like @image1 and @image2 onto those references. The prompt says "put @image1 on the left", the model knows exactly which face to use. Two references in, two consistent characters out.
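A minimal sketch of that mechanism, under stated assumptions. The `prompt` and `image_urls` fields and the `@imageN` placeholders come from the description above; `buildEditInput`, `CharacterRef` and `EditInput` are hypothetical names, and the exact request schema should be checked against the endpoint's own docs. The sketch only builds the payload and verifies that every placeholder has a reference behind it. It makes no network call.

```typescript
// Hypothetical types for illustration; not part of any SDK.
interface CharacterRef {
  name: string;    // label for humans, e.g. "office worker"
  dataUri: string; // reference image encoded as a data URI
}

interface EditInput {
  prompt: string;
  image_urls: string[];
}

// Builds the input for an /edit-style call.
// Position matters: refs[0] backs @image1, refs[1] backs @image2, and so on.
function buildEditInput(prompt: string, refs: CharacterRef[]): EditInput {
  const placeholder = /@image(\d+)/g;
  let match: RegExpExecArray | null;
  while ((match = placeholder.exec(prompt)) !== null) {
    const index = Number(match[1]);
    if (index < 1 || index > refs.length) {
      throw new Error(
        `Prompt uses @image${index} but only ${refs.length} reference(s) were given`
      );
    }
  }
  return { prompt: prompt, image_urls: refs.map((r) => r.dataUri) };
}
```

Failing loudly on a dangling placeholder is a design choice: a prompt that mentions @image2 with only one reference attached would otherwise go out looking valid.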

The models that fail? All the same architectural reason: they don't support two simultaneous image references. Kontext Pro takes one reference max. Ideogram Character, same story. Ideogram V3 takes zero references (text description only). When the pipeline hits a model that can't take two refs, it falls back to adaptPromptForModel(), which replaces @image2 with something like "an office worker character with tie and deadpan expression."
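The fallback could look roughly like this. Only the function name and the replacement text come from the pipeline described above; the signature, the `RefDescription` shape and the `maxRefs` argument are assumptions made for illustration.

```typescript
// Hypothetical reconstruction: only the name adaptPromptForModel() is known,
// the signature and body below are assumptions.
interface RefDescription {
  placeholder: string; // e.g. "@image2"
  description: string; // text used when the model cannot take this ref
}

function adaptPromptForModel(
  prompt: string,
  refs: RefDescription[],
  maxRefs: number // how many image references the target model accepts
): string {
  let adapted = prompt;
  // Refs beyond the model's limit are replaced by their text description.
  // Simple string replacement: assumes fewer than ten refs, so "@image1"
  // can never be a prefix of another placeholder such as "@image10".
  refs.slice(maxRefs).forEach((ref) => {
    adapted = adapted.split(ref.placeholder).join(ref.description);
  });
  return adapted;
}
```

With `maxRefs` set to 1, "@image1 on the left, @image2 on the right" keeps @image1 and turns @image2 into plain text. The pipeline keeps running, but the second face is now a description, not a reference.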

You can describe a face in forty words. The model will still generate a different face every single time. That part is settled.

The model doesn't fail on quality. It fails before it even reads your prompt.

The Benchmark: 13 Models, 3 Tiers, One Winner at $0.012

Two models got five stars. Seven got one star or none.

Tier S (5 stars, in production): Flux 2 Dev at $0.012/MP and Nano Banana 2 at roughly $0.01/MP. Both support multiple image_urls in /edit mode, both hold character across consecutive generations.

Tier A (4 stars, usable): Flux 2 Pro, GPT Image 1.5, Flux 2 Turbo, Flux 2 Flash. They handle the dual reference mechanism but with minor drift between runs.

Tier F (0-1 star, rejected): Flux 2 Max, Flux 2 Flex, Kontext Pro/Multi/Max, Ideogram V3/Character. Most fail for the architectural reason above.

Two results that deserve a closer look.

Flux 2 Max: 1 star at $0.07/MP. Almost 6x the price of Dev. And it was worse. It added gloves to a character who never had them. Lightning effects behind a hero who was supposed to be standing still. The premium model hallucinates costume elements while the budget one just does what you asked. (Karen from Accounting would have something to say about that ROI.)
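To put the price gap in per-image terms, here is a small worked calculation. It assumes billing is linear in pixel count with 1 MP = 1,000,000 pixels and no rounding, which the provider may handle differently. The 1536x1024 size is the cover format that shows up later in this article; `costPerImage` is a hypothetical helper.

```typescript
// Assumption: price scales linearly with pixel count, 1 MP = 1,000,000 pixels.
function costPerImage(width: number, height: number, pricePerMP: number): number {
  return ((width * height) / 1e6) * pricePerMP;
}

const dev = costPerImage(1536, 1024, 0.012); // Flux 2 Dev
const max = costPerImage(1536, 1024, 0.07);  // Flux 2 Max

// prints "0.0189 0.1101 5.83"
console.log(dev.toFixed(4), max.toFixed(4), (max / dev).toFixed(2));
```

Under those assumptions, that is about two cents against eleven cents per cover, a ratio of 5.83.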

Kontext and Ideogram: both scored zero despite being marketed specifically for character reference. Their endpoints simply do not accept two simultaneous image_urls. One ref, sure. Two refs for two characters in the same scene? Architecture says no. The marketing page disagrees with the API documentation, and I know which one I trust more.

Image: comparison grid showing character consistency across Flux 2 Dev, GPT Image 1.5, and Kontext Pro. Caption: performance comparison of AI image models in generating consistent character references.

This benchmark covers a specific case: two recurring characters, 90s comic book style, automated pipeline. Midjourney wasn't tested (no fal.ai API). Results may differ on other styles. But /edit with multiple image_urls remains the necessary condition regardless.

Paying 6x more gave me superhero gloves I never asked for.

Four Traps That Break Character Consistency Even With the Right Model

Right model, wrong setup. I hit every single one of these, and each one cost me more time than finding the right model in the first place.

The @image1/@image2 swap. One Monday I noticed all generated covers had the characters on the wrong side. The office worker had the cape. The hero was wearing a tie. Like a buddy cop movie where the costume department mixed up the actors. Same prompt, same refs, but @image1 and @image2 were inverted in the template. Found the commit (6c70e20), realized the mapping got silently swapped during a refactor. No test caught it because nobody tests "is the right character wearing the right outfit." Now I do.
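A regression check for that swap can be very small. This is a sketch, not the actual test from the pipeline: `TaggedRef`, `findSlotMismatches` and the role labels are hypothetical, and it assumes each reference carries a role tag that can be compared against what the template expects at each @imageN slot.

```typescript
// Hypothetical shapes: the real template format is not shown in this article.
interface TaggedRef {
  role: string; // e.g. "hero" or "office-worker"
  url: string;
}

// Compares the role the template expects at each @imageN slot with the role
// of the reference actually sent in that position. Returns one message per
// mismatch; an empty array means the mapping is intact.
function findSlotMismatches(expectedRoles: string[], refs: TaggedRef[]): string[] {
  const problems: string[] = [];
  const slots = Math.max(expectedRoles.length, refs.length);
  for (let i = 0; i < slots; i++) {
    const expected = expectedRoles[i];
    const ref = refs[i];
    const actual = ref ? ref.role : undefined;
    if (expected !== actual) {
      problems.push(`@image${i + 1}: expected ${expected}, got ${actual}`);
    }
  }
  return problems;
}
```

Run against the template on every commit, a check like this turns a silent swap into a failing build. It does not look at pixels, only at the wiring.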

Dimensions in the prompt. When the LLM generates the image prompt, it sometimes writes "1536x1024px cinematic scene." The image model then tries to render "1536x1024px" as literal text inside the image. Beautiful. Fix: sanitizePrompt() strips dimension patterns via regex. Actual dimensions go through the image_size API parameter only.
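One way such a filter might look. The name sanitizePrompt() is from the pipeline; the regex and the cleanup steps are an assumed reconstruction, not the original code. It targets forms like "1536x1024", "1536 x 1024 px" and "1536×1024px", and it would also strip things like "16x10", so adjust before reuse.

```typescript
// Hypothetical reconstruction: only the name sanitizePrompt() is known.
// Two to five digits, an "x" or "×", two to five digits, optional unit.
const DIMENSION_PATTERN =
  /\b\d{2,5}\s*[x×]\s*\d{2,5}\s*(?:px|pixels?)?(?![A-Za-z0-9])/gi;

function sanitizePrompt(prompt: string): string {
  return prompt
    .replace(DIMENSION_PATTERN, "") // drop the dimension itself
    .replace(/,\s*,/g, ",")         // merge the commas left around it
    .replace(/\s{2,}/g, " ")        // collapse leftover double spaces
    .trim();
}
```

For example, `sanitizePrompt("1536x1024px cinematic scene")` returns "cinematic scene". The real size still has to reach the model through the image_size parameter.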

Data URI vs HTTP URL. Some models (Kontext, Ideogram) fail silently with base64 data URIs. They need HTTP URLs pointing to publicly accessible files. Fix: ensureHttpUrl() uploads the reference to cloud storage before the API call. Two lines of code, and three days of debugging to discover they were necessary.
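A sketch of that guard, with the storage side left abstract. Only the name ensureHttpUrl() is from the pipeline. The `Uploader` type and the injected `upload` function are assumptions, since the storage provider is not specified here.

```typescript
// Hypothetical reconstruction: only the name ensureHttpUrl() is known.
// The uploader is injected so any storage backend can be plugged in.
type Uploader = (dataUri: string) => Promise<string>; // resolves to a public URL

async function ensureHttpUrl(ref: string, upload: Uploader): Promise<string> {
  if (/^https?:\/\//i.test(ref)) {
    return ref; // already reachable over HTTP, nothing to do
  }
  if (/^data:/i.test(ref)) {
    return upload(ref); // push the bytes to storage, get a URL back
  }
  throw new Error("Reference is neither an HTTP URL nor a data URI");
}
```

Rejecting anything that is neither form is deliberate: a local file path slipping through would reproduce the same silent failure one layer down.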

Expensive means over-interpretation. Flux 2 Max doesn't just generate what you ask. It improves your request by adding elements it thinks should belong in a comic book scene. That's not a bug, that's a design philosophy. And that design philosophy is incompatible with pipeline use where you need the exact same output style every time. The same spec-first reasoning I use with prompt contracts applies: test assumptions before building around a model, not after.

Four lines of code cost less than three weeks of adjusting by feel.

Six Prompting Rules That Actually Move the Needle

Once the model and the pipeline are right, these six rules made a visible difference. All discovered by testing.

1. Subject first.

Instead of: "90s comic book style illustration of two characters standing in a server room"

Write: "Two characters standing in a server room, @image1 on the left pointing at a screen, comic book style"

Flux models weight the beginning of the prompt more heavily. Whatever comes first gets the most attention. Put the scene and action before the style.

2. Name the characters on top of the refs. @image1 injects the face, but adding a name ("Phil the developer", "Captain Compliance") anchors the pose and expression. The model treats named entities differently from anonymous references. I did not expect this to matter, but it does.

3. Limit complexity. Max three panels, two speech bubbles, eight words per bubble. Beyond that the model sacrifices character consistency to respect the composition. It can't do everything so it drops the hardest part first. The hardest part is always the faces.

4. Zero dimensions in the prompt text. Already mentioned in the traps. Worth repeating: never write pixel dimensions in the prompt. Use the image_size API parameter. Always.

5. Branding in-world, not overlay.

Instead of: "watermark @rentierdigital bottom right"

Write: "@rentierdigital engraved on the bezel of a monitor in the background"

Make the brand part of the scene, not a post-processing instruction. The model doesn't understand "overlay." It understands objects in a scene.

6. guidance_scale at 3.5 for Flux 2 Dev. Below 2, the model ignores half your prompt. Above 5, artifacts and over-saturation. For speed variants (Turbo, Flash), 2.5 with 8 steps is enough. I found this by generating the same prompt at every value from 1 to 7. Not glamorous, but it works.
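The values from rule 6 can be pinned in code so nobody retunes them by hand. A minimal sketch: `guidance_scale` is the parameter named above, while `num_inference_steps` is an assumed name for "steps" and should be checked against the endpoint schema. `samplingSettingsFor` and `guidanceSweep` are hypothetical helpers, and the sweep step size is an assumption too.

```typescript
// Values are the ones reported in rule 6. "num_inference_steps" is an
// assumed parameter name for "steps"; verify it before use.
interface SamplingSettings {
  guidance_scale: number;
  num_inference_steps?: number;
}

function samplingSettingsFor(model: "dev" | "turbo" | "flash"): SamplingSettings {
  if (model === "dev") {
    return { guidance_scale: 3.5 };
  }
  // Speed variants: lower guidance, fixed step count.
  return { guidance_scale: 2.5, num_inference_steps: 8 };
}

// Values to try when repeating the 1-to-7 sweep on another model.
function guidanceSweep(from: number, to: number, step: number): number[] {
  const values: number[] = [];
  for (let i = 0; from + i * step <= to + 1e-9; i++) {
    values.push(Math.round((from + i * step) * 100) / 100);
  }
  return values;
}
```

`guidanceSweep(1, 7, 0.5)` yields thirteen values from 1 to 7, including 3.5. Same prompt, same refs, one generation per value, then compare by eye.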

A prompt that tries to specify everything gives the model permission to pick which parts to ignore. Pick your battles.
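The limits from rule 3 are easy to enforce before a prompt ever reaches the model. A sketch under assumptions: `SceneSpec` and `checkComplexity` are hypothetical, and it presumes the pipeline has a structured scene description to inspect and not only a free-text prompt. The three numbers are the ones from rule 3.

```typescript
// Hypothetical scene description; the limits are the ones from rule 3.
interface SceneSpec {
  panels: number;
  speechBubbles: string[]; // one string per bubble
}

const LIMITS = { panels: 3, bubbles: 2, wordsPerBubble: 8 };

// Returns one warning per broken limit; an empty array means the scene fits.
function checkComplexity(scene: SceneSpec): string[] {
  const warnings: string[] = [];
  if (scene.panels > LIMITS.panels) {
    warnings.push(`${scene.panels} panels (max ${LIMITS.panels})`);
  }
  if (scene.speechBubbles.length > LIMITS.bubbles) {
    warnings.push(`${scene.speechBubbles.length} speech bubbles (max ${LIMITS.bubbles})`);
  }
  scene.speechBubbles.forEach((text, i) => {
    const words = text.trim().split(/\s+/).filter((w) => w.length > 0).length;
    if (words > LIMITS.wordsPerBubble) {
      warnings.push(`bubble ${i + 1} has ${words} words (max ${LIMITS.wordsPerBubble})`);
    }
  });
  return warnings;
}
```

Whether a warning blocks the generation or just gets logged is a pipeline decision. Either way, the faces stop being the first thing the model sacrifices.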

The Bottom Line

Two models out of thirteen. That's the ratio. Not because they're "better" in any general sense, but because they natively support /edit with multiple image_urls. The entire ranking follows from that one API detail. Not from image quality, not price, not the model's reputation.

Before picking an image model for any content pipeline, one question: does the endpoint accept multiple image_urls in /edit mode? If not, everything else is burned time. Test with the same prompt, the same references, three consecutive generations. Not one.

The only way to get consistent characters is to feed the model a reference image. Otherwise you can spend two days writing the perfect prompt. Won't change a thing. The answer was in the API docs, page 2, parameter image_urls. Character consistency is not a prompting problem. It's a documentation-reading problem.

Sources: fal.ai model documentation for endpoint specs and pricing.

(*) The cover is AI-generated. Which, given the article, you probably already guessed. The irony is that it took three tries to get the characters right on this one too.

