Side-by-side comparison of AI-generated vintage 1970s photographs of Fushimi Inari by Nano Banana 2 (9.5/10), MiniMax (5.5/10), and CogView-4 (4/10) — same prompt, dramatically different results

TL;DR — 5 Models, 26 Images, Two Benchmarks

We tested GPT (DALL-E 3), Grok Aurora, Nano Banana 2 (Gemini), MiniMax image-01, and CogView-4 across two real production workflows: vintage 1970s Kodachrome film photography (18 images) and iPhone-realistic phone photos for Instagram Reels (8 images).

The biggest differentiator wasn't image quality — it was prompt adherence. Most models produce decent-looking images. Only Nano Banana 2 consistently does what you actually ask. It scored 8.8/10 for vintage photography, 9/10 for crowd scenes, and costs ~$0.02/image. For portraits, MiniMax leads. For an all-rounder, Grok. GPT finished last for both use cases.

Why This Test Matters

Most AI image generation comparisons test generic prompts — "a cat wearing a hat," "futuristic cityscape." That tells you which model makes the prettiest pictures. It doesn't tell you which model follows instructions.

We needed to know something different. We build AI-generated travel itineraries at tabiji, and our Instagram Reels need two very specific types of images: vintage film photographs styled to look like 1970s Kodachrome slides, and iPhone-realistic phone photos that pass as candid tourist shots. Both require strict stylistic control — specific film grain, era-appropriate composition, accurate text rendering, and deliberate "imperfection."

If a model can't follow a detailed brief for vintage Kodachrome, it can't follow a detailed brief for anything. This test is a proxy for prompt adherence across any demanding creative workflow — whether you're building mockups, infographics, social media content, or production imagery at scale.

The 5 Models We Tested

| Model | Provider | Cost/Image | Max Resolution | Tests |
| --- | --- | --- | --- | --- |
| Nano Banana 2 | Google (Gemini 3.1 Flash) | ~$0.02 | Up to 4K | Kyoto + Verona |
| MiniMax image-01 | MiniMax | ~$0.04 | Up to 2K | Kyoto + Verona |
| CogView-4 | Z.AI (Zhipu AI) | ~$0.02 | 720×1440 | Kyoto only |
| Grok Aurora | xAI | ~$0.07 | Up to 2K | Verona only |
| GPT (DALL-E 3) | OpenAI | ~$0.04–$0.08 | 1024×1792 | Verona only |

We ran two separate benchmarks. Test 1 (Kyoto) tested vintage film photography across 4 iconic landmarks with 3 models — Nano Banana 2, MiniMax, and CogView-4. Test 2 (Verona) tested iPhone-realistic phone photos across 2 scene types with 4 models — adding GPT and Grok while dropping CogView-4.

Test 1: Vintage 1970s Kyoto — Can AI Fake Film Photography?

We ran three models through four Kyoto landmarks — Fushimi Inari, Kinkaku-ji, Arashiyama, and Gion — with prompts requesting 1970s Kodachrome film aesthetic: warm saturated midtones, vintage grain, amateur composition, and era-appropriate details. Same prompt, same day, direct comparison. Here's Fushimi Inari — the most revealing of the four:

AI-generated vintage 1970s photograph of Fushimi Inari torii gates in Kyoto, created by Nano Banana 2 (Gemini 3.1 Flash) — rated 9.5/10 for authenticity
Nano Banana 2 9.5/10
AI-generated vintage photograph of Fushimi Inari torii gates by MiniMax image-01 — warm color science but wrong POV, rated 5.5/10
MiniMax 5.5/10
AI-generated photograph of Fushimi Inari by CogView-4 — cinematic modern look instead of vintage, garbled kanji, rated 4/10
CogView-4 4/10

▶ Animated — Vintage POV Reel Clips

Nano Banana 2 — Animated
MiniMax — Animated

These clips show the same images after Remotion rendering — with film grain, vignette, and slow drift animation applied.
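The Remotion layer itself is React/TypeScript; as a rough Python sketch of the same math, here is what the vignette and slow-drift passes boil down to per pixel and per frame. All parameter names and values below are ours, not Remotion's:

```python
import math

def vignette_gain(x, y, w, h, strength=0.45):
    """Multiplicative vignette gain: ~1.0 at the frame center, darker toward corners.

    `strength` is a hypothetical knob (0 = no vignette, 1 = black corners).
    """
    nx = (x / (w - 1)) * 2 - 1  # map pixel coords to -1..1
    ny = (y / (h - 1)) * 2 - 1
    r = math.sqrt(nx * nx + ny * ny) / math.sqrt(2)  # 0 at center, 1 at corner
    return 1.0 - strength * r * r

def drift_offset(frame, fps=30, px_per_s=4.0):
    """Slow horizontal drift: pan the crop window a few pixels per second."""
    return frame / fps * px_per_s

# A corner pixel is darkened; the center is untouched; at 2s the crop has moved 8px.
print(round(vignette_gain(0, 0, 1080, 1920), 2))  # 0.55
print(drift_offset(60))                            # 8.0
```

In the real pipeline these would map onto Remotion's frame-by-frame interpolation, with the grain pass layered on top.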

The pattern was consistent across all four landmarks. Nano Banana 2 produced the most convincing vintage imagery every time — correct Kodachrome warm tones, legible Japanese kanji (you can read "奉" and "納" at Fushimi Inari), authentic amateur composition, and proper architectural proportions. MiniMax had beautiful color science but consistently ignored POV instructions and produced compositions too polished for a tourist snapshot. CogView-4 defaulted to modern cinematic aesthetics regardless of the prompt — orange-teal color grading, HDR dynamic range, garbled kanji.

Here are Kinkaku-ji and Arashiyama — the same story:

AI-generated vintage 1970s photograph of Kinkaku-ji Golden Pavilion reflected in mirror pond, created by Nano Banana 2 — muted gold tones, rated 8/10
NB2 — Kinkaku-ji 8/10
AI-generated photograph of Kinkaku-ji by MiniMax image-01 — warm tones but overly polished composition, rated 6/10
MiniMax — Kinkaku-ji 6/10
AI-generated photograph of Kinkaku-ji by CogView-4 — modern HDR look with no vintage character, rated 3/10
CogView-4 — Kinkaku-ji 3/10

▶ Animated — Kinkaku-ji Reel Clips

Nano Banana 2 — Animated
MiniMax — Animated
AI-generated vintage 1970s photograph of Arashiyama bamboo grove in Kyoto by Nano Banana 2 — olive-shifted greens mimicking Kodachrome, rated 8.5/10
NB2 — Arashiyama 8.5/10
AI-generated photograph of Arashiyama bamboo grove by MiniMax — warm tones but too vivid for vintage, rated 6/10
MiniMax — Arashiyama 6/10
AI-generated photograph of Arashiyama bamboo grove by CogView-4 — oversaturated modern cinematic look, rated 2.5/10
CogView-4 — Arashiyama 2.5/10

▶ Animated — Arashiyama Reel Clips

Nano Banana 2 — Animated
MiniMax — Animated

The Black & White Test That Broke Two Models

This was the most revealing test in our entire benchmark. We asked for "absolutely no color whatsoever, silver gelatin print" — a geisha walking through Gion's lantern-lit streets at dusk, shot on black and white film.

AI-generated black and white silver gelatin photograph of Gion district in Kyoto by Nano Banana 2 — stunning monochrome with deep blacks and luminous lanterns, rated 9/10
Nano Banana 2 9/10
MiniMax failed black and white test for Gion evening scene — rendered in full color despite explicit B&W prompt, rated 2/10
MiniMax 2/10
CogView-4 completely ignored black and white instruction for Gion scene — bright orange and red colors, rated 0/10 for prompt adherence
CogView-4 0/10

▶ Animated — Gion Evening Reel Clips

Nano Banana 2 — Animated
MiniMax — Animated

Nano Banana 2 delivered a stunning silver gelatin print. Pure monochrome, zero color bleed. Deep inky blacks in the machiya facades, luminous lantern highlights, visible grain consistent with Tri-X film stock. The kind of image you'd expect in a Daidō Moriyama photobook.

MiniMax rendered in full color. Warm amber lanterns, teal shadows. Attractive? Sure. But the prompt said "absolutely no color whatsoever." It ignored that completely.

CogView-4 was the worst offender. Bright orange lanterns, vivid red obi accents, warm pavement reflections. Not just "not black and white" — aggressively, blatantly colorful.

Prompt adherence is the single most important differentiator between AI image models. Most models produce decent images. Only some actually do what you ask. When we said "black and white," Nano Banana 2 gave us black and white. The other two gave us whatever they felt like.

Test 2: Phone Camera Realism — Verona

Vintage film is one use case. The other half of our pipeline needs images that pass as real iPhone photos — handheld grain, natural depth of field, authentic crowd behavior, legible signage. For this test, we added GPT (DALL-E 3) and Grok (Aurora) while dropping CogView-4. Two prompts, four models, eight images.

Which AI model handles crowd scenes best?

Prompt: An exhausted female tourist photographed from behind in a dense crowd outside Casa di Giulietta in Verona. iPhone photo quality — handheld, natural grain, depth of field, motion blur. Juliet's bronze statue visible ahead. Tourist overwhelm is palpable.

GPT DALL-E 3 generated image of Juliet's House crowd scene — over-cinematic, garbled text on signs, porcelain-smooth faces, rated 6/10
GPT (DALL-E 3) 6/10
Grok Aurora generated image of Juliet's House crowd scene — incredibly expressive exhausted woman, natural crowd density, rated 8.5/10
Grok (Aurora) 8.5/10
Nano Banana 2 generated image of Juliet's House crowd scene — most photorealistic, legible Juliet text, correct Fjällräven backpack, rated 9/10
Nano Banana 2 9/10 🏆
MiniMax generated image of Juliet's House crowd scene — too dark, bokeh too cinematic, reads as dangerous alley not tourist trap, rated 7/10
MiniMax 7/10

Nano Banana 2 won decisively (9/10). The most photorealistic result — "Juliet" text on souvenirs is legible, the Fjällräven backpack logo on the main subject is rendered correctly (the model knew the specific brand), and a person holding a phone in the foreground adds authentic casual detail. Hardest to identify as AI.

Grok Aurora came close (8.5/10). The exhausted woman is incredibly expressive and natural — best emotional storytelling of any model. Some face dissolution in deeper crowd, but the focal subject is flawless.

MiniMax (7/10) was too dark — bokeh and lighting read as Sony A7III, not iPhone. The scene looks like a dangerous alley, not a tourist trap. GPT (6/10) garbled all sign text ("SLOLEROCVEE" instead of readable Italian) and produced porcelain-smooth faces immediately identifiable as AI.

Is GPT good for portrait-style travel photos?

Prompt: Two Italian men in their 50s laughing and drinking wine at an outdoor café in Piazza delle Erbe, Verona. Candid iPhone shot — natural light, movement blur, authentic body language. Amarone bottle visible. Medieval tower in background.

GPT DALL-E 3 generated image of Italian men at Piazza delle Erbe café — over-saturated, hyper-rendered, uncanny faces, tourism ad look, rated 4/10
GPT (DALL-E 3) 4/10
Grok Aurora generated image of Italian men at Piazza delle Erbe café — best authentic mood, convincingly Italian men, natural laughter, rated 7/10
Grok (Aurora) 7/10
Nano Banana 2 generated image of Italian men at Piazza delle Erbe café — best prompt adherence, legible Amarone label, wine glass + bottle, rated 6.5/10
Nano Banana 2 6.5/10
MiniMax generated image of Italian men at Piazza delle Erbe café — most photographically convincing, best faces, natural lighting, documentary feel, rated 8/10
MiniMax 8/10 🏆

MiniMax won this round (8/10). Most photographically convincing — the men look like real, specific humans rather than generically "Italian-looking" AI faces. Natural lighting, documentary feel, intimate mood. MiniMax's cinematic tendencies, which hurt it in crowd scenes, produced exactly the right result for portraits.

Grok (7/10) had the best authentic mood — natural laughter, convincingly Italian men. Minor tells: matching sweaters and missing wine glasses. Nano Banana 2 (6.5/10) had the best prompt adherence (legible Amarone label, wine glass + bottle present) but the scene felt staged. GPT (4/10) was dead last — over-saturated, hyper-rendered, immediately recognizable as AI.

📱 Key Finding: Different Models Win Different Scene Types

Nano Banana 2 dominates crowd and location scenes — text rendering, brand accuracy, and iPhone aesthetic make it the clear winner for anything with signage, landmarks, or branded items.

MiniMax dominates intimate people/portrait scenes — when the faces are the subject, MiniMax's human rendering is unmatched.

Grok is the strongest all-rounder — close second in both categories, best emotional storytelling overall. If you're picking one model for everything, pick Grok.

GPT/DALL-E 3 is dead last for photorealism. Garbled text, over-rendered aesthetics, and uncanny faces make it unsuitable for content that needs to pass as real.

Prompt Engineering Makes or Breaks It

After the initial Kyoto round, we rewrote our prompts with much more specific technical detail. The improvement was dramatic — but only on models that actually listen.

  • V1 (vague): "Shot on Kodachrome film, vintage feel"
  • V2 (specific): "Warm saturated midtones, slightly cool shadows, limited dynamic range with clipped highlights and blocked shadows. Include scan artifacts: dust specks, hair, scratches. Amateur composition, slightly off-center. Chromatic aberration, lens softness at edges. No borders, no frame edges."
Nano Banana 2 Fushimi Inari with basic V1 prompt — good vintage feel but missing physical film artifacts
Nano Banana 2 — V1 Prompt
Nano Banana 2 Fushimi Inari with detailed V2 prompt — added Kodachrome film rebate markings, exposure-dependent grain, and scan artifacts
Nano Banana 2 — V2 Prompt

The V2 version is dramatically more convincing. Nano Banana 2 responded by adding Kodachrome film rebate markings along the frame edge — "12 KODACHROME" in the characteristic orange-on-black typography, with frame numbers and orientation arrows. These are technically accurate references to how real Kodachrome slides look when scanned from their original mounts. The grain also became exposure-dependent (clumping in shadows, finer in highlights) — exactly how real silver halide crystals behave on actual film.
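The exposure-dependent grain behavior is easy to state precisely: noise amplitude scales inversely with luminance, so shadows get coarse grain and highlights stay clean. A minimal sketch of that relationship (our own toy model on a row of 0–255 grayscale values, not anything from a real film-emulation library):

```python
import random

def add_exposure_dependent_grain(luma_row, base_sigma=6.0, seed=42):
    """Add Gaussian noise whose amplitude grows in the shadows, as on real film.

    Illustrative only: silver halide grain reads coarser in underexposed areas,
    so sigma is scaled inversely with luminance (0-255 grayscale values).
    """
    rng = random.Random(seed)
    out = []
    for v in luma_row:
        sigma = base_sigma * (1.5 - v / 255.0)  # darker pixel -> larger sigma
        noisy = v + rng.gauss(0.0, sigma)
        out.append(min(255, max(0, round(noisy))))
    return out

row = [30] * 500 + [220] * 500        # shadow half, highlight half
noisy = add_exposure_dependent_grain(row)
```

Averaged over the row, the shadow half ends up with noticeably larger deviations than the highlight half, which is the "clumping in shadows, finer in highlights" signature Nano Banana 2 reproduced.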

MiniMax Fushimi Inari with basic V1 prompt — warm tones but modern composition
MiniMax — V1 Prompt
MiniMax Fushimi Inari with detailed V2 prompt — warmer and moodier but still unable to produce physical film artifacts
MiniMax — V2 Prompt

MiniMax improved moderately — warmer tones, a subtle light leak — but couldn't produce the physical film artifacts that V2 prompts requested. No border markings, no scan lines, no stock-specific text. Better prompts made it warmer and moodier, but couldn't make it look like actual film.

The gap between a lazy prompt and a detailed one is bigger than the gap between models. But only on models that actually follow instructions. Nano Banana 2's prompt engineering ceiling is virtually unlimited. MiniMax improves incrementally. CogView-4 largely ignores the details.
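Making the V1-to-V2 jump repeatable is mostly a matter of composing prompts from reusable directive lists instead of typing them fresh each time. A small helper we might use for this (the function and constant names are ours, not part of any SDK):

```python
# Hypothetical prompt builder: layer technical directives onto a base scene.
BASE_SCENE = "A vintage 1970s Kodachrome photograph of Fushimi Inari torii gates, Kyoto"

KODACHROME_V2 = [
    "Warm saturated midtones, slightly cool shadows",
    "Limited dynamic range with clipped highlights and blocked shadows",
    "Include scan artifacts: dust specks, hair, scratches",
    "Amateur composition, slightly off-center",
    "Chromatic aberration, lens softness at edges",
    "No borders, no frame edges",
]

def build_prompt(scene, directives):
    """Join a scene description with period-separated technical directives."""
    return scene + ". " + ". ".join(directives) + "."

prompt = build_prompt(BASE_SCENE, KODACHROME_V2)
```

Swapping `BASE_SCENE` per landmark while holding `KODACHROME_V2` fixed is what made the four-landmark comparison apples-to-apples.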

Combined Scorecard: All 5 Models

| Category | Nano Banana 2 | MiniMax | CogView-4 | Grok Aurora | GPT (DALL-E 3) |
| --- | --- | --- | --- | --- | --- |
| Prompt Adherence | 9.5/10 | 5/10 | 2/10 | 7/10 | 5/10 |
| Vintage Film | 8.8/10 | 5.9/10 | 3.6/10 | n/a | n/a |
| Crowd Scenes | 9/10 | 7/10 | n/a | 8.5/10 | 6/10 |
| People/Portraits | 6.5/10 | 8/10 | n/a | 7/10 | 4/10 |
| Text / Kanji Accuracy | 8/10 | 4/10 | 1.5/10 | 6/10 | 3/10 |
| Prompt Eng. Ceiling | 10/10 | 5/10 | 3/10 | 7/10 | 5/10 |
| Emotional Storytelling | 7/10 | 7/10 | 6/10 | 9/10 | 5/10 |
| Cost per Image | ~$0.02 | ~$0.04 | ~$0.02 | ~$0.07 | ~$0.04–$0.08 |

(n/a: CogView-4 ran only in the Kyoto test; Grok and GPT ran only in Verona.)

What to Use for What

After 26+ images across 5 models, here's the decision tree:

  • Crowd scenes, landmarks, anything with text or signage → Nano Banana 2. Nothing else comes close for photorealism + text accuracy.
  • Close-up portraits, people-focused shots → MiniMax image-01. Best faces, most natural human rendering.
  • All-rounder / only picking one model → Grok Aurora. Strong in both categories, best emotional storytelling.
  • Vintage film photography → Nano Banana 2. The only model that treats stylistic constraints as instructions.
  • Generic social media / "just make it look good" → Grok or MiniMax. Both produce visually striking content without much prompt engineering.
  • Content with CJK text (Japanese, Chinese, Korean) → Nano Banana 2. Only reliable option for correct text rendering.
  • Not recommended: GPT/DALL-E 3 for anything that needs to pass as a real photo. Over-rendered aesthetic is immediately identifiable as AI.
  • Not recommended: CogView-4 for anything requiring stylistic control. Beautiful cinematic images, but ignores what you ask for.

Pricing Comparison

| Factor | Nano Banana 2 | MiniMax | CogView-4 | Grok Aurora | GPT (DALL-E 3) |
| --- | --- | --- | --- | --- | --- |
| Cost per image | ~$0.02 | ~$0.04 | ~$0.02 | ~$0.07 | ~$0.04–$0.08 |
| 6-image Reel cost | ~$0.12 | ~$0.24 | ~$0.12 | ~$0.42 | ~$0.24–$0.48 |
| Max resolution | Up to 4K | Up to 2K | 720×1440 | Up to 2K | 1024×1792 |
| Free tier | Yes (Gemini API) | Limited | Limited | Limited | No |
| API complexity | Moderate (SDK) | Simple REST | Simple REST | Simple REST | Simple REST |
| Latency | ~8–15s | ~10–20s | ~15–25s | ~10–15s | ~10–20s |

Nano Banana 2 wins on price-to-quality ratio. Same cost as CogView-4 with dramatically better results. Less than half the cost of Grok with comparable or better output for most use cases. The Gemini API free tier means you can experiment before spending anything.
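The per-Reel figures in the table are just cost-per-image times clip count, but in production you also pay for regenerations when an image fails review. A quick sanity-check helper (the `retries` knob is ours, reflecting our own workflow):

```python
# Approximate per-image costs (USD) from the pricing table.
COST_PER_IMAGE = {
    "nano-banana-2": 0.02,
    "minimax-image-01": 0.04,
    "cogview-4": 0.02,
    "grok-aurora": 0.07,
}

def reel_cost(model, images_per_reel=6, retries=0):
    """Cost of one Reel; `retries` counts regenerated images for rejects."""
    return COST_PER_IMAGE[model] * (images_per_reel + retries)

print(round(reel_cost("nano-banana-2"), 2))            # 0.12
print(round(reel_cost("grok-aurora"), 2))              # 0.42
print(round(reel_cost("minimax-image-01", retries=2), 2))  # 0.32
```

Even with a couple of retries per Reel, every model here stays well under a dollar; the differentiator is quality, not budget.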

How to Use These Models (Code Examples)

All five models are accessible via API. Here's how to call each one.

Nano Banana 2 (Google Gemini 3.1 Flash Image)

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-3.1-flash-image-preview")

response = model.generate_content(
    "A vintage 1970s Kodachrome photograph of Fushimi Inari torii gates. "
    "Warm saturated midtones, slightly cool shadows, limited dynamic range. "
    "Include scan artifacts: dust specks, scratches. Amateur composition.",
    generation_config=genai.GenerationConfig(
        response_modalities=["IMAGE", "TEXT"]
    )
)

# Save the image
for part in response.candidates[0].content.parts:
    if part.inline_data:
        with open("output.png", "wb") as f:
            f.write(part.inline_data.data)
```

MiniMax image-01

```shell
curl -X POST "https://api.minimax.chat/v1/image_generation" \
  -H "Authorization: Bearer YOUR_MINIMAX_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "image-01",
    "prompt": "A vintage 1970s photograph of Fushimi Inari torii gates...",
    "aspect_ratio": "9:16",
    "response_format": "url"
  }'
```

CogView-4 (Z.AI)

```shell
curl -X POST "https://open.bigmodel.cn/api/paas/v4/images/generations" \
  -H "Authorization: Bearer YOUR_ZAI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cogview-4-250304",
    "prompt": "A vintage 1970s photograph of Fushimi Inari torii gates...",
    "size": "720x1440"
  }'
```

Grok (xAI Aurora)

```shell
curl -X POST "https://api.x.ai/v1/images/generations" \
  -H "Authorization: Bearer YOUR_XAI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "grok-2-image-1212",
    "prompt": "Phone camera photo of exhausted tourist in crowd...",
    "n": 1,
    "response_format": "url"
  }'
```

GPT / DALL-E 3 (OpenAI)

```shell
curl -X POST "https://api.openai.com/v1/images/generations" \
  -H "Authorization: Bearer YOUR_OPENAI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dall-e-3",
    "prompt": "Phone camera photo of exhausted tourist in crowd...",
    "n": 1,
    "size": "1024x1792",
    "quality": "standard"
  }'
```

For production use, start with Google AI Studio (free tier includes Gemini image generation) and experiment with detailed prompts before scaling up.
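Several of the REST APIs above can return the image as a URL rather than inline bytes. A small stdlib-only helper for fetching that URL to disk; note the JSON shape shown here is the OpenAI-style `{"data": [{"url": ...}]}` layout, and MiniMax/Grok responses may nest the URL differently:

```python
import json
import urllib.request

def first_image_url(response_body, field="url"):
    """Pull the first image URL out of a JSON response body.

    Assumes an OpenAI-style shape: {"data": [{"url": ...}]}.
    Adjust the path for providers that nest the result differently.
    """
    payload = json.loads(response_body)
    return payload["data"][0][field]

def save_image_from_url(url, path):
    """Download the generated image and write it to `path`; returns byte count."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    with open(path, "wb") as f:
        f.write(data)
    return len(data)
```

Generated image URLs are typically short-lived, so download immediately rather than storing the URL.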

Full Reels: Side by Side

Numbers and screenshots only tell part of the story. Here are the complete assembled Reels — the actual final output of our vintage POV pipeline. Each Reel sequences all Kyoto scenes with Remotion-rendered film grain, vignette, slow drift animation, and text overlays.

🏆 Nano Banana 2 — Full Kyoto Reel (21s)
MiniMax — Full Kyoto Reel (21s)

Both Reels use identical Remotion compositions and timing. The only difference is the source images. Notice how Nano Banana 2's vintage authenticity carries through to the animated version — the film grain and warm tones feel cohesive, while MiniMax's modern rendering creates a subtle disconnect with the vintage effects layer.


All images in this comparison were generated from identical prompts on the same day (March 10, 2026). No post-processing was applied. The images shown are direct outputs from each model's API.