Six-way comparison grid of Wan 2.7, Kling 3.0, Kling O3, Seedance 2.0, Veo 3.1 Lite, and Grok Imagine on the same Cancun travel-reel prompt

I wanted a brutally practical answer to a simple question: what are the best AI video models right now if I am actually trying to build travel reels, not just admire pretty demos? So I ran six text-to-video models through the exact same handheld Cancun travel prompt, uploaded the outputs, then reran the two clear winners on a second Puerto Morelos “local alternative” shot.

The result was not a one-model blowout. It was a split decision. Seedance 2.0 and Veo 3.1 Lite were both excellent, but for slightly different reasons. Seedance gave me the more lived-in, local-feeling output in the second round. Veo 3.1 Lite gave me the cleanest overall production baseline and the safest starting point if I just want a model that keeps doing the obvious thing well.

TL;DR

If I am doing fast model bake-offs or building an agentic content pipeline, I would start with WaveSpeed. It gives me one surface to hit a lot of strong models quickly.

If I specifically want Veo 3.1 Lite in production, I would still use Gemini directly for reliability. WaveSpeed's timeout window is tight at 180 seconds, which is workable for testing but not my favorite thing to bet a production pipeline on.

My split verdict: Veo 3.1 Lite is the best general default. Seedance 2.0 is the model I would reach for when I want more texture, messiness, and “this actually feels like phone footage” energy.

Why I ran this test

I am rebuilding a travel-reel format from scratch, starting with a very specific hook: the moment a traveler realizes the polished all-inclusive version of Mexico is not what they actually wanted. That meant I did not want generic cinematic beauty. I wanted prompt adherence, handheld believability, emotional clarity, and footage I could actually use in a short-form reel pipeline.

The first shot was a Cancun “tourist mistake” scene. The second shot, which I only ran on the winners, was the emotional opposite: a Puerto Morelos local-food-and-waterfront scene that feels calmer, truer, and more lived in. That is a much better stress test than asking models to render abstract fantasy prompts or glossy marketing footage.

What I tested

| Model | Endpoint | Run config | My quick take |
| --- | --- | --- | --- |
| Wan 2.7 | alibaba/wan-2.7/text-to-video | 720p, 6s, 9:16 | Surprisingly believable handheld energy, but it skewed more generic resort than emotionally precise regret. |
| Kling 3.0 | kwaivgi/kling-v3.0-std/text-to-video | 6s, 9:16 | Decent, but it felt cleaner and flatter than I wanted for this format. |
| Kling O3 | kwaivgi/kling-video-o3-std/text-to-video | 6s, 9:16 | More polished than Kling 3.0, still not the winner for this handheld travel use case. |
| Seedance 2.0 Fast | bytedance/seedance-2.0-fast/text-to-video | 480p, 5s, 9:16 | One of the two winners. Best “lived-in” feel in the second round. |
| Veo 3.1 Lite | google/veo3.1-lite/text-to-video | 720p, 6s, 9:16 | One of the two winners. Best all-around default and strongest overall structure. |
| Grok Imagine Video | x-ai/grok-imagine-video/text-to-video | 720p, 6s, 9:16 | Interesting budget option, but it did not hang with the top two for this benchmark. |
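
For anyone who wants to reproduce the bake-off, this is roughly the shape of my submission loop: one prompt, six endpoints, per-model run configs. The endpoint paths come straight from the table above, but the base URL, payload field names, and response shape are my assumptions about WaveSpeed's async API, so check them against the current docs before relying on this sketch.

```python
import os
import requests

# Assumed base URL and auth scheme for WaveSpeed's async API.
API_BASE = "https://api.wavespeed.ai/api/v3"
HEADERS = {"Authorization": f"Bearer {os.environ['WAVESPEED_API_KEY']}"}

# Endpoints and run configs from the table above. The field names
# ("resolution", "duration", "aspect_ratio") are assumptions.
RUNS = {
    "alibaba/wan-2.7/text-to-video":             {"resolution": "720p", "duration": 6, "aspect_ratio": "9:16"},
    "kwaivgi/kling-v3.0-std/text-to-video":      {"duration": 6, "aspect_ratio": "9:16"},
    "kwaivgi/kling-video-o3-std/text-to-video":  {"duration": 6, "aspect_ratio": "9:16"},
    "bytedance/seedance-2.0-fast/text-to-video": {"resolution": "480p", "duration": 5, "aspect_ratio": "9:16"},
    "google/veo3.1-lite/text-to-video":          {"resolution": "720p", "duration": 6, "aspect_ratio": "9:16"},
    "x-ai/grok-imagine-video/text-to-video":     {"resolution": "720p", "duration": 6, "aspect_ratio": "9:16"},
}

def submit(model: str, prompt: str, config: dict) -> str:
    """Fire one generation job and return its request id."""
    resp = requests.post(f"{API_BASE}/{model}", headers=HEADERS,
                         json={"prompt": prompt, **config}, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["id"]  # assumed response shape

if __name__ == "__main__":
    prompt = open("cancun_mistake_prompt.txt").read()
    job_ids = {model: submit(model, prompt, cfg) for model, cfg in RUNS.items()}
    print(job_ids)  # poll these ids for the finished MP4s (see the polling sketch later)
```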

Transparency notes

What this article is, and is not

  • Every example in this post is an actual output from my test runs, rehosted on zonted-media so it can be served directly from this article.
  • I ran the Cancun mistake prompt on all six models.
  • I ran the Puerto Morelos local prompt only on the top two models, because by that point the first round had already narrowed the field.
  • This is not a universal ranking for every possible video use case. It is a ranking for handheld travel-reel footage with strong prompt constraints.
  • I initially tried Kling 3.0 and Kling O3 with shot_type: intelligent because that was one of the settings I wanted to test. WaveSpeed rejected that value, so I reran both without that field. I am calling that out because “as transparent as possible” means including the integration friction too; I sketch the retry I used right after this list.
  • My scoring bias here is simple: if a clip is prettier but less usable for a real short-form travel narrative, I will rank the more usable clip higher.
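
Concretely, the workaround from the Kling note above looks like this: submit with the optional field, and if the wrapper rejects the payload, drop the field and resubmit once. It reuses API_BASE and HEADERS from the earlier sketch, and the 400/422 status check is an assumption about how WaveSpeed surfaces a validation failure.

```python
import requests

def submit_with_fallback(model: str, payload: dict, optional_field: str = "shot_type") -> str:
    """Submit a job; if the optional field is rejected, retry once without it.

    Reuses API_BASE and HEADERS from the bake-off sketch. The 400/422
    status check and response shape are assumptions, not documented API.
    """
    resp = requests.post(f"{API_BASE}/{model}", headers=HEADERS, json=payload, timeout=30)
    if resp.status_code in (400, 422) and optional_field in payload:
        trimmed = {k: v for k, v in payload.items() if k != optional_field}
        resp = requests.post(f"{API_BASE}/{model}", headers=HEADERS, json=trimmed, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["id"]
```

One retry is deliberate: if the trimmed payload also fails, I want the error to surface in my logs rather than loop silently.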

The exact Cancun prompt

This is the full prompt I used for the first round. I wanted a raw iPhone disappointment clip, not a luxury ad.

The full Cancun “tourist mistake” prompt:
Handheld phone camera, 9:16 vertical portrait. A tourist in a Cancun Hotel Zone all-inclusive resort is filming their own disappointment. The shot opens tight on a plate of sad resort food, a rubbery taco with fluorescent orange cheese sauce, wilted lettuce, a scoop of bland rice, sitting on a white plastic tray under harsh fluorescent overhead lights. The person holding the phone slowly pans the camera up and to the right, past the buffet line, chafing dishes with steam trays, a sneeze guard reflecting the harsh ceiling lights, a family in matching neon wristbands loading plates without enthusiasm. The camera continues panning past the buffet toward the floor-to-ceiling windows and the camera momentarily refocuses on the glass reflection, the filmmaker's own tired face visible for a split second, overhead fluorescents catching in their eyes. Then the focus shifts through the glass to reveal the resort pool area outside, rows and rows of identical blue lounge chairs packed shoulder to shoulder under a blazing Caribbean sun, a giant inflatable beer bottle floating in a crowded pool, a staff member in a resort polo setting up yet another generic pool game, bass from outdoor speakers you can almost feel through the glass. Every single person visible wears the same colored all-inclusive wristband. There is zero Mexican culture anywhere in frame, no local food, no Spanish signage, no sense of place, no authenticity. This could be a generic resort anywhere. The lighting shifts from harsh indoor fluorescent wash to overexposed Caribbean sunlight bleeding through the window, creating a blown-out, unappealing contrast. Slight camera wobble throughout, visible digital noise, iPhone snapshot aesthetic, raw, unpolished, a tourist documenting their own regret. Shot duration about 5 to 6 seconds. Emotional arc: plate of disappointment, mechanical buffet line, tired self-reflection in glass, then the resort prison revealed outside. Color palette is deliberately unappetizing, institutional white walls, fluorescent yellow-white light, plastic surfaces, garish tropical colors outside that feel fake rather than inviting.

All 6 outputs from the same prompt

Here are the exact model outputs side by side. If you care about direct inspection, each card also links to the raw MP4 I uploaded.

Wan 2.7

720p · 6s · Cancun mistake prompt

Wan gave me one of the more believable handheld-feeling results. I liked the wristband detail and buffet POV, but it still felt a little too generic-resort to be the final winner.

Raw MP4

Kling 3.0

6s · Cancun mistake prompt

Kling 3.0 was competent, but it read a bit flatter and more generic to me. It also came with the API quirk I mentioned above, which matters if I am judging operational smoothness too.

Raw MP4

Kling O3

6s · Cancun mistake prompt

Kling O3 looked more polished than Kling 3.0, but not in a way that helped this format. I wanted raw, emotional, handheld regret, and O3 still felt slightly too composed.

Raw MP4

Seedance 2.0 Fast

480p · 5s · Cancun mistake prompt

Seedance immediately felt competitive. Even at 480p, it looked convincingly photographic and it carried the “real person documenting a bad choice” vibe well enough to make the final round.

Raw MP4

Veo 3.1 Lite

720p · 6s · Cancun mistake prompt

Veo 3.1 Lite was the cleanest “yes, this understood the assignment” result. It had the strongest overall structure and was the safest production-ready default from round one.

Raw MP4

Grok Imagine Video

720p · 6s · Cancun mistake prompt

Grok was not embarrassing, which is more than I expected, but it was still clearly behind the top two. For this benchmark it felt more like an interesting cheap test path than a serious winner.

Raw MP4

My round-one ranking

  1. Veo 3.1 Lite for overall prompt adherence and usable structure.
  2. Seedance 2.0 for realism-per-dollar and strong handheld energy.
  3. Wan 2.7 for believable handheld texture, even if it was less emotionally precise.
  4. Kling O3 for polish, but not enough usable advantage.
  5. Kling 3.0 for being decent but unremarkable here.
  6. Grok Imagine for staying in the conversation, but not seriously threatening the leaders.

Winner round: Puerto Morelos local shot

Once the first round narrowed the field, I wanted a second prompt that flipped the emotional polarity. Instead of “tourist regret in a generic resort,” I wanted “quiet relief after escaping the trap.” This is the exact prompt I used for clip 2.

The full Puerto Morelos “local alternative” prompt:
Handheld phone camera, 9:16 vertical portrait. A traveler has just stepped off a cheap colectivo in Puerto Morelos, Mexico and is filming the moment they realize this is the experience they were actually looking for. The shot opens tight on a paper plate of fresh local ceviche at a small beachfront palapa, glossy lime-marinated fish, sliced red onion, cilantro, avocado, and bright green salsa, beads of condensation running down a cold glass bottle of Mexican Coke beside it on a weathered turquoise table. Natural ocean breeze lightly stirs the napkins. The person holding the phone slowly pans the camera up and to the left, revealing a relaxed open-air seafood stand with hand-painted signs in Spanish, a local woman behind the counter squeezing fresh lime over another plate, sun-faded Coca-Cola crates stacked near a cooler, a few plastic chairs, a sandy floor, and no polished resort aesthetics anywhere. The camera keeps moving past the palapa toward the street and briefly catches a passing white colectivo van, dusty from the road, then swings back toward the beach where the whole mood opens up, small colorful fishing boats anchored in shallow turquoise water, a few locals talking casually near the dock, palm shadows moving across the sand, and only a handful of beachgoers spread out peacefully with plenty of space. No spring break chaos, no giant pool floats, no matching wristbands, no fake entertainment energy. The place feels lived-in, local, calm, and unmistakably Mexican. The lighting is soft late-morning Caribbean sun, warm but natural, with bright reflections dancing on the water and gentle shade under the palapa roof. Slight camera wobble throughout, iPhone travel footage aesthetic, candid and unpolished, like someone documenting the exact moment they realize they escaped the tourist trap. Shot duration about 5 to 6 seconds. Emotional arc: fresh ceviche up close, humble local food stand, glimpse of the colectivo that got them here, then the peaceful Puerto Morelos waterfront revealed in full. Color palette is natural and inviting, turquoise water, faded painted wood, citrus greens, sun-worn whites, sandy beige, and the vivid colors of real coastal Mexico.

Seedance 2.0, clip 2

Puerto Morelos local prompt

This is where Seedance really won me over. It felt more lived-in, more candid, and more like someone actually stumbled into the right place with their phone out.

Raw MP4

Veo 3.1 Lite, clip 2

Puerto Morelos local prompt

Veo still looked excellent here, but in a more polished, slightly more arranged way. If I wanted the prettier thumbnail, I could absolutely argue for this output.

Raw MP4

To put the second-round difference simply: Seedance looked more discovered, Veo looked more designed. Both are useful. Which one is “better” depends on whether I value lived-in authenticity or polished controllability a bit more.

What I learned from doing it this way

1. A real benchmark prompt is more revealing than a flashy prompt. Travel reels break models in very specific ways. You are asking for phone-camera realism, environmental storytelling, culturally specific cues, messy public spaces, and emotional continuity inside a five-to-six second window. That is much harder than “cinematic woman in the rain.”

2. Handheld realism is still a separating skill. A lot of models can produce something impressive. Fewer can produce something that feels like it was actually filmed by a person on location rather than composed by a tasteful machine.

3. Operational friction matters. I do not just care about how a model looks in a vacuum. I care about whether I can hit it repeatedly, compare runs, and trust the wrapper. Kling's rejected shot_type value and WaveSpeed's 180-second timeout ceiling both belong in the evaluation, because production systems fail on boring details long before they fail on aesthetics. There is a deadline-aware polling sketch after this list.

4. Seedance and Veo 3.1 Lite are not redundant winners. They win for different reasons. Seedance is the more textured and human-feeling option in this test. Veo 3.1 Lite is the model I trust most when I want a clean, coherent baseline that is easy to ship.
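
Here is the deadline-aware polling from point 3, as a sketch. It reuses API_BASE and HEADERS from the bake-off sketch; the result path and status strings are assumptions about WaveSpeed's API. The important part is that the 180-second ceiling is handled explicitly instead of being discovered in production.

```python
import time
import requests

POLL_INTERVAL_S = 5
DEADLINE_S = 180  # WaveSpeed's timeout ceiling; budget for it instead of hanging

def wait_for_video(job_id: str) -> str:
    """Poll a job until it finishes, fails, or crosses the deadline.

    Reuses API_BASE and HEADERS from the bake-off sketch; the result
    path and status strings below are assumptions, not documented API.
    """
    deadline = time.monotonic() + DEADLINE_S
    while time.monotonic() < deadline:
        resp = requests.get(f"{API_BASE}/predictions/{job_id}/result",
                            headers=HEADERS, timeout=30)
        resp.raise_for_status()
        data = resp.json()["data"]
        if data.get("status") == "completed":
            return data["outputs"][0]  # URL of the finished MP4
        if data.get("status") == "failed":
            raise RuntimeError(f"job {job_id} failed: {data.get('error')}")
        time.sleep(POLL_INTERVAL_S)
    raise TimeoutError(f"job {job_id} crossed the {DEADLINE_S}s window; retry or reroute")
```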

My split verdict

If someone asked me today for the best AI video models, I would not give a fake one-word answer. I would give two answers.

Best platform for creators and agentic builders: WaveSpeed. I like having multi-model access in one place when I am doing comparison work like this, or when I want to keep an automated content pipeline flexible.

Best reliability recommendation for Veo 3.1 Lite specifically: use Gemini directly. The model is excellent, and I would rather not make a production system depend on a wrapper with a tight 180-second timeout if I have a cleaner direct path.
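
For that direct path, the google-genai SDK's long-running-operation pattern is roughly what I would use. The polling loop below matches the SDK's documented shape, but the model id string is my assumption, so verify the exact Veo 3.1 Lite identifier against the current Gemini model list.

```python
import time
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Assumption: the model id below; check Gemini's current model list for
# the exact Veo 3.1 Lite identifier.
operation = client.models.generate_videos(
    model="veo-3.1-lite",
    prompt=open("puerto_morelos_prompt.txt").read(),
    config=types.GenerateVideosConfig(aspect_ratio="9:16"),
)

# Video generation is a long-running operation; poll on my own schedule,
# with no wrapper-imposed 180-second ceiling.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("veo_clip.mp4")
```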

At the model level, my answer is also split:

  • Veo 3.1 Lite is my pick if I want the best all-around default.
  • Seedance 2.0 is my pick if I want the most lived-in, local, human-feeling result for this kind of reel.

That is why I am calling this a real split verdict instead of hedging. Both are excellent. Veo 3.1 Lite won the first-round benchmark for me. Seedance won the second-round authenticity test for me. If I were building a serious travel content machine right now, I would keep both in rotation.