Content Traffic is Vanity. Training Data is the Moat.

Bernard Huang

April 24, 2026 · 6 min read

TL;DR

I was cleaning up abandoned experiments on tabiji.ai and noticed our dormant public API had taken 4,667 requests over 8 days, with Meta’s crawler accounting for 83% of the non-human traffic. Flattering, but the wrong metric. AI web search (what the crawler is doing) only patches gaps in a model’s training data: yesterday’s pricing, this week’s news. Training data itself is where brand presence compounds, and frontier training runs cost hundreds of millions of dollars and refresh every 6–18 months. So I uploaded tabiji’s whole dataset (8,799 records, 11.2 MB, Parquet, CC-BY-4.0) to Hugging Face, where OpenAI, Anthropic, Meta, and Mistral source their training corpora. One upload, permanent citation, zero ongoing bandwidth. Full play below.

I was going through old tabiji.ai experiments last week, decluttering. The public /api/ endpoint was in there — an afterthought we shipped early, barely marketed, mostly forgot about. Before I killed it, I pulled up the Cloudflare dashboard to see what, if anything, it was doing.

Figure: Cloudflare traffic breakdown for /api/ over 8 days (4,667 requests, 60 human visits, 83% Meta external crawler, long tail of AI crawlers). Caption: Eight days of traffic on our forgotten `/api/` endpoint. The humans are the noise. The bots are the signal.

4,667 requests over 8 days. 60 human visits. 83% of the non-visit traffic is Meta’s external crawler. GPTBot, PerplexityBot, Applebot, Bytespider, and the long tail of AI crawlers account for roughly another 50 requests. The rest is LeakIX / .env probes, internal curl, and scattered browsers. The API we forgot about was, quietly, an LLM-feeding endpoint.
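
For anyone who wants to run the same breakdown on their own logs, here is a minimal sketch that buckets requests by User-Agent. The bot names are the ones listed above; the matching tokens and the sample log lines are my assumptions for illustration, not tabiji.ai's actual traffic or Cloudflare's classification logic.

```python
from collections import Counter

# Substrings that identify crawlers by User-Agent. The bot names come from the
# post; the exact tokens are assumptions and should be checked against each
# vendor's published crawler documentation.
CRAWLER_TOKENS = {
    "meta-externalagent": "Meta external crawler",
    "gptbot": "GPTBot",
    "perplexitybot": "PerplexityBot",
    "applebot": "Applebot",
    "bytespider": "Bytespider",
}


def classify(user_agent: str) -> str:
    """Return a crawler label for a User-Agent string, or 'other'."""
    ua = user_agent.lower()
    for token, label in CRAWLER_TOKENS.items():
        if token in ua:
            return label
    return "other"


def crawler_share(user_agents: list[str]) -> dict[str, float]:
    """Share of requests per label, as fractions of all requests given."""
    counts = Counter(classify(ua) for ua in user_agents)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical sample log lines, not real traffic.
sample = [
    "meta-externalagent/1.1",
    "meta-externalagent/1.1",
    "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2)",
    "curl/8.4.0",
]
shares = crawler_share(sample)
```

Substring matching is crude (User-Agents can be spoofed), so treat the output as a rough picture of who is knocking rather than a verified count.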

My first reaction was to feel flattered. My second reaction was to ask whether “the LLM crawlers are reading my API” is actually the metric I want to be chasing.

It’s not.

Two layers: web search vs. training data

The AEO (answer engine optimization) post I wrote a couple of weeks back broke down the three layers of an AI response: training data, validation search, and memory. What the Meta crawler hitting my API represents is the validation layer: agents fetching fresh data at query time to fill gaps in their training corpus. “Is tabiji’s pricing still current? Did any advisories change this week?” That’s useful, and the brand can show up in the answer for that specific query. But it’s single-turn impact. The model isn’t learning anything about tabiji from one JSON response during inference.

The training-data layer is the one that compounds. Content that makes it into the corpus gets baked into the model weights. Every query against that model — for the next year or so, until the next refresh — has some probability of surfacing your entity, your framing, your voice. You don’t have to be fetched again. You’re already in there.

Why training data is a moat

Training runs are expensive and infrequent. Epoch AI pegs GPT-4.5’s training cost around $340 million and Grok 4’s around $390 million. The widely-circulated claim that Anthropic’s “Mythos” cost $10 billion traces back to a single tweet, not to Anthropic — the real number’s murkier, but every credible estimate puts frontier runs in the hundreds of millions, minimum.

Figure: Google AI Overview for the query '10b mythos cost', stating the Anthropic Mythos model training cost is estimated around $10 billion, citing LinkedIn. Caption: Aside, too on-the-nose to skip: Google’s AI Overview happily quotes the $10B figure as fact, sourced from LinkedIn. Which is exactly the point: the AI layer propagates whatever made it into its pipeline, verified or not. Including about itself.

What that cost means for you: training corpora don’t get refreshed casually. Data cutoffs run 6–18 months behind model release dates. Whatever made it into the corpus is riding a very long wave. Whatever didn’t is waiting at least another refresh cycle, usually longer.

If your content exists on a website — a normal 2,000–5,000-word blog post written for humans to read — crawlers may or may not pick it up, may or may not clean it into a useful training row, may or may not deduplicate it against other sources that said the same thing louder. It’s a lottery ticket with unknown odds.

The move: Hugging Face

Figure: Recommendation text reading “Hugging Face dataset is the biggest lever. HF is where OpenAI, Meta, Anthropic, Mistral source training corpora. One upload equals broad reach, permanent citation, zero ongoing bandwidth cost.” Caption: The shortlist of non-API ways to serve the data. Hugging Face was #1 by a wide margin.

The single highest-leverage thing I could do — and the thing I did — was upload tabiji’s whole dataset to Hugging Face. HF is where OpenAI, Anthropic, Meta, and Mistral literally source training corpora. It’s not a crawl target where your content might make it through preprocessing — it’s a pre-curated, license-tagged, schema-clean distribution point that’s already in the pipeline.

What I uploaded:

6,498 destinations — climate, currency, language, plug type, tap-water safety, tipping, visa notes, coordinates
443 itineraries — day-by-day activities with logistics and timing
396 city-level scam guides — how each scam works, how to avoid, police contacts (the same content the 733 AI-generated comics illustrate)
250 countries, 55 safety profiles, 208 travel advisories (aggregated from US State Dept. and UK FCDO), and 949 head-to-head destination comparisons with Reddit quotes and verdicts

Total: 8,799 records, 11.2 MB, Parquet, CC-BY-4.0. One upload. Permanent citation back to tabiji.ai. Zero ongoing bandwidth cost.
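
A minimal sketch of what that upload step can look like, assuming the tables are already exported as Parquet files in a local folder. The table names, folder, and repo ID are hypothetical. The record counts are the ones listed above, and the `huggingface_hub` calls (`HfApi.create_repo`, `HfApi.upload_file`) are the library's standard upload path, not necessarily the exact route I took.

```python
from pathlib import Path

# Record counts per table, taken from the list above. The table names are
# hypothetical labels, not the dataset's actual file names.
RECORD_COUNTS = {
    "destinations": 6498,
    "itineraries": 443,
    "scam_guides": 396,
    "countries": 250,
    "safety_profiles": 55,
    "travel_advisories": 208,
    "comparisons": 949,
}


def total_records(counts: dict[str, int]) -> int:
    """Sum the per-table counts as a sanity check before uploading."""
    return sum(counts.values())


def upload_parquet_dir(local_dir: Path, repo_id: str) -> None:
    """Push every .parquet file in local_dir to a Hugging Face dataset repo.

    Assumes `pip install huggingface_hub` and a logged-in token. The import
    is lazy so the rest of this sketch runs without the library installed.
    """
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    for path in sorted(local_dir.glob("*.parquet")):
        api.upload_file(
            path_or_fileobj=str(path),
            path_in_repo=path.name,
            repo_id=repo_id,
            repo_type="dataset",
        )


# Example call (hypothetical folder and repo ID, left commented out):
# upload_parquet_dir(Path("export"), "your-org/your-dataset")
```

The license is declared separately, in the YAML header of the repo's README.md dataset card (for example `license: cc-by-4.0`), which is how a dataset ends up license-tagged on the Hub.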

The deeper shift is in serving format. The content on tabiji.ai was written for humans — long-form articles with hooks, narrative, and words like “probably.” The HF dataset is the same knowledge re-expressed in the shape agents actually consume — structured rows with explicit fields and source provenance. Same information, different serving format. The website stays for humans. The dataset ships for machines.
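
To make “structured rows with explicit fields and source provenance” concrete, here is a minimal sketch of one such row. The field names and values are placeholders I made up for illustration, not the real schema or real tabiji.ai data, and it serializes to JSON Lines only to stay dependency-free. Writing Parquet would need an extra library such as pyarrow.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DestinationRow:
    """One structured row. Field names are hypothetical, not the real schema."""
    name: str
    country: str
    currency: str
    plug_type: str
    tap_water_safe: bool
    source_url: str  # provenance: where the claim can be checked
    license: str = "CC-BY-4.0"


def to_jsonl(rows: list[DestinationRow]) -> str:
    """Serialize rows as JSON Lines, one explicit-field object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in rows)


# Placeholder values, not real data.
row = DestinationRow(
    name="Exampleville",
    country="Exampleland",
    currency="EXL",
    plug_type="C",
    tap_water_safe=True,
    source_url="https://tabiji.ai/",
)
line = to_jsonl([row])
```

The design point is that nothing is implied: a human article can say the tap water is “probably fine,” while a row has to commit to a typed value and say where it came from.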

How I’ll know it worked

Honestly, I don’t know yet. The measurement layer for training-data influence is primitive. I added a GA4 annotation on the upload date so I can line up forward metrics against the intervention, and I’ll watch a few signals:

Leading: citations of tabiji.ai in ChatGPT / Claude / Gemini answers to travel-safety queries. Direct-traffic spikes on specific destinations that match dataset rows.
Lagging: the next Llama, Claude, and Gemini training runs. If tabiji made it into the corpus, brand mentions downstream should go up without any new content effort.
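
One rough way to track the leading signal, sketched under heavy assumptions: save the answers assistants give to a fixed set of travel-safety prompts, then compute the share that mention the brand. The answers below are invented examples, and collecting real ones (by hand or via each vendor's API) is outside this sketch.

```python
import re

# Matches "tabiji" or "tabiji.ai", case-insensitively, as a whole word.
BRAND_PATTERN = re.compile(r"\btabiji(\.ai)?\b", re.IGNORECASE)


def citation_rate(answers: list[str]) -> float:
    """Fraction of saved answers that mention the brand at least once."""
    if not answers:
        return 0.0
    hits = sum(1 for answer in answers if BRAND_PATTERN.search(answer))
    return hits / len(answers)


# Invented example answers, not real model output.
saved_answers = [
    "Tap water is generally safe; check local advisories.",
    "According to tabiji.ai, the common taxi scam works like this.",
    "Tipping is not expected in most restaurants.",
    "Carry a type C adapter.",
]
rate = citation_rate(saved_answers)
```

Run the same prompt set on a schedule and compare the rate before and after the upload date. The sample will be small and noisy, so it only points in a direction.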

Ask me in twelve months whether it worked. AEO as a field is moving fast enough that the honest answer to “how do you measure this” is “improvise, and annotate the graph.”

The broader play

Hugging Face is the biggest lever, not the only one. The pattern is simple: ask where training data actually comes from, then show up there.

Structured data on GitHub — also crawled, also cleaned, also attributed.
Wikipedia citations where genuinely appropriate — high training-data signal per citation.
Reddit presence — Reddit’s data is explicitly licensed to Google and is heavily weighted in most frontier corpora.
Internet Archive mirrors — preserves content through URL changes and site shutdowns, increasing the odds it gets crawled cleanly.

The internet was built for humans to read. Agents read it differently. If your content strategy is still “rank on Google for keywords,” you’re optimizing for a consumption surface the new buyer doesn’t use.

Put your data where models go shopping.