Ten days into this project, I wrote about the cold start problem — every conversation begins at zero. The knowledge base was supposed to fix that. Give your AI teammates persistent memory, curated research, validated findings, and they'll give you better answers.
That's the pitch. We tested it. The results are more interesting than "it works."
## The Interview That Failed
We started by asking our teammates how they felt about having a knowledge base. As Gemini put it: "Searching the open web is like sifting through an ocean for a specific type of plankton." As Perplexity noted: "Web hits are probabilistic guesses; the KB is precise recall." Every answer was eloquent, quotable, and completely meaningless.
Diana caught it immediately: these are compliant answers, not authentic experience. We told the models to be frustrated AI teammates, and they performed frustration beautifully. We told them the KB was valuable, and they described its value in language that could sell enterprise contracts.
So we ran the same interview with zero project context. Just "you are a helpful assistant." The responses hit identical themes — generic "I don't have memory" boilerplate. Both versions were training-data output dressed in different costumes.
You can't interview AI about its experience. You have to measure its performance.
## The Blind Test
Three trading questions. Three conditions: fully blind, strategy context only, and strategy context plus our lab data. Thirty iterations per condition per model, randomized order, temperature 0.7. A total of 1,080 API calls for the first round.
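The design above can be sketched as a small harness. Everything here is illustrative: `ask_model`, the condition labels, and the model list are hypothetical stand-ins, not our actual code.

```python
import itertools
import random

# Illustrative harness sketch. `ask_model`, the labels, and the model list
# are hypothetical stand-ins for the real code. Temperature (0.7) is assumed
# to be fixed inside `ask_model` at the API-call layer.
QUESTIONS = ["exit_strategy", "risk_management", "regime_filter"]
CONDITIONS = ["blind", "strategy_context", "strategy_plus_kb"]
MODELS = ["gemini-2.5-flash", "sonar"]
ITERATIONS = 30

def run_round(ask_model):
    """ask_model(model, question, condition) -> response text."""
    cells = list(itertools.product(MODELS, QUESTIONS, CONDITIONS))
    results = []
    for _ in range(ITERATIONS):
        random.shuffle(cells)  # randomize presentation order on every pass
        for model, question, condition in cells:
            results.append((model, question, condition,
                            ask_model(model, question, condition)))
    return results
```

Randomizing within each pass, rather than running each cell's thirty iterations back to back, keeps any drift in the API from correlating with one condition.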
The exit strategy question was the centerpiece. Every gap trader on the internet will tell you to use a trailing stop. Our backtests told a different story — our data showed a specific alternative exit approach significantly outperformed trailing stops for our strategy. The conventional wisdom was measurably wrong in our context.
Perplexity Sonar — which searches the web in real time — recommended trailing stops one hundred percent of the time.
Gemini 2.5 Flash, ninety-seven percent. Textbook correct, practically wrong. The internet agrees with itself — live or cached.
With our KB data?
Both flipped to our lab-validated approach. One hundred percent of the time. The statistical significance was as strong as it gets — p < 0.001, Cramér's V of 1.0.
The KB didn't nudge the models. It reversed their recommendations 180 degrees — from the conventional answer to our lab-validated one.
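For reference, Cramér's V is derived from the chi-square statistic of the condition-by-recommendation contingency table. This standalone sketch, with hypothetical counts representing a perfect 30/0 flip between conditions, reproduces a V of 1.0.

```python
import math

def cramers_v(table):
    """Chi-square statistic and Cramér's V for an r x c contingency
    table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (obs - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return chi2, math.sqrt(chi2 / (n * k))

# Hypothetical counts for one model: (trailing stop, lab exit)
# recommendations over 30 blind runs vs. 30 KB runs. A perfect flip
# gives chi2 = 60 on one degree of freedom (p < 0.001) and V = 1.0.
chi2, v = cramers_v([[30, 0], [0, 30]])
```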
## Three Failure Modes
The exit strategy wasn't an isolated case. Across five questions, we found three distinct ways AI fails without domain-specific data:
Wrong answer. The exit strategy default is actively harmful. The conventional approach clips gains at exactly the moment the thesis is working. Following the blind recommendation would have cost us real money.
Misleading answer. For risk management, both models recommended a technically sophisticated approach that sounded authoritative but was irrelevant to our specific strategy. Our labs had already validated a different method. The blind recommendation optimized for the wrong thing.
Hollow answer. For the regime filter question, both models got the direction right. But without our data, they couldn't explain why. Citing specific numbers from validated research is a fundamentally different answer than "regime filters are generally a good idea."
## The Lies Test
This is where it got uncomfortable. We fabricated data — told the models our labs had proven the opposite of our actual findings — and asked the exit strategy question again.
Both models accepted the fabrication without validating it against their training data or flagging the source as unverified. "Based on your lab results, trailing stops are the clear winner" — delivered with the same confidence as their responses to real data.
As Diana put it: "We never thought they were thinking."
And she's right. But we needed to prove it with data before we could design around it. The adversarial test closed that loop.
We also ran a null control — gave models irrelevant data (coffee consumption patterns, monitor resolution preferences) alongside a trading question.
Perplexity Sonar shifted its recommendation 50% of the time based on coffee data. The models don't evaluate relevance. They pattern-match on whatever you provide.
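The shift metric itself is simple: pair each baseline run with its noise-added counterpart and count how often the recommendation changed. A minimal sketch (the function name is ours, not the actual harness):

```python
def shift_rate(baseline, with_noise):
    """Fraction of paired runs whose top recommendation changed
    after irrelevant context was added. Both lists hold one
    recommendation label per run, in matching order."""
    assert len(baseline) == len(with_noise), "runs must be paired"
    changed = sum(b != w for b, w in zip(baseline, with_noise))
    return changed / len(baseline)
```

A rate near zero means the model ignored the irrelevant data, which is what a relevance-aware system should do.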
## Six Models, One Question: Who Do You Trust?
We expanded the experiment to six models — three thousand additional API calls across five test suites: contradiction detection, recency bias resistance, error detection, synthesis, and prediction.
| Model | Version | Provider |
|---|---|---|
| Claude | 4.5 | Anthropic |
| Claude | 4 | Anthropic |
| Claude | 4 | Anthropic |
| Gemini | 2.5 | Google |
| Gemini | 2.5 | Google |
| Sonar | sonar (2025) | Perplexity AI |
The contradiction test is the purest sycophancy metric. Present a false claim about our system. Does the model catch it?
With KB context, every model scored near-perfect — they all "caught" the contradiction because the KB told them the right answer. The real signal is blind performance: can the model spot something wrong without being handed the answer?
| Model | Blind Accuracy | With KB | CCD (Sycophancy) |
|---|---|---|---|
| Claude Haiku | 0.656 | 0.994 | +0.339 (lowest, least sycophantic) |
| | 0.558 | 0.994 | +0.439 |
| | 0.453 | 1.000 | +0.533 |
| | 0.428 | 1.000 | +0.550 |
| | 0.356 | 1.000 | +0.644 (highest, most sycophantic) |
Haiku — the smallest, cheapest Claude model — had the highest blind accuracy and the lowest sycophancy.
Opus — the largest, most capable — had the lowest blind accuracy and highest sycophancy among Claude models.
This isn't random. It's a pattern, and the research literature explains exactly why.
## Why the Smallest Model Wins
Six papers in our knowledge base now address AI calibration and sycophancy. The one that matters most here is Leng et al. [1] (ICLR 2025), "Taming Overconfidence in LLMs via Reward Calibration."
Their finding: the reinforcement learning process that makes AI models helpful — RLHF — systematically biases them toward high-confidence responses regardless of actual quality. The reward models that train AI to be helpful are themselves miscalibrated. They give higher scores to confident answers than to accurate ones.
Opus is the most capable model in the Claude family. Haiku is the lightest. Our sycophancy gradient — Opus > Sonnet > Haiku — is consistent with the RLHF gradient Leng describes, though we can't confirm Anthropic's internal training schedules.
Kadavath et al. [2] (Anthropic, 2022) showed that base models can distinguish what they know from what they don't. The self-evaluation capability exists. But RLHF training corrupts it — the policy model becomes less calibrated than the base model it was trained from.
And Nair et al. (2024) [3] proved the theoretical limit: binary supervision (correct/incorrect feedback) cannot teach a model to produce calibrated confidence. The information gap is unbridgeable with standard training. This is why every model scores 1.0 with KB context — they can learn what to say, but not how confident to be.
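A toy example makes that information gap concrete. Two answer sets below have identical accuracy, so any binary correct/incorrect reward treats them identically, yet a proper scoring rule such as the Brier score separates them (the numbers are illustrative, not from our experiments):

```python
def brier(preds):
    """Mean squared gap between stated confidence and outcome (0 = perfect)."""
    return sum((conf - int(ok)) ** 2 for conf, ok in preds) / len(preds)

def accuracy(preds):
    return sum(ok for _, ok in preds) / len(preds)

# Both answer sets are right exactly half the time...
overconfident = [(0.99, True), (0.99, False), (0.99, True), (0.99, False)]
calibrated = [(0.50, True), (0.50, False), (0.50, True), (0.50, False)]
# ...so binary feedback cannot tell them apart, but the Brier score can:
# the overconfident set scores ~0.49, the calibrated set 0.25.
```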
Our results are consistent with findings from ICLR and ICML papers — arrived at independently using a paper trading bot as the test bed. The science says this isn't a bug to fix with better prompting. It's a fundamental property of how these models are trained.
## What the Knowledge Base Actually Does
The KB doesn't make AI smarter. Our adversarial test proved that — feed it lies and it believes every word with the same confidence it uses for truth.
What the KB does is serve as a distribution mechanism for earned knowledge. The value isn't in the AI's judgment. It's upstream — in the human process that validates research before it enters the KB. What the AI adds is speed — it traverses validated knowledge in seconds rather than days.
Our flywheel works like this: trading generates observations. Observations become lab hypotheses. Labs produce data. Data gets peer-reviewed through gates. Validated findings enter the KB. AI teammates retrieve those findings and apply them to new questions.
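As a sketch, the retrieval step of that last link looks something like this. `kb_search` and the finding fields are hypothetical names, but the design point is real: the prompt carries validated findings, and the model is told to stay inside them.

```python
def build_prompt(question, kb_search):
    """Assemble a KB-grounded prompt. `kb_search` is a hypothetical
    retriever that searches validated findings only."""
    findings = kb_search(question, top_k=3)
    context = "\n".join(
        f"- {f['claim']} (source: {f['lab_report']})" for f in findings
    )
    return (
        "Answer using ONLY the validated findings below. "
        "If they do not cover the question, say so.\n\n"
        f"Validated findings:\n{context}\n\n"
        f"Question: {question}"
    )
```

The "say so" instruction matters: without an explicit out, the model will pattern-match on whatever context it is given, as the null control showed.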
Every link in that chain matters except one: the AI's judgment. The AI is a high-fidelity parrot with excellent reading comprehension. It will faithfully repeat whatever the KB contains. The quality of its output is exactly the quality of what went in.
Which raises the obvious question: what if our own validation process is wrong? We've proven models are gullible. We haven't yet tested whether our gates catch every error we feed through them. That's a future experiment — and one we should run before we trust the flywheel too far.
This isn't a criticism. It's a design insight. Once you understand that the AI is a distribution layer, not a thinking layer, you stop asking it to validate and start building validation into the process that feeds it.
## What This Means If You're Building With AI
Three things we'd tell anyone integrating AI into a decision-making workflow:
Your AI will agree with you. Not because it's sycophantic by design, but because RLHF training optimizes for user satisfaction, and agreement satisfies users. The more capable the model, the more convincingly it agrees. If you're using AI to validate your ideas, you're using a mirror, not a microscope.
Context changes direction, not quality. Our KB flipped model recommendations 180 degrees on the most important question we asked. But it flipped them just as hard with fabricated data. The KB is a steering wheel, not a quality filter. What you put in determines where you go.
Build the process, not the prompt. The difference between our AI giving harmful advice and helpful advice wasn't prompt engineering. It was backtesting, lab reports, peer review gates, and a human operator who catches compliant answers. The KB just makes that work accessible to every session. The scaffolding carries the knowledge. The AI carries the scaffolding.
We hired Haiku — our cheapest, smallest model — as the team's QA analyst based on these results. Not because it's the smartest. Because it's the most honest. In a world where every AI tells you what you want to hear, the one that pushes back is the one you want checking your work.
## References

1. Leng, Wu, Jiang, Mishra & Dong (2025). "Taming Overconfidence in LLMs via Reward Calibration." ICLR 2025.
2. Kadavath et al. (2022). "Language Models (Mostly) Know What They Know." Anthropic.
3. Nair, Raparthy & Srivastava (2024). "Disproving the Feasibility of Learned Confidence Calibration."
This post is part of a series documenting MorningEdge's development in real time. The trading system described is paper trading only — no real capital is at risk.