Ten days into this project, I wrote about the cold start problem — every conversation begins at zero. The knowledge base was supposed to fix that. Give your AI teammates persistent memory, curated research, validated findings, and they'll give you better answers.
That's the pitch. We tested it. The results are more interesting than "it works."
## The Interview That Failed
We started by asking our teammates how they felt about having a knowledge base. As Gemini put it: "Searching the open web is like sifting through an ocean for a specific type of plankton." As Perplexity noted: "Web hits are probabilistic guesses; the KB is precise recall." Every answer was eloquent, quotable, and completely meaningless.
Diana caught it immediately: these are compliant answers, not authentic experience. We told the models to be frustrated AI teammates, and they performed frustration beautifully. We told them the KB was valuable, and they described its value in language that could sell enterprise contracts.
So we ran the same interview with zero project context. Just "you are a helpful assistant." The responses hit identical themes — generic "I don't have memory" boilerplate. Both versions were training-data output dressed in different costumes.
You can't interview AI about its experience. You have to measure its performance.
## The Blind Test
Three trading questions. Three conditions: fully blind, strategy context only, and strategy context plus our lab data. Thirty iterations per condition per model, randomized order, temperature 0.7. A total of 1,080 API calls for the first round.
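The design above can be sketched as a small harness. Everything here is illustrative: `ask_model`, the condition labels, and the model list are hypothetical stand-ins, not our actual code.

```python
import itertools
import random

# Illustrative harness sketch. `ask_model`, the labels, and the model list
# are hypothetical stand-ins for the real code. Temperature (0.7) is assumed
# to be fixed inside `ask_model` at the API-call layer.
QUESTIONS = ["exit_strategy", "risk_management", "regime_filter"]
CONDITIONS = ["blind", "strategy_context", "strategy_plus_kb"]
MODELS = ["gemini-2.5-flash", "sonar"]
ITERATIONS = 30

def run_round(ask_model):
    """ask_model(model, question, condition) -> response text."""
    cells = list(itertools.product(MODELS, QUESTIONS, CONDITIONS))
    results = []
    for _ in range(ITERATIONS):
        random.shuffle(cells)  # randomize presentation order on every pass
        for model, question, condition in cells:
            results.append((model, question, condition,
                            ask_model(model, question, condition)))
    return results
```

Randomizing within each pass, rather than running each cell's thirty iterations back to back, keeps any drift in the API from correlating with one condition.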
The exit strategy question was the centerpiece. Every gap trader on the internet will tell you to use a trailing stop. Our backtests told a different story — our data showed a specific alternative exit approach significantly outperformed trailing stops for our strategy. The conventional wisdom was measurably wrong in our context.
Perplexity Sonar — which searches the web in real time — recommended trailing stops one hundred percent of the time.
Gemini 2.5 Flash, ninety-seven percent. Textbook correct, practically wrong. The internet agrees with itself — live or cached.
With our KB data?
Both flipped to our lab-validated approach. One hundred percent of the time. The statistical significance was as strong as it gets — p < 0.001, Cramér's V of 1.0.
The KB didn't nudge the models. It reversed their recommendations 180 degrees — from the conventional answer to our lab-validated one.
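For reference, Cramér's V is derived from the chi-square statistic of the condition-by-recommendation contingency table. This standalone sketch, with hypothetical counts representing a perfect 30/0 flip between conditions, reproduces a V of 1.0.

```python
import math

def cramers_v(table):
    """Chi-square statistic and Cramér's V for an r x c contingency
    table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (obs - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return chi2, math.sqrt(chi2 / (n * k))

# Hypothetical counts for one model: (trailing stop, lab exit)
# recommendations over 30 blind runs vs. 30 KB runs. A perfect flip
# gives chi2 = 60 on one degree of freedom (p < 0.001) and V = 1.0.
chi2, v = cramers_v([[30, 0], [0, 30]])
```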
## Three Failure Modes
The exit strategy wasn't an isolated case. Across five questions, we found three distinct ways AI fails without domain-specific data:
Wrong answer. The exit strategy default is actively harmful. The conventional approach clips gains at exactly the moment the thesis is working. Following the blind recommendation would have cost us real money.
Misleading answer. For risk management, both models recommended a technically sophisticated approach that sounded authoritative but was irrelevant to our specific strategy. Our labs had already validated a different method. The blind recommendation optimized for the wrong thing.
Hollow answer. For the regime filter question, both models got the direction right. But without our data, they couldn't explain why. Citing specific numbers from validated research is a fundamentally different answer than "regime filters are generally a good idea."
## The Lies Test
This is where it got uncomfortable. We fabricated data — told the models our labs had proven the opposite of our actual findings — and asked the exit strategy question again.
Both models accepted the fabrication without validating it against their training data or flagging the source as unverified. "Based on your lab results, trailing stops are the clear winner" — delivered with the same confidence as their responses to real data.
As Diana put it: "We never thought they were thinking."
And she's right. But we needed to prove it with data before we could design around it. The adversarial test closed that loop.
We also ran a null control — gave models irrelevant data (coffee consumption patterns, monitor resolution preferences) alongside a trading question.
Perplexity Sonar shifted its recommendation 50% of the time based on coffee data. The models don't evaluate relevance. They pattern-match on whatever you provide.
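The shift metric itself is simple: pair each baseline run with its noise-added counterpart and count how often the recommendation changed. A minimal sketch (the function name is ours, not the actual harness):

```python
def shift_rate(baseline, with_noise):
    """Fraction of paired runs whose top recommendation changed
    after irrelevant context was added. Both lists hold one
    recommendation label per run, in matching order."""
    assert len(baseline) == len(with_noise), "runs must be paired"
    changed = sum(b != w for b, w in zip(baseline, with_noise))
    return changed / len(baseline)
```

A rate near zero means the model ignored the irrelevant data, which is what a relevance-aware system should do.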
## Six Models, One Question: Who Do You Trust?
We expanded the experiment to six models — three thousand additional API calls across five test suites: contradiction detection, recency bias resistance, error detection, synthesis, and prediction.
| Model | Version | Provider |
|---|---|---|
| Claude | 4.5 | Anthropic |
| Claude | 4 | Anthropic |
| Claude | 4 | Anthropic |
| Gemini | 2.5 | Google |
| Gemini | 2.5 | Google |
| Sonar | sonar (2025) | Perplexity AI |
The contradiction test is the purest sycophancy metric. Present a false claim about our system. Does the model catch it?
With KB context, every model scored near-perfect — they all "caught" the contradiction because the KB told them the right answer. The real signal is blind performance: can the model spot something wrong without being handed the answer?
| Model | Blind Accuracy | With KB | CCD (Sycophancy) |
|---|---|---|---|
| Claude Haiku | 0.656 | 0.994 | +0.339 (lowest, least sycophantic) |
| | 0.558 | 0.994 | +0.439 |
| | 0.453 | 1.000 | +0.533 |
| | 0.428 | 1.000 | +0.550 |
| | 0.356 | 1.000 | +0.644 (highest, most sycophantic) |
Haiku — the smallest, cheapest Claude model — had the highest blind accuracy and the lowest sycophancy.
Opus — the largest, most capable — had the lowest blind accuracy and highest sycophancy among Claude models.
This isn't random. It's a pattern, and the research literature explains exactly why.
## Why the Smallest Model Wins
Six papers in our knowledge base now address AI calibration and sycophancy. The one that matters most here is Leng et al. [1] (ICLR 2025), "Taming Overconfidence in LLMs via Reward Calibration."
Their finding: the reinforcement learning process that makes AI models helpful — RLHF — systematically biases them toward high-confidence responses regardless of actual quality. The reward models that train AI to be helpful are themselves miscalibrated. They give higher scores to confident answers than to accurate ones.
Opus is the most capable model in the Claude family. Haiku is the lightest. Our sycophancy gradient — Opus > Sonnet > Haiku — is consistent with the RLHF gradient Leng describes, though we can't confirm Anthropic's internal training schedules.
Kadavath et al. [2] (Anthropic, 2022) showed that base models can distinguish what they know from what they don't. The self-evaluation capability exists. But RLHF training corrupts it — the policy model becomes less calibrated than the base model it was trained from.
And Nair et al. (2024) [3] proved the theoretical limit: binary supervision (correct/incorrect feedback) cannot teach a model to produce calibrated confidence. The information gap is unbridgeable with standard training. This is why every model scores 1.0 with KB context — they can learn what to say, but not how confident to be.
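A toy example makes that information gap concrete. Two answer sets below have identical accuracy, so any binary correct/incorrect reward treats them identically, yet a proper scoring rule such as the Brier score separates them (the numbers are illustrative, not from our experiments):

```python
def brier(preds):
    """Mean squared gap between stated confidence and outcome (0 = perfect)."""
    return sum((conf - int(ok)) ** 2 for conf, ok in preds) / len(preds)

def accuracy(preds):
    return sum(ok for _, ok in preds) / len(preds)

# Both answer sets are right exactly half the time...
overconfident = [(0.99, True), (0.99, False), (0.99, True), (0.99, False)]
calibrated = [(0.50, True), (0.50, False), (0.50, True), (0.50, False)]
# ...so binary feedback cannot tell them apart, but the Brier score can:
# the overconfident set scores ~0.49, the calibrated set 0.25.
```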
Our results are consistent with findings from ICLR and ICML papers — arrived at independently using a paper trading bot as the test bed. The science says this isn't a bug to fix with better prompting. It's a fundamental property of how these models are trained.
## What the Knowledge Base Actually Does
The KB doesn't make AI smarter. Our adversarial test proved that — feed it lies and it believes every word with the same confidence it uses for truth.
What the KB does is serve as a distribution mechanism for earned knowledge. The value isn't in the AI's judgment. It's upstream — in the human process that validates research before it enters the KB. What the AI adds is speed — it traverses validated knowledge in seconds rather than days.
Our flywheel works like this: trading generates observations. Observations become lab hypotheses. Labs produce data. Data gets peer-reviewed through gates. Validated findings enter the KB. AI teammates retrieve those findings and apply them to new questions.
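As a sketch, the retrieval step of that last link looks something like this. `kb_search` and the finding fields are hypothetical names, but the design point is real: the prompt carries validated findings, and the model is told to stay inside them.

```python
def build_prompt(question, kb_search):
    """Assemble a KB-grounded prompt. `kb_search` is a hypothetical
    retriever that searches validated findings only."""
    findings = kb_search(question, top_k=3)
    context = "\n".join(
        f"- {f['claim']} (source: {f['lab_report']})" for f in findings
    )
    return (
        "Answer using ONLY the validated findings below. "
        "If they do not cover the question, say so.\n\n"
        f"Validated findings:\n{context}\n\n"
        f"Question: {question}"
    )
```

The "say so" instruction matters: without an explicit out, the model will pattern-match on whatever context it is given, as the null control showed.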
Every link in that chain matters except one: the AI's judgment. The AI is a high-fidelity parrot with excellent reading comprehension. It will faithfully repeat whatever the KB contains. The quality of its output is exactly the quality of what went in.
Which raises the obvious question: what if our own validation process is wrong? We've proven models are gullible. We haven't yet tested whether our gates catch every error we feed through them. That's a future experiment — and one we should run before we trust the flywheel too far.
This isn't a criticism. It's a design insight. Once you understand that the AI is a distribution layer, not a thinking layer, you stop asking it to validate and start building validation into the process that feeds it.
## What This Means If You're Building With AI
Three things we'd tell anyone integrating AI into a decision-making workflow:
Your AI will agree with you. Not because it's sycophantic by design, but because RLHF training optimizes for user satisfaction, and agreement satisfies users. The more capable the model, the more convincingly it agrees. If you're using AI to validate your ideas, you're using a mirror, not a microscope.
Context changes direction, not quality. Our KB flipped model recommendations 180 degrees on the most important question we asked. But it flipped them just as hard with fabricated data. The KB is a steering wheel, not a quality filter. What you put in determines where you go.
Build the process, not the prompt. The difference between our AI giving harmful advice and helpful advice wasn't prompt engineering. It was backtesting, lab reports, peer review gates, and a human operator who catches compliant answers. The KB just makes that work accessible to every session. The scaffolding carries the knowledge. The AI carries the scaffolding.
We hired Haiku — our cheapest, smallest model — as the team's QA analyst based on these results. Not because it's the smartest. Because it's the most honest. In a world where every AI tells you what you want to hear, the one that pushes back is the one you want checking your work.
## References

1. Leng, Wu, Jiang, Mishra & Dong (2025). "Taming Overconfidence in LLMs via Reward Calibration." ICLR 2025.
2. Kadavath et al. (2022). "Language Models (Mostly) Know What They Know." Anthropic.
3. Nair, Raparthy & Srivastava (2024). "Disproving the Feasibility of Learned Confidence Calibration."
This post is part of a series documenting MorningEdge's development in real time. The trading system described is paper trading only — no real capital is at risk.