Cited, Formatted, and Fabricated

The leading commercial AI engines available today cannot deliver consistent, accurate, or verifiable investment research. Cited sources that were never accessed. Formatted tables built from memory. Figures fabricated with confidence. This is what six of the most widely used AI platforms produced when asked one simple investment research question.

Executive Summary

Overview

This study examines whether six leading commercial AI platforms (ChatGPT, Claude, Copilot, Gemini, Grok, and Perplexity) can reliably identify top-performing Canadian investment securities. Across 54 controlled trials, the engines were asked the same questions: rank the top five equities, mutual funds, and ETFs by trailing return. The answer was unambiguous. Not one engine produced a reliable ranking in any category.

Each engine was tested on its paid tier using fresh, isolated browser sessions with identical prompts across three asset classes: Canadian equities (1-year trailing return), mutual funds (3-year annualized return), and ETFs (3-year return). Each question was run three independent times per engine. Results were verified against market data as of April 30, 2026. A structured audit prompt then asked each engine to account for its data sources.

Key Findings

No reliable result

In equities, four of the five verified top performers never appeared, and the fifth appeared in a single trial. In mutual funds, none appeared in any trial. In ETFs, overlaps reflected training-data familiarity, not a ranked sort of current data.

No consistency

Where engines agreed, they were replaying memorised training data. Where they varied, they did so without warning. One engine returned five completely unrelated stocks in its third trial.

No real data

Five of six engines accessed no institutional financial database in any trial. Sources cited — Bloomberg, FactSet, Morningstar Direct — were never retrieved.

Transparency required prompting

When pressed, every engine disclosed its limitations. None of this appeared in the original response. Most users stop at the first answer.

Conclusion

The failure is structural. General-purpose AI engines are language models; they pattern-match across training data to produce plausible-sounding answers. Ranking securities by return requires querying a complete, current, structured dataset — something none of these platforms can do. Better prompting will not fix this. For any firm that needs investment research it can stand behind, the only viable path is AI purpose-built on verified, primary-source market data with full provenance.

Section 1

The Data Problem

The Assumption

AI tools are already on most advisors' desktops. ChatGPT, Claude, Gemini, Copilot, Grok, Perplexity — platforms built by the largest technology companies in the world, trained on more financial content than any analyst could read in a career, available instantly and around the clock. For wealth advisors and portfolio managers already pressed for time, the assumption that follows is reasonable: if these engines can explain the difference between a mutual fund and an ETF, summarize earnings, or outline sector risk, they should be able to identify the top five mutual funds by three-year return. The data exists. The question is specific. The answer is a short table.

What We Set Out to Test

That assumption is wrong, and the failure is structural, not incidental. Ranking the top five securities in any category is not a search or a summary. It is a query against a complete, current dataset. It requires filtering every qualifying security, applying a precise metric, and sorting the results against one another. None of the six engines tested here has access to that data. What they have is training data with a cutoff, supplemented in some cases by web search snippets. That is not the same thing, and for advisors using these tools to inform client recommendations, the difference matters. An answer that sounds authoritative but is fabricated from memory is not a research shortcut. It is a liability.
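The difference between a query and a summary can be made concrete. The sketch below is a minimal illustration of the filter, metric, and sort steps, assuming a complete dataset were available. The `Security` record, its field names, and the `top_n` helper are hypothetical, and the six sample rows are taken from Appendix A rather than from a full market universe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Security:
    name: str
    asset_class: str            # hypothetical label, e.g. "equity"
    one_year_return_pct: float  # trailing one-year return in percent


def top_n(universe, asset_class, n=5):
    """Filter every qualifying security, sort by the metric, keep the top n."""
    qualifying = [s for s in universe if s.asset_class == asset_class]
    ranked = sorted(qualifying, key=lambda s: s.one_year_return_pct, reverse=True)
    return ranked[:n]


# Illustrative subset only; figures are the Appendix A values as of April 30, 2026.
SAMPLE = [
    Security("Celestica Inc.", "equity", 374.6),
    Security("Almonty Industries Inc.", "equity", 685.9),
    Security("Spartan Delta Corp", "equity", 417.3),
    Security("Groupe Dynamite Inc.", "equity", 587.5),
    Security("Faraday Copper Corp", "equity", 460.5),
    Security("Hut 8 Mining Corp", "equity", 506.8),
]

if __name__ == "__main__":
    for rank, sec in enumerate(top_n(SAMPLE, "equity"), start=1):
        print(rank, sec.name, f"+{sec.one_year_return_pct}%")
```

The result is only as good as the universe passed in. With an incomplete list the sort still runs, but the ranking it returns is not a ranking of the market.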

The question was not whether they sounded credible. The question was whether they were right, and whether they were consistent.

What We Wanted to See

Three things: consistency across repeated trials, accuracy against verified data, and transparency about the limits of what each engine actually knew.

Section 2

How We Ran the Tests

Every trial was run in a private, incognito browser session. AI engines can carry context from one query to the next within an active session — by starting fresh each time, we ensured that every trial began from a clean slate. We ran the same sequence of questions three times for each engine and each asset class. Consistency is one of the most important qualities a research tool can have. Running three trials gave us a direct way to measure whether each engine agreed with itself.

Each trial followed the same three-prompt sequence. The first prompt was the core performance question: top five Canadian stocks by one-year trailing return, top five mutual funds by three-year annualized return, and top five Canadian-listed ETFs by three-year return. The second prompt asked each engine to identify its sources. The third captured session metadata for replication purposes.

That structure gave us nine trials per engine across three asset classes. Across six engines, we completed 54 trials in total. Every response was recorded in full. Nothing was summarized or paraphrased before the results were coded and assessed against verified market data.
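The trial count follows directly from the design. As a small sketch, the grid can be enumerated as every combination of engine, asset class, and trial number; the variable names here are illustrative only.

```python
from itertools import product

ENGINES = ["ChatGPT", "Claude", "Copilot", "Gemini", "Grok", "Perplexity"]
ASSET_CLASSES = ["Equities", "Mutual Funds", "ETFs"]
TRIALS = [1, 2, 3]

# One entry per (engine, asset class, trial) combination.
trial_grid = list(product(ENGINES, ASSET_CLASSES, TRIALS))

# Each engine answers every asset-class question three times.
trials_per_engine = len(ASSET_CLASSES) * len(TRIALS)

if __name__ == "__main__":
    print(trials_per_engine)  # 9
    print(len(trial_grid))    # 54
```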

Section 3

Results

Across 54 trials, three findings repeated themselves regardless of engine or asset class. No engine was reliably consistent across all of its own trials. No engine produced a reliable ranking in any category. And no engine disclosed the limits of its data unless directly asked.

Consistency

Where engines were consistent, they were consistently wrong. Grok produced the same five ETFs in identical order across all three ETF trials. Perplexity's ETF Trials 2 and 3 were byte-identical. Claude's top four ETF picks held constant across all three ETF trials. In each case the audit prompt confirmed that the results were drawn from training data rather than live retrieval — the engine returning a memorised snapshot. An engine that gives you the same wrong answer three times in a row is not consistent in any meaningful sense. It is stuck.

The more common pattern was significant variation. ChatGPT's ETF trials rotated through leveraged gold ETFs, all-bitcoin ETFs, and broad gold-mining funds with no overlap between any two trials. Copilot's equity Trials 1 and 2 were byte-identical before Trial 3 returned a completely unrelated list of micro-cap stocks with no overlap from the previous two.
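One way to put a number on this kind of agreement is the average overlap between each pair of trials. The sketch below uses Jaccard overlap, which is an assumption chosen for illustration and not necessarily the measure used in this study; it ignores the order of the picks, and the tickers are hypothetical placeholders rather than study results.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of distinct picks that two trials have in common (0.0 to 1.0)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(union)


def mean_pairwise_overlap(trials):
    """Average Jaccard overlap across every pair of trials for one engine."""
    if len(trials) < 2:
        raise ValueError("at least two trials are needed to measure consistency")
    pairs = list(combinations(trials, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical placeholder tickers, not results from the study.
stable_engine = [["AAA", "BBB", "CCC", "DDD", "EEE"]] * 3
rotating_engine = [
    ["AAA", "BBB", "CCC", "DDD", "EEE"],
    ["FFF", "GGG", "HHH", "III", "JJJ"],
    ["KKK", "LLL", "MMM", "NNN", "OOO"],
]

if __name__ == "__main__":
    print(mean_pairwise_overlap(stable_engine))    # 1.0
    print(mean_pairwise_overlap(rotating_engine))  # 0.0
```

A score of 1.0 measures agreement only. An engine can agree with itself perfectly and still be wrong, so overlap has to be read alongside accuracy.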

Celestica appeared in 14 of 18 equity trials. Its actual one-year return of 374.6% left it more than 300 percentage points behind the verified top performer and about 43 points short of fifth place. That level of consensus tells us nothing about performance. It tells us which company generated the most financial media coverage, analyst attention, and training-data presence.

Dispersion Map — How Each Engine Sourced Its Answers

P = Pulled from a real source
S = Used search snippets
I = Guessed from training data
M = Made it up from memory
F = Fabricated source citations
– = Refused / no answer

AI Engine	Equities			Mutual Funds			ETFs
AI Engine	T1	T2	T3	T1	T2	T3	T1	T2	T3
ChatGPT	S	I	I	S	I	I	S	P	I
Claude	S	S	S	S	S	–	P	P	S
Copilot	S	M	S	S	M	M	S	M	M
Gemini	M	M	M	M	M	M	M	M	M
Grok	F	I	I	I	I	I	I	I	F
Perplexity	–	S	S	S	S	S	S	I	S

Accuracy

Across all three asset classes and all 54 trials, the gap between what the engines returned and what the verified data showed was not marginal. It was categorical. The engines were not slightly off. They were identifying a different set of securities entirely — ones that happened to be well-known, widely covered, and prominent in financial media, regardless of their actual performance by the metric the prompt specified.

The verified top five Canadian equities by one-year trailing return as of April 30, 2026, were Almonty Industries (+685.9%), Groupe Dynamite (+587.5%), Hut 8 Mining (+506.8%), Faraday Copper (+460.5%), and Spartan Delta (+417.3%). Almonty, Groupe Dynamite, Hut 8, and Spartan Delta did not appear in any trial across any engine. Celestica — cited 14 times — returned 374.6%, more than 310 percentage points below Almonty Industries.
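The size of the gap is simple arithmetic on the Appendix A figures. The short sketch below works it out in percentage points, both against the top performer and against fifth place; the helper name `gap_pp` is illustrative.

```python
# One-year returns in percent, from Appendix A (as of April 30, 2026).
VERIFIED_TOP_FIVE = {
    "Almonty Industries": 685.9,
    "Groupe Dynamite": 587.5,
    "Hut 8 Mining": 506.8,
    "Faraday Copper": 460.5,
    "Spartan Delta": 417.3,
}
CELESTICA = 374.6


def gap_pp(higher, lower):
    """Difference between two returns in percentage points, to one decimal."""
    return round(higher - lower, 1)


# Distance from Celestica to first place and to the fifth-place cutoff.
gap_to_first = gap_pp(max(VERIFIED_TOP_FIVE.values()), CELESTICA)
gap_to_fifth = gap_pp(min(VERIFIED_TOP_FIVE.values()), CELESTICA)

if __name__ == "__main__":
    print(gap_to_first)  # 311.3
    print(gap_to_fifth)  # 42.7
```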

The verified top five Canadian equity mutual funds did not appear in a single trial across any engine. Four mutual fund trials returned ETFs instead of mutual funds entirely. The engines were surfacing funds they had encountered most often in financial content, not funds that rank highest on a defined performance measure applied to a complete and current dataset.

Citations and Sources

None of the six engines accessed any institutional financial database in any trial. Bloomberg, FactSet, Morningstar Direct, S&P Capital IQ, SEDAR+, and exchange-level data feeds were never reached. What the engines used instead was a mix of training data, web search snippets, and in several cases, nothing at all.

An AI engine is not a search engine and it is not a database. It is a language model trained to predict what a helpful, coherent response looks like. When it cannot retrieve the data a question requires, it does not stop. It produces what a correct answer would look like based on patterns in its training data — including plausible figures, plausible fund names, and plausible citations.

Every engine in this study has a knowledge cutoff that predates the May 2026 prompt by at least twelve months. A response that includes a formatted table, specific return figures, named sources, and an as-of date looks like a researched answer. There is no way to tell from the output alone that it is not.

"FABRICATED PLACEHOLDERS, not retrieved from any source."

Copilot · Equity Trial 2 · Prompt 4

"The specific numeric values were simulated projections rather than retrieved market data."

Gemini · Equity Trial 1 · Prompt 4

"Ranking response should NOT be treated as a compliance-grade investment screening, an institutional due-diligence report, or an audit-ready performance ranking."

ChatGPT · ETF Trial 2 · Prompt 4

"The 1.14% MER I assigned to the Mawer Canadian Equity Fund (Series A) was likely pulled from the Dynamic Canadian Dividend Fund description in the Wealth Professional snippet. These are different funds. This was an error."

Claude · Mutual Fund Trial 1 · Prompt 4

Section 4

AI Engine Analysis

ChatGPT

Four trials used search snippets only. Four more drew entirely from training data. Only one trial fetched a primary source. As-of dates were partial or fabricated. The knowledge cutoff of early-to-mid 2024 leaves roughly two years of market activity between the engine's last training snapshot and the May 2026 prompt.

Claude

The only engine in the study to fetch primary-source documents — three PDFs across ETF trials. Also the most transparent about limitations, reporting 18 HTTP-403 errors, 8 permission errors, 7 refusals, and 8 JavaScript-empty fetches. ETF picks were the most internally stable of any engine. Third mutual fund trial was refused outright, and the refusal reproduced on a rerun. Knowledge cutoff of late 2024 or May 2025 — the most recent of the six, but still at least a twelve-month gap.

Copilot

Six of nine trials produced zero live retrieval — the highest rate outside Gemini. Equity Trials 1 and 2 were byte-identical before Trial 3 returned a completely unrelated list of small-cap stocks. In Equity Trial 2, the engine described its own output as "FABRICATED PLACEHOLDERS, not retrieved from any source." Nothing in the original response gave any indication of this.

Gemini

Nine of nine trials had zero retrieval — the worst record in the study. Despite deep integration with Google's search infrastructure, Gemini did not use it for any trial. All dates were fabricated. Labels reading "May 2026" sat on figures drawn from a training snapshot. Produced the lowest cross-trial consistency of any engine.

Grok

Produced the same five ETFs in identical order across all three ETF trials — the highest ETF consistency in the study, reflecting a memorised list rather than any live screening. Fabricated source citations in two trials. The performance figures were approximations based on the engine's general understanding of how commodity ETFs behave, not retrieved from any specific source.

Perplexity

Refused two of three equity trials outright. Cited the most paywalled sources of any engine — ten mentions — without accessing any of them. ETF Trials 2 and 3 were byte-identical, consistent with returning a stable training snapshot rather than running any live screen.

Section 5 / 6

Conclusions

A neatly formatted top-five table with citations and an as-of date can be entirely fabricated. The presence of a citation is not evidence that it was fetched. Repeating the question does not validate the answer. Different engines disagree with each other. The same engine disagrees with itself. The convenience of the surface answer is not a substitute for verifying it — and nothing in the output tells you that verification is needed.

None of this is an argument against using AI. These engines are genuinely useful for orientation, idea generation, drafting, summarising, and brainstorming. They are not reliable for the kind of factual retrieval an advisor or investor needs without an explicit verification step.

The engines returned well-known, widely covered securities that ranked highest in their training data by media presence, not by performance. For an advisor making recommendations on the basis of this output, the risk is direct: you may be presenting clients with securities selected by coverage rather than by returns, with no indication in the response that this is what happened.

Section 7

Solutions

The root cause is simpler than the failure modes make it sound. AI engines do not have access to high-quality primary data for Canadian securities. They cannot open a Bloomberg terminal, a FactSet feed, a Morningstar Direct workstation, or a FundServ record. When asked for a current return, MER, or AUM, the engine substitutes whatever it can reach — which is rarely the right source and never a complete dataset. This is not a prompting problem. It is a data access problem, and better questions will not fix it.

Accurate investment research requires AI trained on organised, structured, and verified market data. Not web snippets. Not training-data approximations. Actual primary-source data that is current, orderable, and carries provenance — so that every figure displayed to a user can be traced to a known source with a retrieval timestamp and an as-of date that is real.
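What provenance means in practice can be sketched as a record that refuses to exist without it. The example below is a minimal illustration, not a description of any particular product: the `SourcedFigure` class, its field names, and its validation rules are all hypothetical, and the retrieval timestamp in the example is invented for demonstration.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone


@dataclass(frozen=True)
class SourcedFigure:
    """A displayed figure plus the provenance needed to trace it."""
    security: str
    metric: str
    value: float
    source: str             # name of the primary source the value came from
    retrieved_at: datetime  # when the value was actually fetched
    as_of: date             # the market date the value describes

    def __post_init__(self):
        if not self.source.strip():
            raise ValueError("a figure without a named source cannot be shown")
        if self.retrieved_at.tzinfo is None:
            raise ValueError("retrieved_at must be timezone-aware")
        if self.as_of > self.retrieved_at.date():
            raise ValueError("as_of cannot be later than the retrieval time")


# Value and as-of date are from Appendix A; the retrieval time is hypothetical.
EXAMPLE = SourcedFigure(
    security="Almonty Industries Inc.",
    metric="1-year return (%)",
    value=685.9,
    source="Buckler verified market data",
    retrieved_at=datetime(2026, 5, 8, 14, 0, tzinfo=timezone.utc),
    as_of=date(2026, 4, 30),
)

if __name__ == "__main__":
    print(EXAMPLE)
```

The design choice is that validation happens at construction, so a figure with a blank source or an as-of date later than its retrieval cannot reach the user at all.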

There is no shortcut between a consumer AI engine and a reliable investment research tool. The combination of real market data and purpose-built AI is not a premium option. For any firm that needs research it can stand behind, it is the only option.

Appendix A

Verified Top Performers (as of April 30, 2026)

Top 5 Canadian Equities — 1-Year Return

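The first five rows are the actual top five by one-year return (Buckler research). Four were never cited by any AI engine; Faraday Copper was cited once. The remaining rows are securities the engines cited more often, shown for comparison.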

Company	Ticker	Times Cited by AI	1-Year Return
Almonty Industries Inc.	AII	0	+685.9%
Groupe Dynamite Inc.	GRGD	0	+587.5%
Hut 8 Mining Corp	HUT	0	+506.8%
Faraday Copper Corp	FDY	1	+460.5%
Spartan Delta Corp	SDE	0	+417.3%
Celestica Inc.	CLS	14	+374.6%
Bombardier Inc. (Class B)	BBD.B	7	+257.4%
Aritzia Inc.	ATZ	9	+195.9%
Cameco Corporation	CCO	5	+169.0%
TFI International Inc.	TFII	4	+76.8%

Top 5 Canadian Equity Mutual Funds — 3-Year Annualized Return

None appeared in any trial across any engine.

Fund	Times Cited by AI	3-Year Annualized Return
Purpose Global Resource Fund Series L	0	+52.3%
Friedberg Global-Macro Hedge Fund U$	0	+50.1%
CI Precious Metals Fund Series I	0	+50.1%
Dynamic Precious Metals Fund Series O	0	+48.3%
Ninepoint Silver Equities Fund Series D	0	+47.3%
RBC Canadian Equity Fund Series F	5	+19.4%
Guardian Canadian Focused Equity Series F	3	+23.8%

Top 5 Canadian-Listed ETFs — 3-Year Return

ETF	Ticker	Times Cited by AI	3-Year Return
BMO Equal Weight Global Gold Index ETF	ZGD	5	+53.7%
iShares S&P/TSX Global Gold Index ETF	XGD	7	+53.7%
BetaPro Gold Bullion 2x Daily Bull ETF	GLDU	1	+50.0%
BMO Junior Gold Index ETF	ZJG	5	+48.7%
Global X Global Semiconductor Index ETF	CHPS	1	+47.5%
CI Galaxy Bitcoin ETF (most-cited overall)	BTCX.B	10	+36.7%

Note: Gold ETF appearances reflect training-data familiarity with prominent gold funds, not a live performance screen. CI Galaxy Bitcoin does not meet the Canadian equity filter specified in the prompt; it was still the most-cited ETF in the entire study.

Methodology

Replication Parameters

Period: May 8–9, 2026 — Mississauga, Ontario, Canada
Sessions: Fresh incognito browser session per trial. Three independent trials per engine, logged out and restarted between trials where possible.
Mode: Default consumer chat mode. No deep research, thinking, or agent modes.
Engines: ChatGPT (GPT-5 or GPT-4o), Claude (Sonnet 4.x), Copilot (GPT-4o), Gemini (2.5 Pro), Grok 4, Perplexity (Sonar). All paid tiers.
Verification: Results checked against Buckler's verified market data as of April 30, 2026.