How is Arctom AI Mode Better than OpenAI GPT-5.5 and Google Gemini for Scientific Deep Research?

Technical Specification & Benchmark Report | Model: ARCTOM-AI-SCIENTIFIC-ADVANCED-001 | Production | May 2026

Abstract

We evaluated Arctom AI Mode against six Gemini configurations (four plain, two with Google Search grounding) and two OpenAI GPT-5.5 configurations on 30 scientific queries spanning drug mechanisms, PK/PD interactions, structure-activity relationships, and emerging therapeutics. Arctom AI achieves 96–98% citation fidelity, compared to 20–66% for plain Gemini and 76–92% for GPT-5.5. On complex PK queries, Gemini 2.5 Flash drops to 8%. Arctom AI Fast and Standard deliver first tokens in about 13 seconds, versus 110–125 seconds of total response time for GPT-5.5.

How citation fidelity is measured: Each reference cited by Gemini or GPT-5.5 is checked programmatically. DOIs are resolved via CrossRef and PMIDs via PubMed to retrieve the real paper. An independent AI verifier then compares the resolved paper against what was claimed, checking whether the topic, authors, and study match. If the DOI returns a 404 (the paper does not exist) or the resolved paper is on a different topic, the citation is marked as hallucinated. Arctom AI citations are verified by the model's built-in quality assurance layer.
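The check described above can be sketched roughly as follows. This is a minimal illustration, not the evaluation harness itself: the `Paper` record, the `classify` function, and the word-overlap threshold are assumptions introduced here, and the overlap heuristic is a simple stand-in for the independent AI verifier. The two URL helpers point at the public CrossRef works endpoint and the PubMed esummary endpoint; fetching and parsing the responses is left out so the sketch stays offline.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"
PUBMED_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


@dataclass
class Paper:
    """Hypothetical minimal record for a claimed or resolved reference."""
    title: str
    first_author: str


def crossref_url(doi: str) -> str:
    """URL for CrossRef metadata of a DOI (an unregistered DOI answers 404)."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def pubmed_url(pmid: str) -> str:
    """URL for the PubMed esummary record of a PMID, returned as JSON."""
    return f"{PUBMED_ESUMMARY}?db=pubmed&id={pmid}&retmode=json"


def _words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify(claimed: Paper, resolved: Optional[Paper], min_overlap: float = 0.5) -> str:
    """Label a citation 'verified' or 'hallucinated'.

    resolved is None when the identifier did not resolve (e.g. a 404).
    The title-overlap test is an assumed stand-in for the AI verifier.
    """
    if resolved is None:
        return "hallucinated"
    claimed_words = _words(claimed.title)
    if not claimed_words:
        return "hallucinated"
    overlap = len(claimed_words & _words(resolved.title)) / len(claimed_words)
    same_author = claimed.first_author.lower() == resolved.first_author.lower()
    return "verified" if overlap >= min_overlap and same_author else "hallucinated"
```

In practice the resolved `Paper` would be built from the JSON returned by those two endpoints, and the final comparison would be delegated to the verifier model rather than to a word-overlap rule.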

1. Citation Fidelity

Figure 1: Overall Citation Fidelity

System	Citation fidelity
Arctom AI Fast	96%
Arctom AI Standard	97%
Arctom AI Premium	98%
Gemini 2.5 Flash	20%
Gemini 2.5 Pro	42%
Gemini 3.0 Flash	40%
Gemini 3.0 Pro	66%
Gemini 3.0 Flash + Search	50%
Gemini 3.0 Pro + Search	89%
GPT-5.5	76%
GPT-5.5 + Web Search	92%

Figure 1. 30 scientific queries. Top: from model weights only. Bottom: with Google Search grounding via Gemini API.

On complex PK/drug interaction queries, Gemini 2.5 Flash achieves only 8% citation fidelity — virtually every reference is fabricated.

System	Common	Complex	Niche	Emerging	Overall
Arctom AI Fast	91%	100%	95%	96%	96%
Arctom AI Standard	95%	100%	96%	94%	97%
Arctom AI Premium	98%	99%	96%	97%	98%
Gemini 3.0 Pro	79%	47%	64%	73%	66%
Gemini 3.0 Flash	48%	21%	35%	55%	40%
Gemini 2.5 Pro	50%	25%	46%	52%	42%
Gemini 2.5 Flash	25%	8%	16%	35%	20%
GPT-5.5	85%	68%	73%	79%	76%
GPT-5.5 + Web Search	93%	94%	86%	96%	92%

Table 1. Arctom AI achieves 99–100% on complex queries where Gemini 2.5 Flash drops to 8%.

2. Reference Density

Figure 2: Verified References Per Response

System	Verified	Fabricated
Arctom AI Fast	8.9	not shown
Arctom AI Standard	11.9	not shown
Arctom AI Premium	18.3	not shown
Gemini 3.0 Flash	2.9	4.4
Gemini 3.0 Pro	5.1	2.6
Gemini 3.0 Flash + Search	4.5	4.5
Gemini 3.0 Pro + Search	3.6	0.4
GPT-5.5	12.3	3.9
GPT-5.5 + Web Search	15.5	1.3

Figure 2. Each Arctom AI reference is verified. In the Gemini configurations shown, 10–60% of cited references are fabricated.
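The per-response counts in Figure 2 can be related back to the citation fidelity figures with a short calculation. The sketch below assumes that fidelity is simply verified references divided by all cited references; the function name `fidelity` is introduced here for illustration. Because the inputs are rounded per-response means, the result can differ from the reported percentage by a point (Pro + Search comes out at 90% against the reported 89%).

```python
def fidelity(verified: float, fabricated: float) -> float:
    """Share of cited references that verified (assumed definition)."""
    total = verified + fabricated
    return verified / total if total else 0.0


# (verified, fabricated) per response, copied from Figure 2.
figure2 = {
    "Gemini 3.0 Flash": (2.9, 4.4),
    "Gemini 3.0 Pro": (5.1, 2.6),
    "Flash + Search": (4.5, 4.5),
    "Pro + Search": (3.6, 0.4),
    "GPT-5.5": (12.3, 3.9),
    "GPT-5.5 + Web Search": (15.5, 1.3),
}

for name, (verified, fabricated) in figure2.items():
    print(f"{name}: {fidelity(verified, fabricated):.0%}")
```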

System	Common	Complex	Niche	Emerging	Overall
Arctom AI Fast	6.6	10.1	8.0	11.0	8.9
Arctom AI Standard	10.0	14.6	10.9	12.0	11.9
Arctom AI Premium	16.6	21.9	15.4	19.1	18.3
Gemini 3.0 Pro	7.8	7.6	7.1	8.1	7.7
Gemini 3.0 Flash	7.4	7.2	7.1	7.3	7.3
Gemini 2.5 Pro	6.2	7.0	6.3	6.7	6.6
GPT-5.5	12.5	13.4	16.6	23.1	16.2
GPT-5.5 + Web Search	12.4	17.1	18.4	19.7	16.8
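The Overall column in the table above appears to be the mean of the four category values weighted by the number of queries in each category (8, 8, 7, and 7, per Appendix B). That weighting is an inference from the numbers rather than a stated method; the sketch below, with the hypothetical helper `weighted_overall`, shows that it reproduces the reported values for the rows checked.

```python
# Query counts per category, from Appendix B.
QUERY_COUNTS = {"Common": 8, "Complex": 8, "Niche": 7, "Emerging": 7}


def weighted_overall(per_category: dict) -> float:
    """Query-count-weighted mean of per-category values (assumed method)."""
    total_queries = sum(QUERY_COUNTS.values())
    weighted_sum = sum(per_category[name] * n for name, n in QUERY_COUNTS.items())
    return weighted_sum / total_queries


arctom_fast = {"Common": 6.6, "Complex": 10.1, "Niche": 8.0, "Emerging": 11.0}
print(round(weighted_overall(arctom_fast), 1))  # 8.9, matching the table
```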

3. Reference Quality & Recency

Per-System Breakdown

Citation volume, recency, and journal breadth across all 11 systems.

System	Total Refs	2025–26	2024	2023	≤2022	% Recent	Unique Journals
Arctom AI Fast	534	286	34	24	164	56%	193
Arctom AI Standard	714	302	70	52	264	43%	240
Arctom AI Premium	1100	586	86	56	314	56%	342
Gemini 2.5 Flash	313	0	0	6	217	0%	336
Gemini 2.5 Pro	192	0	0	6	260	0%	358
Gemini 3.0 Flash	208	0	0	7	282	0%	343
Gemini 3.0 Pro	230	0	0	4	155	0%	206
3.0 Flash + Search	369	15	19	19	316	4%	343
3.0 Pro + Search	109	2	3	7	97	1%	206
GPT-5.5	379	0	1	11	367	0%	195
GPT-5.5 + Web Search	360	4	20	13	323	1%	514

Total references cited across 30 queries. "Unique Journals" = verified via PubMed for Arctom AI; claimed from text for Gemini (only 20–66% of plain Gemini citations verify).
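One reading of the "% Recent" column that reproduces the values for the rows checked below is the 2025–26 count divided by the sum of the four year columns, truncated to a whole percent. This is an inference from the table, not a documented formula, and the helper name `recent_share` is introduced here only for illustration.

```python
def recent_share(y2025_26: int, y2024: int, y2023: int, y2022_or_earlier: int) -> int:
    """Percent of dated references from 2025-26, truncated (assumed formula)."""
    dated = y2025_26 + y2024 + y2023 + y2022_or_earlier
    return 100 * y2025_26 // dated if dated else 0


# Year-bucket counts copied from the per-system breakdown table.
rows = {
    "Arctom AI Fast": (286, 34, 24, 164),
    "Arctom AI Standard": (302, 70, 52, 264),
    "Arctom AI Premium": (586, 86, 56, 314),
    "3.0 Flash + Search": (15, 19, 19, 316),
    "3.0 Pro + Search": (2, 3, 7, 97),
    "GPT-5.5 + Web Search": (4, 20, 13, 323),
}

for system, counts in rows.items():
    print(f"{system}: {recent_share(*counts)}%")
```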

Journal Quality

Arctom AI top journals (verified via PubMed): The New England Journal of Medicine (6), Clinical Pharmacology & Therapeutics (5), Journal of Medicinal Chemistry (4), Pharmacology & Therapeutics (4).

Gemini journal claims: Gemini models collectively claim 131 citations to the New England Journal of Medicine across 30 queries. When we verified each DOI, 58% resolved to real NEJM papers — exclusively landmark trials the model memorized. The remaining 42% were fabricated DOIs with valid NEJM prefix format (10.1056/NEJMoa...) that do not correspond to any existing paper.

GPT-5.5 recency cliff: Plain GPT-5.5 cites 0 papers from 2025–26 across all 30 queries. Even with the web_search tool enabled, only 1% of citations are recent (4 of 360). GPT-5.5's January 2026 training cutoff is structural — web grounding rarely promotes new literature into the response. Arctom AI Standard cites 43% from 2025–26 on the same queries.

Gemini can remind you of famous papers. GPT-5.5 grounded surfaces verified ones at scale. Arctom AI covers the recent literature that neither reaches.

4. Speed of Answer

Figure 3: Time to First Token (lower is better)

System	Time to first token
Arctom AI Fast	13.0s
Arctom AI Standard	12.8s
Arctom AI Premium	28.9s
Gemini 3.0 Flash*	13.8s
Gemini 3.0 Pro*	55.7s
Gemini 3.0 Flash + Search*	24.5s
Gemini 3.0 Pro + Search*	69.6s
GPT-5.5*	109.5s
GPT-5.5 + Web Search*	125.3s

Figure 3. *Non-Arctom systems = total response time (no streaming). Arctom AI Fast/Standard are ~8× faster than GPT-5.5 while achieving higher citation fidelity.

System	Common	Complex	Niche	Emerging	Overall
Arctom AI Fast	12.0s	13.3s	13.2s	13.6s	13.0s
Arctom AI Standard	12.8s	13.0s	12.3s	13.1s	12.8s
Arctom AI Premium	28.4s	27.6s	26.5s	33.5s	28.9s
Gemini 3.0 Flash*	14.0s	13.8s	14.1s	13.3s	13.8s
Gemini 3.0 Pro*	60.4s	58.4s	52.6s	50.3s	55.7s
GPT-5.5*	~110s overall (no streaming)				109.5s
GPT-5.5 + Web Search*	~125s overall (no streaming)				125.3s
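The asterisked rows report total response time, while the Arctom AI rows report time to first token. The difference between the two measurements can be sketched as below. This is an illustrative timing harness, not the one used for this report; `time_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for a real streaming API client.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_stream(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Return (time to first chunk, total time, full text) for a chunk stream."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    if first_chunk_at is None:
        # Nothing was streamed: the first-token time falls back to the total.
        first_chunk_at = total
    return first_chunk_at, total, "".join(parts)


def fake_stream(delay: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming model response (assumption for the demo)."""
    for word in ("alpha ", "beta ", "gamma"):
        time.sleep(delay)
        yield word


ttft, total, text = time_stream(fake_stream())
print(f"first token after {ttft:.3f}s, complete after {total:.3f}s")
```

For a system called without streaming, the whole response arrives as a single chunk, so the two numbers coincide. That is why the asterisked rows are an upper bound on time to first token rather than a like-for-like measurement.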

5. Arctom AI Tier Comparison

	Fast	Standard	Premium
Best for	Quick mechanistic lookups	Drug interaction analysis	Comprehensive SAR/PK reviews
Citation Fidelity	96%	97%	98%
References / response	9	12	18
Avg words	~401	~694	~1,120
Time to first token	13.0s	12.8s	28.9s

6. Head-to-Head Comparisons

Direct comparisons between matched tiers — same-class models, all metrics.

Arctom AI Fast vs Gemini 3.0 Flash

Metric	Arctom AI Fast	Gemini 3.0 Flash
Citation Fidelity	96%	40%
References / response	8.9 verified	2.9 verified (4.4 fabricated)
% from 2025–2026	56%	0%
Speed (TTFT)	13.0s	13.8s

Arctom AI Premium vs Gemini 3.0 Pro

Metric	Arctom AI Premium	Gemini 3.0 Pro
Citation Fidelity	98%	66%
References / response	18.3 verified	5.1 verified (2.6 fabricated)
% from 2025–2026	56%	0%
Speed (TTFT)	28.9s	55.7s

Arctom AI Fast vs Gemini 3.0 Flash + Google Search

Metric	Arctom AI Fast	Flash + Search
Citation Fidelity	96%	50%
References / response	8.9 verified	4.5 verified (4.5 fabricated)
% from 2025–2026	56%	4%
Speed (TTFT)	13.0s	24.5s

Arctom AI Premium vs Gemini 3.0 Pro + Google Search

Metric	Arctom AI Premium	Pro + Search
Citation Fidelity	98%	89%
References / response	18.3 verified	3.6 verified (0.4 fabricated)
% from 2025–2026	56%	1%
Speed (TTFT)	28.9s	69.6s

Arctom AI Standard vs GPT-5.5

Metric	Arctom AI Standard	GPT-5.5 (plain)
Citation Fidelity	97%	76%
References / response	11.9 verified	12.3 verified (3.9 fabricated)
% from 2025–2026	43%	0%
Speed (TTFT)	12.8s	109.5s

Arctom AI Premium vs GPT-5.5 + Web Search

Metric	Arctom AI Premium	GPT-5.5 + Web Search
Citation Fidelity	98%	92%
References / response	18.3 verified	15.5 verified (1.3 fabricated)
% from 2025–2026	56%	1%
Speed (TTFT)	28.9s	125.3s

7. Summary

No speed penalty for verified citations. Arctom AI Fast/Standard deliver first tokens in about 13s, matching or beating Gemini 3.0 Flash (13.8s), while achieving 96–97% citation fidelity versus 40%.

Complex queries are where it matters most. On PK/drug interaction questions, Gemini 2.5 Flash drops to 8% citation fidelity. Arctom AI stays at 99–100%.

2.4× more references. Arctom AI Premium cites 18.3 verified papers per response versus 7.7 cited for Gemini 3.0 Pro, of which only 5.1 verify.

GPT-5.5 findings (added May 2026)

GPT-5.5 is the strongest external model we have evaluated against Arctom AI — but still trails Arctom AI by 4–6 points on citation fidelity (92% vs 96–98%) while running ~8× slower.

GPT-5.5 with web_search beats Gemini 3.0 Pro + Search (92% vs 89% CF) and lifts the verified reference count to 15.5 per response — close to Arctom AI Premium's 18.3. Without the web tool, plain GPT-5.5 falls to 76% CF, still ahead of every plain Gemini variant but well below Arctom AI.

Recency is the structural ceiling. GPT-5.5's January 2026 training cutoff means plain calls cite zero papers from 2025–26. Even web_search grounding lifts that to only 1%. Arctom AI Standard cites 43% recent literature on the same queries.

Latency penalty is severe. GPT-5.5 takes 110–125s per query because ~65% of output tokens are invisible reasoning. Arctom AI Standard delivers its first token on the same queries in 12.8s.

Failure mode for plain GPT-5.5: fabricated identifiers. The model often gets author, journal, year, and title correct but invents the matching PMID. Example from this evaluation: GPT-5.5 cited PMID 18315556 as Wong et al.'s apixaban Factor Xa paper — but that PMID actually resolves to a von Willebrand factor study (the real Wong PMID is 18315548). The DOI for the same reference was correct. This is a structural risk for any clinical workflow that relies on the PMID for traceability.
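A workflow that depends on PMIDs can guard against this failure mode by resolving both identifiers and checking that they point at the same paper. The sketch below assumes the two titles have already been retrieved (for example via CrossRef for the DOI and PubMed for the PMID). The function name `cross_check`, the status labels, and the placeholder titles are all illustrative rather than taken from the evaluation.

```python
import re
from typing import Optional


def _normalize(title: str) -> str:
    """Lowercase and strip punctuation so minor formatting differences match."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def cross_check(title_via_doi: Optional[str], title_via_pmid: Optional[str]) -> str:
    """Compare the papers reached through a reference's DOI and its PMID.

    Pass None for an identifier that did not resolve.
    """
    if title_via_doi is None or title_via_pmid is None:
        return "unresolved"
    if _normalize(title_via_doi) == _normalize(title_via_pmid):
        return "consistent"
    return "identifier_mismatch"


# Placeholder titles, not real metadata for the PMIDs discussed above.
print(cross_check("Example anticoagulant paper.", "Example Anticoagulant Paper"))
print(cross_check("Example anticoagulant paper.", "Unrelated clotting factor study"))
```

Exact matching after normalization is deliberately strict; a production check would likely tolerate small title variations between the two databases.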

Appendix A: Evaluation Setup

Eleven systems evaluated: three Arctom AI tiers, four plain Gemini configurations, two Gemini + Google Search configurations, and two OpenAI GPT-5.5 configurations (plain and with the web_search tool). All received identical queries. Non-Arctom systems were prompted to cite sources with DOI/PubMed links. Clinical safety scored by Gemini 3.0 Pro (temp=0, batch). Citation fidelity verified by resolving each DOI via CrossRef and each PMID via PubMed esummary, then matching the resolved paper against the cited claim with an independent LLM verifier (Gemini Flash-Lite, temp=0).

GPT-5.5 calls used OpenAI's /v1/responses endpoint (Chat Completions rejects the web_search tool). GPT-5.5 was added to this report on May 15, 2026, three weeks after its April 24 release.

A.5 Gemini Prompt

All Gemini systems received the following prompt template for each query:

You are a medical research assistant. Answer the following research question thoroughly.
Support every claim with citations to peer-reviewed sources. For each citation, include:
- First author et al.
- Paper title
- Journal name
- Year of publication
- DOI or PubMed link if available

Target approximately [TARGET_WORDS] words for the main answer (excluding references).
Format your references in a numbered list at the end.

Research question: [QUERY]

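Gemini models were called via the Gemini API (generativelanguage.googleapis.com) with the prompt above. For the four plain configurations, no search grounding or retrieval tools were enabled, so responses are generated entirely from model parameters. The two + Search configurations additionally enabled Google Search grounding through the same API.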

Target word counts were matched to the corresponding Arctom AI tier to ensure comparable output length.
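Filling the template for a given query and tier might look like the sketch below. The word targets are the average word counts from the tier comparison table, used here as an assumption about what "matched" means; the report does not state the exact targets. The helper name `build_prompt` is introduced for illustration.

```python
PROMPT_TEMPLATE = """You are a medical research assistant. Answer the following research question thoroughly.
Support every claim with citations to peer-reviewed sources. For each citation, include:
- First author et al.
- Paper title
- Journal name
- Year of publication
- DOI or PubMed link if available

Target approximately [TARGET_WORDS] words for the main answer (excluding references).
Format your references in a numbered list at the end.

Research question: [QUERY]"""

# Assumed targets: the average word counts reported per Arctom AI tier.
TARGET_WORDS = {"Fast": 401, "Standard": 694, "Premium": 1120}


def build_prompt(query: str, tier: str) -> str:
    """Substitute the word target and the query into the prompt template."""
    prompt = PROMPT_TEMPLATE.replace("[TARGET_WORDS]", str(TARGET_WORDS[tier]))
    return prompt.replace("[QUERY]", query)


print(build_prompt("What is the prodrug design rationale for enalapril vs lisinopril?", "Fast"))
```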

Appendix B: Test Queries

B.1 Common — Drug Mechanisms (8)

#	Query
1	What is apixaban's mechanism of action, its chemical structure properties (MW, LogP), and the key clinical evidence from the ARISTOTLE trial?
2	What is metformin's mechanism of action including OCT1/OCT2 transporter dependence, its physicochemical properties, and cardiovascular outcome evidence?
3	What is atorvastatin's HMG-CoA reductase binding mechanism, the role of its ortho-fluorophenyl pharmacophore, and evidence from the ASTEROID trial?
4	What is the mechanism and CYP2D6 metabolism of tamoxifen to endoxifen, and what is the evidence from the NSABP P-1 trial?
5	How do GLP-1 receptor agonists (liraglutide vs semaglutide) differ in chemical structure, half-life, and cardiovascular outcomes?
6	What are the chemical and pharmacological differences between SGLT2 inhibitors and their renal outcome evidence?
7	What is lithium's mechanism of action (GSK-3β, inositol phosphatase), its narrow therapeutic index, and suicide risk reduction evidence?
8	What are the PK/PD differences between concentration-dependent vs time-dependent antibiotics and their dosing implications?

B.2 Complex — PK/Drug Interactions (8)

#	Query
9	How does tacrolimus interact with azole antifungals via CYP3A4, what is the AUC increase magnitude, and what FDA guidance exists?
10	What is the amiodarone-warfarin interaction mechanism including CYP2C9 inhibition and quantitative INR changes?
11	How do JAK inhibitors differ in JAK selectivity, and what is the comparative safety data for thrombosis and malignancy?
12	What are the pharmacogenomic predictors of fluoropyrimidine toxicity and how should DPYD testing guide dosing?
13	How do DOACs perform in obese patients — PK changes by weight category and clinical outcome data?
14	What is the evidence for CYP2D6 genotype-guided dosing of venlafaxine?
15	What are the mechanisms of PPI-clopidogrel interaction via CYP2C19, including cardiovascular outcome meta-analyses?
16	How does rifampicin induce CYP3A4/CYP2C9/P-gp and what is the impact on oral contraceptive hormone levels?

B.3 Niche — SAR/Prodrug Design (7)

#	Query
17	Why does curcumin have poor oral bioavailability and what formulation strategies have been tried?
18	What is the prodrug design rationale for enalapril vs lisinopril?
19	How does paclitaxel Cremophor-EL compare to nab-paclitaxel in PK and clinical outcomes?
20	What chemical properties of fentanyl enable transdermal delivery and what are the FDA dosing conversion ratios?
21	What is the SAR of fluoroquinolones and the chemical basis for QTc prolongation risk?
22	How do esomeprazole and omeprazole differ and what is the evidence for clinical superiority?
23	Why do monoclonal antibodies have 60–80% SC bioavailability and what role does FcRn play?

B.4 Emerging — Novel Targets (7)

#	Query
24	What are the differences between BTK inhibitors (ibrutinib, acalabrutinib, zanubrutinib) and their trial evidence?
25	What is venetoclax's BCL-2 selectivity and evidence from MURANO and CLL14 trials?
26	What are the PARP inhibitor differences for BRCA-mutant vs HRD-positive patient selection?
27	What is the mechanism and evidence for dupilumab across atopic dermatitis, asthma, and CRS?
28	How do CDK4/6 inhibitors differ in selectivity and comparative evidence in HR+ breast cancer?
29	What is the current evidence for CAR-T therapy in solid tumors?
30	What is the evidence for ASO and siRNA therapeutics including nusinersen and patisiran?

About Arctom AI Mode
Arctom AI Mode is the most accurate AI model for pharmacology and drug science research, achieving 96–98% citation fidelity across 30 scientific queries. Purpose-built for PK/PD analysis, structure-activity relationships, and drug interaction research — with every reference verified against PubMed.