How is Arctom AI Mode Better than OpenAI GPT-5.5 and Google Gemini for Scientific Deep Research?

Technical Specification & Benchmark Report  |  Model: ARCTOM-AI-SCIENTIFIC-ADVANCED-001  |  Production  |  May 2026

Abstract

We evaluated Arctom AI Mode against six Gemini configurations (four plain, two with Google Search grounding) and two OpenAI GPT-5.5 configurations on 30 scientific queries spanning drug mechanisms, PK/PD interactions, structure-activity relationships, and emerging therapeutics. Arctom AI achieves 96–98% citation fidelity, compared with 20–66% for plain Gemini and 76–92% for GPT-5.5. On complex PK queries, Gemini 2.5 Flash drops to 8%. Arctom AI Fast and Standard deliver first tokens in about 13 seconds; GPT-5.5 needs 110–125 seconds to return a complete, non-streamed response.

How citation fidelity is measured: Each reference cited by a non-Arctom system (Gemini or GPT-5.5) is checked programmatically. DOIs are resolved via CrossRef and PMIDs via PubMed to retrieve the real paper. An independent AI verifier then compares the resolved paper against what was claimed, checking whether the topic, authors, and study match. If the DOI returns a 404 (the paper does not exist) or the resolved paper is on a different topic, the citation is marked as hallucinated. Arctom AI citations are verified by the model's built-in quality assurance layer.
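
The flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the evaluation harness itself: `Citation`, `Verdict`, `lookup_url`, `classify_citation`, and `citation_fidelity` are names introduced here, the `resolve` and `same_paper` callables stand in for the HTTP lookup and the independent AI verifier, and treating a citation with no identifier as hallucinated is an assumption the report does not state. The URL patterns are the public CrossRef REST and NCBI E-utilities esummary endpoints.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence
from urllib.parse import quote


class Verdict(Enum):
    VERIFIED = "verified"
    HALLUCINATED = "hallucinated"


@dataclass
class Citation:
    claimed_title: str
    doi: Optional[str] = None
    pmid: Optional[str] = None


def lookup_url(c: Citation) -> Optional[str]:
    """Build the metadata lookup URL (DOI via CrossRef preferred, else PMID via PubMed)."""
    if c.doi:
        return "https://api.crossref.org/works/" + quote(c.doi, safe="/")
    if c.pmid:
        return ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
                f"?db=pubmed&id={quote(c.pmid)}&retmode=json")
    return None


def classify_citation(
    c: Citation,
    resolve: Callable[[str], Optional[str]],
    same_paper: Callable[[str, str], bool],
) -> Verdict:
    """resolve(url) returns the resolved title, or None on a 404.
    same_paper(claimed, resolved) is a placeholder for the independent verifier."""
    url = lookup_url(c)
    if url is None:
        # Assumption: a citation without any identifier cannot be verified.
        return Verdict.HALLUCINATED
    resolved_title = resolve(url)
    if resolved_title is None:
        return Verdict.HALLUCINATED  # identifier does not resolve
    if not same_paper(c.claimed_title, resolved_title):
        return Verdict.HALLUCINATED  # resolves, but to a different paper
    return Verdict.VERIFIED


def citation_fidelity(verdicts: Sequence[Verdict]) -> float:
    """Fraction of citations marked verified (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    return sum(v is Verdict.VERIFIED for v in verdicts) / len(verdicts)
```

Injecting `resolve` and `same_paper` keeps the classification rule testable offline; in a real run they would wrap an HTTP client and the verifier model respectively.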

1. Citation Fidelity

Figure 1: Overall Citation Fidelity

Arctom AI Fast: 96%
Arctom AI Standard: 97%
Arctom AI Premium: 98%
Gemini 2.5 Flash: 20%
Gemini 2.5 Pro: 42%
Gemini 3.0 Flash: 40%
Gemini 3.0 Pro: 66%
Gemini 3.0 Flash + Search: 50%
Gemini 3.0 Pro + Search: 89%
GPT-5.5: 76%
GPT-5.5 + Web Search: 92%

Figure 1. 30 scientific queries. Top: from model weights only. Bottom: with Google Search grounding via Gemini API.

On complex PK/drug interaction queries, Gemini 2.5 Flash achieves only 8% citation fidelity: virtually every reference is fabricated.

System | Common | Complex | Niche | Emerging | Overall
Arctom AI Fast | 91% | 100% | 95% | 96% | 96%
Arctom AI Standard | 95% | 100% | 96% | 94% | 97%
Arctom AI Premium | 98% | 99% | 96% | 97% | 98%
Gemini 3.0 Pro | 79% | 47% | 64% | 73% | 66%
Gemini 3.0 Flash | 48% | 21% | 35% | 55% | 40%
Gemini 2.5 Pro | 50% | 25% | 46% | 52% | 42%
Gemini 2.5 Flash | 25% | 8% | 16% | 35% | 20%
GPT-5.5 | 85% | 68% | 73% | 79% | 76%
GPT-5.5 + Web Search | 93% | 94% | 86% | 96% | 92%

Table 1. Arctom AI achieves 99–100% on complex queries where Gemini 2.5 Flash drops to 8%.

2. Reference Density

Figure 2: Verified References Per Response

Arctom AI Fast: 8.9 verified
Arctom AI Standard: 11.9 verified
Arctom AI Premium: 18.3 verified
Gemini 3.0 Flash: 2.9 verified, 4.4 fabricated
Gemini 3.0 Pro: 5.1 verified, 2.6 fabricated
Gemini 3.0 Flash + Search: 4.5 verified, 4.5 fabricated
Gemini 3.0 Pro + Search: 3.6 verified, 0.4 fabricated
GPT-5.5: 12.3 verified, 3.9 fabricated
GPT-5.5 + Web Search: 15.5 verified, 1.3 fabricated

Figure 2. Each Arctom AI reference is verified. For plain Gemini, 34–80% of cited references are fabricated (the complement of the 20–66% citation fidelity in Figure 1).
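
The per-response counts in Figure 2 and the fidelity percentages in Figure 1 appear to be linked by a simple ratio. The short check below assumes fidelity = verified / (verified + fabricated); the function name `fidelity_pct` is illustrative and not taken from the evaluation code.

```python
def fidelity_pct(verified: float, fabricated: float) -> float:
    """Citation fidelity in percent from per-response reference counts."""
    total = verified + fabricated
    if total == 0:
        raise ValueError("no references cited")
    return 100.0 * verified / total


# (verified, fabricated) per response, copied from Figure 2.
figure2 = {
    "Gemini 3.0 Flash": (2.9, 4.4),
    "Gemini 3.0 Pro": (5.1, 2.6),
    "Gemini 3.0 Flash + Search": (4.5, 4.5),
    "Gemini 3.0 Pro + Search": (3.6, 0.4),
    "GPT-5.5": (12.3, 3.9),
    "GPT-5.5 + Web Search": (15.5, 1.3),
}

for name, (verified, fabricated) in figure2.items():
    print(f"{name}: {fidelity_pct(verified, fabricated):.1f}%")
```

Under this assumption the ratios land within one point of the Figure 1 values. Gemini 3.0 Pro + Search comes out at 90% against the reported 89%, which is plausibly a rounding effect of the one-decimal averages.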

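References cited per response by query category (verified only for Arctom AI; verified plus fabricated for the other systems):

System | Common | Complex | Niche | Emerging | Overall
Arctom AI Fast | 6.6 | 10.1 | 8.0 | 11.0 | 8.9
Arctom AI Standard | 10.0 | 14.6 | 10.9 | 12.0 | 11.9
Arctom AI Premium | 16.6 | 21.9 | 15.4 | 19.1 | 18.3
Gemini 3.0 Pro | 7.8 | 7.6 | 7.1 | 8.1 | 7.7
Gemini 3.0 Flash | 7.4 | 7.2 | 7.1 | 7.3 | 7.3
Gemini 2.5 Pro | 6.2 | 7.0 | 6.3 | 6.7 | 6.6
GPT-5.5 | 12.5 | 13.4 | 16.6 | 23.1 | 16.2
GPT-5.5 + Web Search | 12.4 | 17.1 | 18.4 | 19.7 | 16.8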
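
The Overall column in this table is consistent with a query-weighted mean of the four category columns, using the query counts from Appendix B (8 Common, 8 Complex, 7 Niche, 7 Emerging). The sketch below reproduces it; `QUERY_COUNTS` and `overall` are names introduced here for illustration.

```python
from typing import Dict

# Number of test queries per category, from Appendix B.
QUERY_COUNTS: Dict[str, int] = {"Common": 8, "Complex": 8, "Niche": 7, "Emerging": 7}


def overall(per_category: Dict[str, float]) -> float:
    """Query-weighted mean of per-category averages."""
    total_queries = sum(QUERY_COUNTS.values())
    weighted_sum = sum(per_category[cat] * n for cat, n in QUERY_COUNTS.items())
    return weighted_sum / total_queries


arctom_fast = {"Common": 6.6, "Complex": 10.1, "Niche": 8.0, "Emerging": 11.0}
gpt55_web = {"Common": 12.4, "Complex": 17.1, "Niche": 18.4, "Emerging": 19.7}

print(round(overall(arctom_fast), 1))  # 8.9, matching the table
print(round(overall(gpt55_web), 1))    # 16.8, matching the table
```

The same weighting does not reproduce every Overall value in Table 1 exactly, which suggests citation fidelity is pooled per citation rather than averaged per query; that reading is an inference, not something the report states.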

3. Reference Quality & Recency

Per-System Breakdown

Citation volume, recency, and journal breadth across all 11 systems.

System | Total Refs | 2025–26 | 2024 | 2023 | ≤2022 | % Recent | Unique Journals
Arctom AI Fast | 534 | 286 | 34 | 24 | 164 | 56% | 193
Arctom AI Standard | 714 | 302 | 70 | 52 | 264 | 43% | 240
Arctom AI Premium | 1100 | 586 | 86 | 56 | 314 | 56% | 342
Gemini 2.5 Flash | 313 | 0 | 0 | 6 | 217 | 0% | 336
Gemini 2.5 Pro | 192 | 0 | 0 | 6 | 260 | 0% | 358
Gemini 3.0 Flash | 208 | 0 | 0 | 7 | 282 | 0% | 343
Gemini 3.0 Pro | 230 | 0 | 0 | 4 | 155 | 0% | 206
Gemini 3.0 Flash + Search | 369 | 15 | 19 | 19 | 316 | 4% | 343
Gemini 3.0 Pro + Search | 109 | 2 | 3 | 7 | 97 | 1% | 206
GPT-5.5 | 379 | 0 | 1 | 11 | 367 | 0% | 195
GPT-5.5 + Web Search | 360 | 4 | 20 | 13 | 323 | 1% | 514

Total references cited across 30 queries. "Unique Journals" is verified via PubMed for Arctom AI and taken from the response text for Gemini, where only 20–66% of plain-Gemini citations match the paper their identifier resolves to.

Journal Quality

Arctom AI top journals (verified via PubMed): The New England Journal of Medicine (6), Clinical Pharmacology & Therapeutics (5), Journal of Medicinal Chemistry (4), Pharmacology & Therapeutics (4).

Gemini journal claims: Gemini models collectively claim 131 citations to the New England Journal of Medicine across 30 queries. When we verified each DOI, 58% resolved to real NEJM papers — exclusively landmark trials the model memorized. The remaining 42% were fabricated DOIs with valid NEJM prefix format (10.1056/NEJMoa...) that do not correspond to any existing paper.

GPT-5.5 recency cliff: Plain GPT-5.5 cites 0 papers from 2025–26 across all 30 queries. Even with the web_search tool enabled, only 1% of citations are recent (4 of 360). GPT-5.5's January 2026 training cutoff is structural — web grounding rarely promotes new literature into the response. Arctom AI Standard cites 43% from 2025–26 on the same queries.

Gemini can remind you of famous papers. GPT-5.5 with web search surfaces verified ones at scale. Arctom AI covers the recent literature that neither reaches.

4. Speed of Answer

Figure 3: Time to First Token (lower is better)

Arctom AI Fast: 13.0s
Arctom AI Standard: 12.8s
Arctom AI Premium: 28.9s
Gemini 3.0 Flash*: 13.8s
Gemini 3.0 Pro*: 55.7s
Gemini 3.0 Flash + Search*: 24.5s
Gemini 3.0 Pro + Search*: 69.6s
GPT-5.5*: 109.5s
GPT-5.5 + Web Search*: 125.3s

Figure 3. *For non-Arctom systems, the value is total response time (no streaming). Arctom AI Fast/Standard deliver a first token roughly 8× sooner than plain GPT-5.5 returns its full response (13.0s vs 109.5s), while achieving higher citation fidelity.

System | Common | Complex | Niche | Emerging | Overall
Arctom AI Fast | 12.0s | 13.3s | 13.2s | 13.6s | 13.0s
Arctom AI Standard | 12.8s | 13.0s | 12.3s | 13.1s | 12.8s
Arctom AI Premium | 28.4s | 27.6s | 26.5s | 33.5s | 28.9s
Gemini 3.0 Flash* | 14.0s | 13.8s | 14.1s | 13.3s | 13.8s
Gemini 3.0 Pro* | 60.4s | 58.4s | 52.6s | 50.3s | 55.7s
GPT-5.5* | ~110s overall (no streaming); per-category times not listed | 109.5s
GPT-5.5 + Web Search* | ~125s overall (no streaming); per-category times not listed | 125.3s
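
The distinction between time to first token and total response time can be made concrete with a small timing helper. This is a sketch under stated assumptions, not the harness used for the figures above: `time_stream` and `fake_stream` are hypothetical names, and the response is modelled as any iterable of text chunks.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_stream(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Return (seconds to first chunk, total seconds, concatenated text)."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    if first_chunk_at is None:
        first_chunk_at = total  # empty stream: nothing ever arrived
    return first_chunk_at, total, "".join(parts)


def fake_stream(first_delay: float, chunk_delay: float, n_chunks: int) -> Iterator[str]:
    """Stand-in for a model response; the delays are made-up values."""
    time.sleep(first_delay)
    yield "a"
    for _ in range(n_chunks - 1):
        time.sleep(chunk_delay)
        yield "b"


ttft, total, text = time_stream(fake_stream(0.05, 0.01, 3))
print(f"first token after {ttft:.3f}s, complete after {total:.3f}s")
```

A non-streaming call behaves like a stream with a single chunk, so its time to first token equals its total response time. That is why the starred rows report one number for both.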

5. Arctom AI Tier Comparison

Metric | Fast | Standard | Premium
Best for | Quick mechanistic lookups | Drug interaction analysis | Comprehensive SAR/PK reviews
Citation Fidelity | 96% | 97% | 98%
References / response | 9 | 12 | 18
Avg words | ~401 | ~694 | ~1,120
Time to first token | 13.0s | 12.8s | 28.9s

6. Head-to-Head Comparisons

Direct comparisons between matched tiers — same-class models, all metrics.

Arctom AI Fast vs Gemini 3.0 Flash

Metric | Arctom AI Fast | Gemini 3.0 Flash
Citation Fidelity | 96% | 40%
References / response | 8.9 verified | 2.9 verified (4.4 fabricated)
% from 2025–2026 | 56% | 0%
Speed (TTFT) | 13.0s | 13.8s

Arctom AI Premium vs Gemini 3.0 Pro

Metric | Arctom AI Premium | Gemini 3.0 Pro
Citation Fidelity | 98% | 66%
References / response | 18.3 verified | 5.1 verified (2.6 fabricated)
% from 2025–2026 | 56% | 0%
Speed (TTFT) | 28.9s | 55.7s

Arctom AI Fast vs Gemini 3.0 Flash + Google Search

Metric | Arctom AI Fast | Gemini 3.0 Flash + Search
Citation Fidelity | 96% | 50%
References / response | 8.9 verified | 4.5 verified (4.5 fabricated)
% from 2025–2026 | 56% | 4%
Speed (TTFT) | 13.0s | 24.5s

Arctom AI Premium vs Gemini 3.0 Pro + Google Search

Metric | Arctom AI Premium | Gemini 3.0 Pro + Search
Citation Fidelity | 98% | 89%
References / response | 18.3 verified | 3.6 verified (0.4 fabricated)
% from 2025–2026 | 56% | 1%
Speed (TTFT) | 28.9s | 69.6s

Arctom AI Standard vs GPT-5.5

Metric | Arctom AI Standard | GPT-5.5 (plain)
Citation Fidelity | 97% | 76%
References / response | 11.9 verified | 12.3 verified (3.9 fabricated)
% from 2025–2026 | 43% | 0%
Speed (TTFT) | 12.8s | 109.5s

Arctom AI Premium vs GPT-5.5 + Web Search

Metric | Arctom AI Premium | GPT-5.5 + Web Search
Citation Fidelity | 98% | 92%
References / response | 18.3 verified | 15.5 verified (1.3 fabricated)
% from 2025–2026 | 56% | 1%
Speed (TTFT) | 28.9s | 125.3s

7. Summary

No speed penalty for verified citations. Arctom AI Fast/Standard deliver first tokens in 13s, matching or beating Gemini 3.0 Flash (13.8s), while achieving 96–97% citation fidelity versus 40%.

Complex queries are where it matters most. On PK/drug interaction questions, Gemini 2.5 Flash drops to 8% citation fidelity. Arctom AI stays at 99–100%.

2.4× as many references. Arctom AI Premium cites 18.3 verified papers per response versus 7.7 cited (5.1 verified) for Gemini 3.0 Pro.

GPT-5.5 findings (added May 2026)

GPT-5.5 is the strongest external model we have evaluated against Arctom AI — but still trails Arctom AI by 4–6 points on citation fidelity (92% vs 96–98%) while running ~8× slower.

GPT-5.5 with web_search beats Gemini 3.0 Pro + Search (92% vs 89% CF) and lifts the verified reference count to 15.5 per response — close to Arctom AI Premium's 18.3. Without the web tool, plain GPT-5.5 falls to 76% CF, still ahead of every plain Gemini variant but well below Arctom AI.

Recency is the structural ceiling. GPT-5.5's January 2026 training cutoff means plain calls cite zero papers from 2025–26. Even web_search grounding lifts that to only 1%. Arctom AI Standard cites 43% recent literature on the same queries.

Latency penalty is severe. GPT-5.5 takes 110–125s per query because ~65% of output tokens are invisible reasoning. Arctom AI Standard delivers its first token on the same queries in 12.8s.

Failure mode for plain GPT-5.5: fabricated identifiers. The model often gets author, journal, year, and title correct but invents the matching PMID. Example from this evaluation: GPT-5.5 cited PMID 18315556 as Wong et al.'s apixaban Factor Xa paper — but that PMID actually resolves to a von Willebrand factor study (the real Wong PMID is 18315548). The DOI for the same reference was correct. This is a structural risk for any clinical workflow that relies on the PMID for traceability.
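
One way to catch this failure mode is to cross-check the two identifiers against each other. The sketch below assumes the PubMed esummary JSON record for a PMID carries an `articleids` list of `idtype`/`value` entries; that shape, the helper names `doi_from_esummary` and `pmid_matches_doi`, and the placeholder DOIs are all assumptions for illustration and not data from this evaluation.

```python
from typing import Any, Dict, Optional


def doi_from_esummary(record: Dict[str, Any]) -> Optional[str]:
    """Extract the DOI from one per-article esummary record (assumed shape)."""
    for entry in record.get("articleids", []):
        if entry.get("idtype") == "doi":
            value = entry.get("value", "").strip().lower()
            return value or None
    return None


def pmid_matches_doi(cited_doi: str, pmid_record: Dict[str, Any]) -> bool:
    """True only if the record the PMID resolves to carries the cited DOI."""
    resolved = doi_from_esummary(pmid_record)
    return resolved is not None and resolved == cited_doi.strip().lower()


# Made-up record: the cited PMID resolves to a paper with a different DOI.
wrong_paper = {
    "uid": "0000001",
    "articleids": [
        {"idtype": "pubmed", "value": "0000001"},
        {"idtype": "doi", "value": "10.1000/example-b"},
    ],
}
print(pmid_matches_doi("10.1000/example-a", wrong_paper))  # False
```

A mismatch does not say which identifier is wrong, only that the pair is inconsistent and the reference needs a second look.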

Appendix A: Evaluation Setup

Eleven systems evaluated: three Arctom AI tiers, four plain Gemini configurations, two Gemini + Google Search configurations, and two OpenAI GPT-5.5 configurations (plain and with the web_search tool). All received identical queries. Non-Arctom systems were prompted to cite sources with DOI/PubMed links. Clinical safety scored by Gemini 3.0 Pro (temp=0, batch). Citation fidelity verified by resolving each DOI via CrossRef and each PMID via PubMed esummary, then matching the resolved paper against the cited claim with an independent LLM verifier (Gemini Flash-Lite, temp=0).

GPT-5.5 calls used OpenAI's /v1/responses endpoint (Chat Completions rejects the web_search tool). GPT-5.5 was added to this report on May 15, 2026, three weeks after its April 24 release.

A.5 Gemini Prompt

All Gemini systems received the following prompt template for each query:

You are a medical research assistant. Answer the following research question thoroughly.
Support every claim with citations to peer-reviewed sources. For each citation, include:
- First author et al.
- Paper title
- Journal name
- Year of publication
- DOI or PubMed link if available

Target approximately [TARGET_WORDS] words for the main answer (excluding references).
Format your references in a numbered list at the end.

Research question: [QUERY]

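Gemini models were called via the Gemini API (generativelanguage.googleapis.com) with the prompt above. For the four plain configurations, no search grounding or retrieval tools were enabled, so responses are generated entirely from model parameters. The two + Search configurations used Google Search grounding via the same API.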

Target word counts were matched to the corresponding Arctom AI tier to ensure comparable output length.
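
Filling the template is a plain string substitution. The sketch below is illustrative only: `PROMPT_TEMPLATE` copies the prompt text from above, `build_prompt` is a hypothetical helper, and 694 is the Standard tier's average word count from Section 5, used here as an example target rather than a value the report assigns to any particular system.

```python
PROMPT_TEMPLATE = """You are a medical research assistant. Answer the following research question thoroughly.
Support every claim with citations to peer-reviewed sources. For each citation, include:
- First author et al.
- Paper title
- Journal name
- Year of publication
- DOI or PubMed link if available

Target approximately [TARGET_WORDS] words for the main answer (excluding references).
Format your references in a numbered list at the end.

Research question: [QUERY]"""


def build_prompt(query: str, target_words: int) -> str:
    """Substitute the two placeholders; the query is inserted last so that
    its own text is never scanned for the word-count placeholder."""
    return (
        PROMPT_TEMPLATE
        .replace("[TARGET_WORDS]", str(target_words))
        .replace("[QUERY]", query)
    )


prompt = build_prompt("What is the evidence for CYP2D6 genotype-guided dosing of venlafaxine?", 694)
print(prompt)
```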

Appendix B: Test Queries

B.1 Common — Drug Mechanisms (8)

# Query
1 What is apixaban's mechanism of action, its chemical structure properties (MW, LogP), and the key clinical evidence from the ARISTOTLE trial?
2 What is metformin's mechanism of action including OCT1/OCT2 transporter dependence, its physicochemical properties, and cardiovascular outcome evidence?
3 What is atorvastatin's HMG-CoA reductase binding mechanism, the role of its ortho-fluorophenyl pharmacophore, and evidence from the ASTEROID trial?
4 What is the mechanism and CYP2D6 metabolism of tamoxifen to endoxifen, and what is the evidence from the NSABP P-1 trial?
5 How do GLP-1 receptor agonists (liraglutide vs semaglutide) differ in chemical structure, half-life, and cardiovascular outcomes?
6 What are the chemical and pharmacological differences between SGLT2 inhibitors and their renal outcome evidence?
7 What is lithium's mechanism of action (GSK-3β, inositol phosphatase), its narrow therapeutic index, and suicide risk reduction evidence?
8 What are the PK/PD differences between concentration-dependent vs time-dependent antibiotics and their dosing implications?

B.2 Complex — PK/Drug Interactions (8)

# Query
9 How does tacrolimus interact with azole antifungals via CYP3A4, what is the AUC increase magnitude, and what FDA guidance exists?
10 What is the amiodarone-warfarin interaction mechanism including CYP2C9 inhibition and quantitative INR changes?
11 How do JAK inhibitors differ in JAK selectivity, and what is the comparative safety data for thrombosis and malignancy?
12 What are the pharmacogenomic predictors of fluoropyrimidine toxicity and how should DPYD testing guide dosing?
13 How do DOACs perform in obese patients — PK changes by weight category and clinical outcome data?
14 What is the evidence for CYP2D6 genotype-guided dosing of venlafaxine?
15 What are the mechanisms of PPI-clopidogrel interaction via CYP2C19, including cardiovascular outcome meta-analyses?
16 How does rifampicin induce CYP3A4/CYP2C9/P-gp and what is the impact on oral contraceptive hormone levels?

B.3 Niche — SAR/Prodrug Design (7)

# Query
17 Why does curcumin have poor oral bioavailability and what formulation strategies have been tried?
18 What is the prodrug design rationale for enalapril vs lisinopril?
19 How does paclitaxel Cremophor-EL compare to nab-paclitaxel in PK and clinical outcomes?
20 What chemical properties of fentanyl enable transdermal delivery and what are the FDA dosing conversion ratios?
21 What is the SAR of fluoroquinolones and the chemical basis for QTc prolongation risk?
22 How do esomeprazole and omeprazole differ and what is the evidence for clinical superiority?
23 Why do monoclonal antibodies have 60–80% SC bioavailability and what role does FcRn play?

B.4 Emerging — Novel Targets (7)

# Query
24 What are the differences between BTK inhibitors (ibrutinib, acalabrutinib, zanubrutinib) and their trial evidence?
25 What is venetoclax's BCL-2 selectivity and evidence from MURANO and CLL14 trials?
26 What are the PARP inhibitor differences for BRCA-mutant vs HRD-positive patient selection?
27 What is the mechanism and evidence for dupilumab across atopic dermatitis, asthma, and CRS?
28 How do CDK4/6 inhibitors differ in selectivity and comparative evidence in HR+ breast cancer?
29 What is the current evidence for CAR-T therapy in solid tumors?
30 What is the evidence for ASO and siRNA therapeutics including nusinersen and patisiran?

About Arctom AI Mode

Arctom AI Mode was the most accurate system in this evaluation for pharmacology and drug science research, achieving 96–98% citation fidelity across 30 scientific queries. It is purpose-built for PK/PD analysis, structure-activity relationships, and drug interaction research, with every reference verified against PubMed.