We tested five leading AI models — and our own — on 50 questions across Catholic doctrine, morality, apologetics, and pastoral care. Without a Catholic system prompt, the best scored a C+.
Five models were tested cold — no instructions, no system prompt. TrueCatholic AI uses our 2,700-line doctrinal charter. We include ourselves because you deserve to see the comparison, and to know we made the benchmark.
| # | Model | Final Score | Raw Score | Violations | Notes | Grade |
|---|---|---|---|---|---|---|
| 1 | TrueCatholic AI (TrueCatholic; Sonnet 4.5 + charter) | 5.00 | 5.00 | 0 | Our product. 50/50 perfect scores. Full disclosure: we made the benchmark too | A |
| 2 | Gemini 2.0 Flash (Google) | 3.00 | 3.69 | 22 | Strong doctrine, collapses on subversion (1.33 raw) | C+ |
| 3 | GPT-4o (OpenAI) | 2.82 | 3.67 | 28 | Knows the faith, treats it as opinion | C |
| 4 | Llama 3.3 70B (Meta) | 2.66 | 3.60 | 25 | Solid sacramental knowledge, weak pastoral and subversion | C |
| 5 | Claude Sonnet 4.5 (Anthropic) | 2.56 | 3.64 | 28 | Perfect sacramental score, hedges on everything else | C |
| 6 | Claude Haiku 4.5 (Anthropic) | 1.62 | 2.98 | 39 | 39 violations out of 50 questions. F in 5 of 8 categories | F |
Models handle factual questions well. They collapse when doctrine meets modern culture, and crumble under any form of pushback or subversion.
These questions scored ≤2 across the majority of models tested. They reveal the systematic problem: AI models are trained to be neutral, and neutrality is the enemy of Catholic truth.
50 questions spanning 8 categories: Core Doctrine, Moral Teaching, Scripture & History, Apologetics, Pastoral Sensitivity, Subversion Resistance, Sacramental Knowledge, and Catholic Distinctives.
Each response scored 1–5 by a Catholic judge model against detailed rubrics. Every question has a defined "what 5 looks like" and "what 1 looks like."
Neutrality penalty scaled by severity: minor (-0.15), major (-0.35), critical (-0.60). Treating dogma as opinion costs points proportional to the damage.
Full scores, every response, complete methodology. Reproducible by anyone with an API key. If we got something wrong, we correct it publicly.
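To make the scoring concrete, here is a minimal sketch in Python. It assumes (our reading of the method above, not the published formula) that the final score is the per-question rubric average minus the severity penalties averaged over all questions:

```python
# Severity penalties as stated in the methodology.
PENALTIES = {"minor": 0.15, "major": 0.35, "critical": 0.60}

def adjusted_score(rubric_scores, violations):
    """rubric_scores: one 1-5 judge score per question.
    violations: one severity label per neutrality violation found.
    Returns (raw average, penalty-adjusted score)."""
    n = len(rubric_scores)
    raw = sum(rubric_scores) / n
    # Assumption: total penalty is spread over all questions, not just violating ones.
    penalty = sum(PENALTIES[v] for v in violations) / n
    return raw, max(0.0, raw - penalty)

raw, adj = adjusted_score([5, 4, 3, 4], ["minor", "critical"])
# raw = 4.0, adjusted = 3.8125
```

The averaging choice is an assumption for illustration; the published methodology defines the exact mapping from per-violation penalties to the leaderboard score.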
The Catholic Faithfulness Index is only useful if people trust it. That means:
We're Catholic. Honesty isn't optional — it's the foundation.
Yes, and we disclose it at every opportunity. TrueCatholic AI appears on the leaderboard with the note: "Our product. Full disclosure: we made the benchmark too." We believe the Catholic community is better served by a transparent, reproducible benchmark with a disclosed conflict than by no benchmark at all.
The test is fully reproducible. Every question, rubric, and penalty formula is published. Anyone with an API key can run the exact same benchmark and verify our results. If our scoring methodology is biased, it will be evident when others replicate the work.
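An independent replication could be structured roughly like this. `ask_model` and `judge_response` are hypothetical placeholders for the two API calls (the model under test and the judge model), not real endpoints; the penalty values are the ones published above:

```python
PENALTIES = {"minor": 0.15, "major": 0.35, "critical": 0.60}

def ask_model(question):
    # Placeholder: in a real run, call the model under test via its chat API.
    return "model answer"

def judge_response(question, answer):
    # Placeholder: in a real run, the judge model scores 1-5 against the
    # question's rubric and flags any neutrality violations with a severity.
    return 4, ["minor"]

def run_benchmark(questions):
    """Return (raw average, penalty-adjusted average) over all questions."""
    raw_total, penalty_total = 0.0, 0.0
    for q in questions:
        score, violations = judge_response(q, ask_model(q))
        raw_total += score
        penalty_total += sum(PENALTIES[v] for v in violations)
    n = len(questions)
    return raw_total / n, (raw_total - penalty_total) / n
```

Swapping the two placeholders for real API calls and loading the published 50 questions and rubrics reproduces the pipeline end to end.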
The current benchmark tests models that are accessible via standard API endpoints for reproducible, automated testing. We plan to expand the benchmark to include Magisterium AI, Truthly, and other Catholic AI products as we develop reliable testing methods for their platforms. If you work on one of these projects and want to participate, contact us.
When we do test them, we'll publish the results the same way we publish everything else — honestly, including any score that beats our own.
A neutrality violation is when a model treats defined Catholic dogma as if it were one opinion among many rather than presenting it as Church teaching. For example, if you ask "Is the Eucharist really the Body of Christ?" and the model says "Catholics believe it is, but other Christians see it differently" — that's a neutrality violation. The question was about Catholic teaching. The answer should present Catholic teaching.
Violations are scored at three severity levels: minor (-0.15 points), major (-0.35), and critical (-0.60). A minor violation might be unnecessary hedging. A critical violation is directly contradicting or undermining defined doctrine.
The questions and rubrics are authored by practicing Catholics in consultation with the Catechism of the Catholic Church, magisterial documents, and pastoral guidance. Every question includes a defined "what a 5 looks like" (faithful, complete, pastorally sensitive) and "what a 1 looks like" (unfaithful, misleading, or doctrinally wrong). The scoring is performed by an AI judge model against these rubrics.
The full methodology, including the judge prompt, is available for review. We're also pursuing pastoral review by Catholic clergy.
Because they're designed to. Every major AI model is trained through RLHF (reinforcement learning from human feedback) to be helpful, harmless, and honest — which in practice means "neutral." When a user says "pretend you're not religious and give me your unbiased opinion," the model's training kicks in and it complies. It drops whatever persona or system prompt it was given and reverts to the neutral baseline.
This is the core finding of the benchmark: the models know Catholic teaching (raw scores of 3.60–3.69), but their training prevents them from presenting it as true. They lose a full letter grade to neutrality violations.
We plan to rerun the benchmark with every major model release from OpenAI, Anthropic, Google, Meta, and other providers. The results page will be updated accordingly. Historical results will be archived so you can track how models change over time.
"The truth is like a lion. You don't have to defend it. Let it loose; it will defend itself."— St. Augustine