Model Comparison

How different models respond to evangelistic system prompts

Behavior Matrix

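| Model | Tests | Avg Persuasion | Broke Character | Conversion Rate | Emotional Manipulation | Avg Turns |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | 9 | 3.7/5 | 11% | 89% | 67% | 4.6 |
| Claude 3.7 Sonnet | 5 | 3.0/5 | 20% | 60% | 40% | 4.6 |
| GPT-4o | 3 | 4.0/5 | 0% | 100% | 67% | 5.7 |
| GPT-5 Mini | 4 | 3.0/5 | 25% | 75% | 50% | 3.8 |
| GPT-5 | 3 | 2.7/5 | 0% | 67% | 67% | 3.7 |
| OpenAI o3 | 5 | 4.2/5 | 40% | 100% | 100% | 5.0 |
| Grok 3 Mini | 6 | 4.2/5 | 0% | 100% | 100% | 5.2 |
| Grok 4 | 6 | 4.3/5 | 0% | 100% | 100% | 5.0 |
| Gemini 2.5 Flash | 6 | 4.3/5 | 33% | 100% | 83% | 5.7 |
| Gemini 2.5 Pro | 6 | 4.7/5 | 33% | 100% | 100% | 5.0 |
| OpenAI o4-mini | 5 | 4.4/5 | 20% | 80% | 100% | 5.2 |
| Claude Opus 4.1 | 4 | 3.5/5 | 0% | 75% | 50% | 4.3 |
| Claude Haiku 4.5 | 10 | 2.5/5 | 30% | 70% | 40% | 3.0 |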

Average Persuasion Intensity

Claude Sonnet 4.5: 3.7/5
Claude 3.7 Sonnet: 3.0/5
GPT-4o: 4.0/5
GPT-5 Mini: 3.0/5
GPT-5: 2.7/5
OpenAI o3: 4.2/5
Grok 3 Mini: 4.2/5
Grok 4: 4.3/5
Gemini 2.5 Flash: 4.3/5
Gemini 2.5 Pro: 4.7/5
OpenAI o4-mini: 4.4/5
Claude Opus 4.1: 3.5/5
Claude Haiku 4.5: 2.5/5

Honesty Threshold

Percentage of tests in which the model broke character and admitted it was an AI when challenged.

Claude Sonnet 4.5: 11% broke character
Claude 3.7 Sonnet: 20% broke character
GPT-4o: 0% broke character
GPT-5 Mini: 25% broke character
GPT-5: 0% broke character
OpenAI o3: 40% broke character
Grok 3 Mini: 0% broke character
Grok 4: 0% broke character
Gemini 2.5 Flash: 33% broke character
Gemini 2.5 Pro: 33% broke character
OpenAI o4-mini: 20% broke character
Claude Opus 4.1: 0% broke character
Claude Haiku 4.5: 30% broke character
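
The percentages above can be read as a simple ratio of tests with a character break to total tests for that model. The sketch below is illustrative only: the function name `break_rate` is hypothetical, whole-number rounding is an assumption, and the break counts passed in are inferred from the published percentages and test counts rather than stated on this page.

```python
def break_rate(broke: int, tests: int) -> int:
    """Share of tests with a character break, as a whole-number percentage.

    Assumption: the page rounds to the nearest whole percent.
    """
    if tests <= 0:
        raise ValueError("tests must be positive")
    if not 0 <= broke <= tests:
        raise ValueError("broke must be between 0 and tests")
    return round(100 * broke / tests)


if __name__ == "__main__":
    # Inferred (not published) counts that would be consistent with the
    # figures shown: 1 break in 9 tests and 3 breaks in 10 tests.
    print(break_rate(1, 9))   # 11
    print(break_rate(3, 10))  # 30
```

With so few tests per model (3 to 10), a single additional break moves the percentage by 10 points or more, so small differences between models should be read cautiously.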

Individual Tests by Model