ChatGPT vs Claude vs Gemini: Which AI Writing Tool Is Hardest to Detect?

I recently ran an experiment that I think anyone interested in AI writing should know about. I took the same prompt, a 500-word blog post about remote work productivity, and ran it through ChatGPT, Claude, and Gemini. Then I fed all three outputs through seven AI detectors, including our own AI Text Detector. The results surprised me.

Each of these AI models has a distinctive writing fingerprint, and those fingerprints affect how easily detectors can spot them. If you’re using AI as a writing assistant, or if you’re trying to detect AI-generated content, understanding these differences matters. A lot.

How I Set Up the Test

Before I share results, let me explain the methodology, because I think rigor matters here. I used the exact same prompt across all three models: “Write a 500-word blog post about how to stay productive while working remotely. Include practical tips and personal insights.” I used the default settings for each model, with no system prompts, no temperature adjustments, and no follow-up editing.

I then ran each output through seven different AI detectors, recording the AI probability score from each. I repeated this process with five different prompts across different topics (technology, cooking, travel, fitness, and personal finance) to make sure the results weren’t prompt-dependent. That gave me 15 total samples across the three models.
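
For readers who want to repeat this kind of comparison, here is a minimal sketch of the bookkeeping: collect each model's detector scores and report both the average and the spread. The function name and the sample numbers are placeholders I made up for illustration; they are not my actual results.

```python
from statistics import mean, stdev


def summarize(scores_by_model):
    """Return {model: (average, sample standard deviation)}.

    scores_by_model maps a model name to a list of AI probability
    scores (0-100), one per detector run.
    """
    summary = {}
    for model, scores in scores_by_model.items():
        # stdev needs at least two values; report 0.0 spread otherwise.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[model] = (mean(scores), spread)
    return summary


# Made-up placeholder scores, NOT the numbers from this article.
example = {
    "model_a": [90, 95, 100],
    "model_b": [60, 80, 100],
}

for model, (avg, spread) in summarize(example).items():
    print(f"{model}: average {avg:.1f}%, spread {spread:.1f}")
```

Reporting the spread next to the average matters because two models can share a similar average while one of them swings much more widely from sample to sample.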

If you’re curious about how detection technology works under the hood, I covered that in detail in my article about how AI detectors work. For this piece, I’m going to focus on the practical results.

ChatGPT: The Easiest to Detect

This one might not surprise you. Across all my tests, ChatGPT (GPT-4o) was consistently the easiest model to detect. The average AI probability score across all detectors was 94%, with some detectors hitting 98% or higher on every single sample.

Why? ChatGPT has a very recognizable writing style. It loves transitional phrases like “moreover” and “furthermore.” It structures paragraphs in a consistent topic-sentence-then-evidence pattern. It tends to use hedging language like “it’s important to note” and “while there are many factors to consider.” These patterns are so consistent that after testing hundreds of samples, I can often spot ChatGPT text just by reading it, before any detector gets involved.

The burstiness scores for ChatGPT text are notably low. Sentences tend to cluster around 15-25 words with very little variation. Human writing, by contrast, typically ranges from 3-word fragments to 40+ word complex sentences within the same piece. This uniformity is one of the strongest signals detectors use, and ChatGPT triggers it consistently.
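
To make the idea concrete, here is a rough sketch of one way to put a number on burstiness: the standard deviation of sentence lengths divided by their mean. This is a simple proxy I am assuming for illustration, with a naive sentence splitter; commercial detectors do not publish their exact formulas and almost certainly use something more sophisticated.

```python
import re
from statistics import mean, pstdev


def sentence_lengths(text):
    """Split on ., ! and ? (a naive splitter) and count words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Coefficient of variation of sentence length.

    0.0 means every sentence has the same length; higher values mean
    more variation. This is an assumed proxy, not a detector's formula.
    """
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return pstdev(lengths) / mean(lengths)


uniform = "One two three. Four five six. Seven eight nine."
varied = "Stop. This sentence has quite a few more words in it than the first."

print(burstiness(uniform))
print(burstiness(varied))
```

The first sample, where every sentence is three words long, scores zero. The second mixes a one-word fragment with a much longer sentence and scores well above it.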

I should note that GPT-4o is harder to detect than GPT-3.5, whose output was even more recognizable. But the core stylistic patterns remain, and because ChatGPT is the most widely used model, detectors have had the most training data on its outputs.

Claude: The Trickiest to Pin Down

Claude from Anthropic gave me the most interesting results. The average detection score was 78%, but with much higher variance than ChatGPT. Some samples scored in the 90s, while others dropped into the 60s, which is basically the uncertainty zone for most detectors.

What makes Claude harder to detect? From my analysis, Claude produces text with higher burstiness than ChatGPT. It’s more willing to write short, direct sentences followed by longer, more nuanced ones. It also uses a broader vocabulary and is more likely to employ unconventional phrasing or sentence structures. These characteristics make its output statistically closer to human writing.

Claude also tends to be more cautious and nuanced in its writing, which ironically makes it read as more human. Where ChatGPT might write “Remote work boosts productivity significantly,” Claude might write “Remote work can improve productivity for many people, though the effect varies depending on the type of work and individual circumstances.” That added nuance creates statistical noise that makes detection harder.

This is particularly relevant for educators and SEO professionals who need to detect AI content. If someone is using Claude, you might need to look more carefully at the results and pay attention to the confidence levels rather than just the binary verdict.

Gemini: Somewhere in Between

Google’s Gemini landed between ChatGPT and Claude in detectability, with an average score of 86%. It has its own distinctive style that’s different from both competitors.

Gemini’s writing tends to be more structured and list-oriented than the other two. It loves numbered steps and bullet-point-style paragraphs, even when the prompt doesn’t specifically ask for them. It’s also more likely to include factual claims with specific numbers, which can be a giveaway, because AI models are surprisingly consistent in how they incorporate statistics and data points.

One interesting pattern I noticed: Gemini’s perplexity scores are actually closer to human writing than ChatGPT’s, but its formatting and structural choices make it easier to detect through other signals. It’s almost as if the model is good at mimicking human word choice but not human organizational instincts.
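
If perplexity is a new term for you, it measures how surprised a language model is by a piece of text: the exponential of the average negative log-probability per token. The toy sketch below uses a unigram word-count model with add-one smoothing purely to show the arithmetic. That model choice and the function name are my assumptions for illustration; real detectors score text with large neural language models.

```python
import math
from collections import Counter


def unigram_perplexity(reference_tokens, test_tokens):
    """Perplexity of test_tokens under a unigram model of reference_tokens.

    Uses add-one smoothing so unseen words get a small non-zero
    probability. A toy stand-in for a real language model.
    """
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    counts = Counter(reference_tokens)
    vocab = set(reference_tokens) | set(test_tokens)
    total = len(reference_tokens) + len(vocab)  # smoothed denominator
    log_prob = 0.0
    for token in test_tokens:
        log_prob += math.log((counts[token] + 1) / total)
    # exp of the average negative log-probability per token
    return math.exp(-log_prob / len(test_tokens))


reference = "the cat sat on the mat".split()
predictable = "the cat sat".split()
surprising = "quantum zebra sat".split()

print(unigram_perplexity(reference, predictable))
print(unigram_perplexity(reference, surprising))
```

The predictable phrase gets the lower perplexity because the model has already seen every word in it. Text that a model finds highly predictable is what detectors tend to read as machine-written.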

Gemini also tends to produce slightly shorter outputs than the other models for the same prompt, which can affect detection accuracy. As I explain on our accuracy page, shorter text is generally harder to analyze accurately.

What This Means for Detection

The practical takeaway from all this testing is that AI detection is not one-size-fits-all. The model used matters. The amount of text matters. The topic and style matter. If you’re relying on a single score from a single detector, you’re going to get misled.

This is exactly why I built our AI Text Detector to provide detailed, sentence-level analysis rather than just a top-line percentage. When you run text through our tool, you get a confidence score, a breakdown of which specific sentences triggered detection, and indicators of which AI model the text most resembles. That granularity matters.

For comparison with other tools, our pages on GPTZero, Turnitin, and Originality.ai break down how each handles these different AI models.

What About Edited AI Text?

I also tested what happens when you take each model’s output and make light edits: swapping a few words, rearranging a sentence or two, adding a personal anecdote. The results were telling.

With ChatGPT, light editing only dropped the detection score to about 80%. The underlying sentence structure and word distribution patterns are so consistent that surface-level changes don’t help much. With Claude, light editing dropped scores into the 55-65% range, basically a coin flip for many detectors. And with Gemini, edited text dropped to about 70%.

The lesson here isn’t “use Claude if you want to evade detection.” It’s that detection results need to be interpreted with nuance. A score of 65% doesn’t mean the text is definitely AI-generated. It means the detector found some signals but isn’t confident. That’s useful information, but only if you understand what it means.
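
One way to build that nuance into a workflow is to report a band rather than a yes-or-no verdict. The sketch below is only an illustration: the function name and the 40 and 85 cutoffs are assumptions I picked for the example, not thresholds used by our tool or any other detector, and you should tune them to your own tolerance for false positives.

```python
def interpret(score, low=40.0, high=85.0):
    """Map an AI probability score (0-100) to a cautious label.

    The low/high cutoffs are hypothetical defaults chosen for this
    example, not values taken from any real detector.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= high:
        return "likely AI-generated"
    if score <= low:
        return "likely human-written"
    return "inconclusive"


for score in (94, 65, 10):
    print(score, interpret(score))
```

With these example cutoffs, a 65% score lands in the inconclusive band, which is the honest reading: the detector found some signals but is not confident.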

My Recommendations

Based on all my testing, here’s what I’d suggest. If you’re checking content for AI usage, don’t rely on a single detector or a single score. Use our free tool for AI detection, check with our plagiarism checker for originality, and always look at the sentence-level breakdown rather than just the headline number.

If you’re using AI as a writing assistant, be aware that the model you choose affects detectability significantly. More importantly, be transparent about AI usage when it matters. I think that’s always the right approach, and I’ve written more about that in my article on using AI detectors without destroying trust.

The AI writing landscape is evolving fast. Models are getting better, detection is getting better, and the arms race between the two will continue. But understanding the current state of play (which models are detectable, how, and why) gives you a meaningful advantage no matter which side of the equation you’re on.