By Dickey Singh, CEO and Founder, Cast.app
Like many of us now, I use LLMs hundreds of times a day: to think, write, analyze, challenge assumptions, get to stronger drafts faster, and get work done.
Over time, that level of use taught me something simple: important questions often deserve a second opinion.
And as I started comparing answers across models more often, it became clear this was not just a personal quirk. It was the early shape of a broader pattern. As you'll see, I'm not alone in using multiple LLMs this way.
An LLM's answer is often useful. That is why LLMs have become part of how so many of us work. They help us move faster. They help us break through blank pages. They help us test ideas, sharpen language, and get to a first draft far more quickly than we could on our own.
But somewhere along the way, with access to multiple LLMs, I stopped treating the first LLM's answer as enough. Not because the models were bad. Often they were very good.
The issue was subtler than that. A strong answer could still feel a little too neat. A little too confident. A little too incomplete. It could sound persuasive while still leaving out something that mattered.
So I started doing what many serious users now do almost automatically.
I asked another LLM. Then another.
I'd start with ChatGPT, then ask Gemini, and then Claude or Grok.
That habit did not come from theory. It came from experience.
At first, I assumed the differences would mostly be stylistic. One model might sound more polished. Another might be more concise. A third might structure the answer better. I even thought follow-up prompting would get me to a common answer.
But the differences were often much deeper than style.
The same question could produce different assumptions across ChatGPT, Gemini, and Claude. Different caveats. Different omissions. Different confidence levels. Sometimes even different recommendations, and at times outright contradictions.
That was the surprise.
It was not just that the models wrote differently. It was that they often reasoned differently enough for the differences to matter.
Once you notice that, it changes how you use them. The first answer stops feeling like the answer. It starts feeling like one answer.
This is where people often oversimplify.
They talk about LLMs as if they are interchangeable. They are not.
Strong LLMs can disagree for many reasons. They are trained differently. They are post-trained differently. They are tuned with different goals around helpfulness, safety, confidence, refusal, and completion.
Then there is the product layer.
Most users are not interacting with a raw model. They are interacting with a model inside a product, with a system prompt, search, retrieval, tool use, formatting rules, memory decisions, and UX choices wrapped around it. That product layer matters far more than most people realize.
One model may infer aggressively. Another may hedge. One may fill in gaps quickly. Another may stay narrow. One may optimize for usefulness. Another may optimize for caution. One may sound more decisive. Another may feel more disciplined.
So when strong LLMs disagree, it is usually not random.
It is often the visible result of different training, different product design, and different assumptions about how an answer should be formed.
That is why “Which model is best?” is often the wrong question.
A better question is: which model is better suited for this question, in this context, with this level of risk?
That was the deeper realization for me. I was not querying some generic thing called AI. I was querying a specific reasoning stack with specific defaults, strengths, weaknesses, and blind spots. That changes the way you use these systems.
A first answer stops feeling like truth delivered from above. It starts feeling like a draft of judgment from one system—shaped by a specific model, version, product approach, assumptions, and the human judgment behind it.
That is a healthy shift.
Not every question needs a second opinion. If I am brainstorming names, rewriting a sentence, or cleaning up a paragraph, the first answer is often enough.
But serious context is different. When the stakes rise, second opinions stop feeling excessive and start feeling responsible.
That is true in medicine. It is true in law. It is true in business. And it is increasingly true with LLMs.
Once I noticed this, cross-checking stopped being occasional.
It became part of serious work.
Ask one model. Read the answer. Ask another. Compare. Look for overlap. Look for contradiction. Look for missing assumptions. Look for confidence without support. Look for the answer that is not only fluent, but durable.
I am clearly not alone in this.
A lot of serious users now do some version of the same thing. They open multiple tabs. They compare ChatGPT, Gemini, Claude, Perplexity, and others. They are not doing it because it is entertaining. They are doing it because they have learned something important.
One answer can be helpful. Multiple answers can reveal judgment.
Before it became a product feature, it was already user behavior.
Many serious users were already doing this informally: asking multiple models, comparing answers, and synthesizing judgment. The pattern just needed a name.
Once enough people start doing the same thing manually, the pattern becomes visible. Ask multiple strong models. Compare outputs. Look for agreement. Look for disagreement. Then synthesize.
What began as habit became workflow.
Andrej Karpathy helped make that workflow more explicit with llm-council: multiple models answer, the outputs are compared, and a chairman model synthesizes the result. What many users were already doing informally, he made easier to see as a structured pattern.
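To make the shape concrete, here is a minimal sketch of that council pattern in Python. It is not Karpathy's implementation; the `ask` helper is a placeholder for whatever provider clients you use, and the model names are illustrative.

```python
# A minimal sketch of the council pattern, not the actual llm-council code.
# `ask` is a placeholder: wire it to whatever provider clients you use.

from concurrent.futures import ThreadPoolExecutor

def ask(model: str, prompt: str) -> str:
    """Placeholder for a real API call to the named model."""
    return f"[{model}] stub answer to: {prompt[:40]}..."

COUNCIL = ["gpt", "gemini", "claude"]   # illustrative member names
CHAIRMAN = "gpt"                        # the model that synthesizes

def council_answer(question: str) -> str:
    # 1. Fan the same question out to every council member in parallel.
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda m: ask(m, question), COUNCIL))

    # 2. Brief the chairman with every answer, labeled by model, and ask it
    #    to surface agreement, disagreement, and omissions before deciding.
    briefing = "\n\n".join(f"{m} answered:\n{a}" for m, a in zip(COUNCIL, answers))
    return ask(CHAIRMAN, (
        f"Question: {question}\n\n{briefing}\n\n"
        "Compare these answers: where do they agree, where do they disagree, "
        "what does each omit? Then synthesize a single best answer."
    ))
```

Fan out, compare, synthesize. The details vary by product, but those three steps are the whole pattern.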
That mattered.
Not because it created the need.
Because it gave the need a shape.
It showed that second opinions with LLMs were not just a quirky personal habit. They were the early form of a broader interaction pattern.
Once a behavior becomes common enough, products start absorbing it.
Perplexity did that on the consumer side with Model Council, a native feature that runs multiple models on the same query and synthesizes the result into a single answer.
That matters less because of Perplexity itself and more because of what it signals.
A new interaction pattern is becoming native.
We saw something similar from the inside while building Cast.
Cast had early exposure to multiple frontier LLMs while building for enterprise use cases. At first, we used multi-model parallelism for speed and resilience—whichever strong model answered first could keep the experience moving. Over time, though, the logs revealed something more important than latency: different models often gave meaningfully different answers to the same question. That was the deeper product insight. In serious context, the best answer was not always the fastest one.
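For contrast, here is roughly what that first phase looked like. This is a simplified sketch, not Cast's production code: race the same question across models and keep whichever answer lands first. The `ask_async` helper stands in for real async provider calls, and the sleeps fake latency.

```python
# A sketch of "fastest strong model wins", not Cast's actual implementation.

import asyncio

async def ask_async(model: str, prompt: str) -> str:
    # Fake per-model latency so the race is observable when run locally.
    await asyncio.sleep({"model-a": 0.3, "model-b": 0.1}.get(model, 0.2))
    return f"[{model}] answer to: {prompt}"

async def fastest_answer(question: str, models: list[str]) -> str:
    # Race every model; keep the first completed answer, cancel the rest.
    tasks = [asyncio.create_task(ask_async(m, question)) for m in models]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

print(asyncio.run(fastest_answer("Summarize this account's renewal risk.",
                                 ["model-a", "model-b"])))
```

Once you are already fanning a question out for speed, keeping the slower answers instead of discarding them is a small step. That step is what turns a latency trick into a second opinion.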
That changes the product question.
It is no longer just, “How do we make AI faster?” It becomes, “How do we help people arrive at better judgment?”
That is the more important question.
And it is a question Cast is deeply built around.
Model routing still matters for speed, cost, resilience, and flexibility. Platforms such as OpenRouter, and routing strategies more broadly, are useful for selecting a strong model path for a request. But for higher-stakes questions, the issue is not always which model to route to. Sometimes the bigger question is whether one model’s answer is enough at all. Routing helps you choose a model. Second-opinion workflows help you challenge one.
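The contrast is easy to see in code. A toy sketch, with a made-up routing rule and the same kind of placeholder `ask` helper as above:

```python
# Toy contrast between routing and second opinions. The routing rule and
# `ask` helper are invented for illustration, not any real router's logic.

def ask(model: str, prompt: str) -> str:
    return f"[{model}] stub answer"  # placeholder for a real API call

def routed_answer(question: str) -> str:
    # Routing: pick one model per request (a deliberately naive rule here),
    # then trust that single answer.
    model = "small-model" if len(question) < 80 else "large-model"
    return ask(model, question)

def second_opinions(question: str, models: list[str]) -> dict[str, str]:
    # Second opinions: ask several models on purpose and return every answer,
    # so agreement and disagreement stay visible to whoever decides.
    return {m: ask(m, question) for m in models}
```

Routing collapses the question to one call. A second-opinion workflow fans it out on purpose, so the disagreement itself becomes part of the answer.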
In consumer AI, second opinions are useful. In enterprise AI, they can be essential.
That is because enterprise questions are rarely clean. They span systems, stakeholders, policy, tone, timing, customer history, business risk, and organizational memory. The right answer is not always the shortest one. It is not always the fastest one either.
Sometimes what matters most is seeing where strong models agree.
Sometimes what matters most is seeing where they do not.
That difference matters when the output is shaping a customer conversation, an executive briefing, a success plan, a renewal discussion, a compliance narrative, an account strategy, or an internal decision with real consequences.
This is where enterprise AI has to grow up.
Fluent text is not enough. Serious teams need grounded answers, better judgment, stronger synthesis, and more confidence that the answer will hold up once it leaves the screen and enters the real world.
That is exactly why Cast matters.
Not as a wrapper. Not as another chat box. But as part of the shift toward AI that is more grounded, more enterprise-aware, and more capable of helping people make better decisions when the context actually matters.
This approach is not needed for everything.
It is most useful when the question is important enough that differences in interpretation matter.
I think this is the larger shift now underway.
The future is probably not just better single-model answers.
It is better judgment across models.
The interaction pattern is moving from:
ask once, get one answer
to:
ask broadly, compare intelligently, synthesize carefully
That is a better pattern for serious work.
And it feels natural because it mirrors how people already behave when something matters. We ask for another read. Another expert. Another opinion. Not because the first one had no value, but because the decision deserves more rigor.
LLMs will keep getting better.
But for important questions, the winning pattern may not be blind trust in one model. It may be second opinions, made native.
That is where the market is heading.
And that is part of what Cast is building for.