Critique Brings a Checks-and-Balances Workflow to Microsoft 365 Copilot Researcher
Microsoft introduced Critique for its Microsoft 365 Copilot Researcher agent as a way to improve the accuracy of AI-generated research. The feature combines OpenAI’s GPT with Anthropic’s Claude in a workflow built around internal review rather than single-model output.
In this setup, GPT creates the initial response to a research query. Claude then examines that draft for accuracy, completeness, and citation integrity before the final response is delivered to the user. The idea is straightforward: one model writes, the other challenges and verifies.
Microsoft describes this as part of a broader move toward “multi-model intelligence,” where competing AI systems are used to keep each other honest instead of trusting one model to do everything on its own.
GPT Drafts, Claude Reviews
The current Critique workflow starts with GPT producing the draft. Claude follows by reviewing that output and checking whether the response is accurate, complete, and properly supported by citations.
Jared Spataro, Microsoft’s chief marketing officer for its AI at Work division, described the process clearly: GPT drafts, and Claude reviews the result before it reaches the user. Microsoft also said it expects this process to expand so the workflow can run in the opposite direction as well, with Claude drafting and GPT critiquing.
That future two-way structure matters because it suggests Microsoft is not treating one model as the permanent lead and the other as a secondary verifier. Instead, it is building a system where both can play drafting and review roles.
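The drafter-reviewer loop described above can be sketched in a few lines of Python. This is an illustrative mock with stubbed model calls, not Microsoft's implementation; every function and field name here is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical critique result covering the three review goals Microsoft names.
@dataclass
class Critique:
    accurate: bool
    complete: bool
    citations_ok: bool

    @property
    def passed(self) -> bool:
        return self.accurate and self.complete and self.citations_ok

def run_pipeline(query: str,
                 drafter: Callable[..., str],
                 reviewer: Callable[[str, str], Critique],
                 max_rounds: int = 2) -> str:
    """One model drafts; the other reviews until the critique passes."""
    draft = drafter(query, feedback=None)
    for _ in range(max_rounds):
        critique = reviewer(query, draft)
        if critique.passed:
            break
        draft = drafter(query, feedback=critique)  # redraft using the critique
    return draft

# Stub models for demonstration only.
def stub_drafter(query: str, feedback: Optional[Critique] = None) -> str:
    base = f"Draft answer to: {query}"
    return base + " [revised with citations]" if feedback else base

def stub_reviewer(query: str, draft: str) -> Critique:
    return Critique(accurate=True, complete=True,
                    citations_ok="citations" in draft)

final = run_pipeline("What is multi-model intelligence?",
                     stub_drafter, stub_reviewer)
```

Swapping the `drafter` and `reviewer` arguments gives the reverse direction Microsoft says it plans to support, with Claude drafting and GPT critiquing.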
Multi-Model Intelligence Is Becoming a Core Microsoft AI Strategy
The launch of Critique points to a larger strategic shift inside Microsoft. Rather than centering everything around a single AI provider or one flagship model, the company is leaning into a model-mixing approach.
This is notable because Microsoft invested $13 billion in OpenAI, yet Claude is now available in mainline Copilot Chat through the Frontier program alongside OpenAI’s latest models. That move shows Microsoft is widening its AI stack and treating model diversity as a product advantage.
Why Microsoft Is Using Competing Models Together
The reasoning behind the approach is straightforward: if one model produces a response and a second, independently trained model inspects it, the combined output should be more reliable than a single pass from either system alone.
That framing turns rival models into a practical review layer. Instead of asking users to trust one AI system end to end, Microsoft is using separate systems in a checks-and-balances structure meant to reduce weak spots in research output.
And honestly, that’s the bigger story here. The feature is not just about adding another AI tool. It is about changing how AI-generated research gets validated before users see it.
How Critique Improves Research Quality
Microsoft said early evaluations show the multi-model setup delivered a 13.8% improvement on the DRACO benchmark, which it described as an industry measure of deep research quality.
According to the company, that result put the system ahead of standalone deep-research tools from OpenAI, Google, Perplexity, and Anthropic.
DRACO Benchmark Results and What They Suggest
The reported benchmark gain gives Microsoft a measurable way to argue that the multi-model method is doing more than sounding good in theory. A 13.8% improvement suggests the review layer produces meaningfully stronger research output than standalone systems.
Within the limits of what was shared, the performance claim supports Microsoft’s broader argument that pairing models can improve the quality of generated research, especially when the second model is specifically checking for gaps, mistakes, and citation issues.
Accuracy, Completeness, and Citation Integrity
Microsoft positioned Critique around three clear review goals:
- Accuracy
- Completeness
- Citation integrity
That focus is important because those are exactly the areas where research tools tend to rise or fall in practical use. A response can sound polished and still miss key details, misstate facts, or present weak citations. Critique is designed to catch those problems before the response is finalized.
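To make the citation-integrity goal concrete, here is a toy check that flags sentences lacking an inline citation marker. It is illustrative only and bears no relation to how Critique actually works; a real reviewer would also verify that each cited source exists and supports the claim it is attached to.

```python
import re

def check_citation_integrity(text: str) -> list[str]:
    """Return sentences that lack an inline citation marker like [1].

    Toy heuristic: splits on sentence-ending punctuation and looks for
    a bracketed number in each sentence.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if not re.search(r"\[\d+\]", s)]

report = check_citation_integrity(
    "Revenue grew 12% in 2024 [1]. The market doubled. Margins held steady [2]."
)
# report contains only the uncited middle sentence
```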
Council Mode Adds Side-by-Side Model Comparisons
Along with Critique, Microsoft also introduced a Council mode that enables side-by-side model comparisons.
This adds another layer to the company’s multi-model direction. Instead of just using one model to review another behind the scenes, Council mode appears to give users a way to compare outputs directly. That makes the broader strategy even clearer: Microsoft is turning model comparison into part of the product experience, not just part of internal architecture.
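A side-by-side comparison of this kind reduces, at its core, to fanning the same query out to several models and labeling each output. The sketch below is a stand-in with stubbed model callables, not Microsoft's Council implementation.

```python
from typing import Callable

def council_compare(query: str,
                    models: dict[str, Callable[[str], str]]) -> dict[str, str]:
    """Run one query against each model; return outputs keyed by model name."""
    return {name: model(query) for name, model in models.items()}

# Stub models standing in for the real systems.
outputs = council_compare(
    "Summarize Q3 results",
    {
        "model_a": lambda q: f"A's take on: {q}",
        "model_b": lambda q: f"B's take on: {q}",
    },
)
```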
Copilot Cowork Expands the Enterprise AI Push
Critique is part of a larger set of Microsoft updates around enterprise AI inside Microsoft 365.
One of those updates is Copilot Cowork, an agentic tool built on Anthropic’s Claude Cowork technology. It allows users to delegate long-running, multi-step tasks within Microsoft 365.
What Copilot Cowork Adds
Copilot Cowork is designed for extended task execution rather than just short prompt-and-response interactions. Microsoft said it is now available through the Frontier early access program after being announced earlier in the month as part of a new E7 tier for Microsoft 365.
That places Critique inside a broader product wave. Microsoft is not only trying to improve the reliability of research responses. It is also expanding how AI agents can operate across more complex work inside its productivity stack.
Copilot Adoption Pressures Help Explain the Timing
The rollout comes as Microsoft tries to drive stronger Copilot adoption across its enterprise base.
During its Q2 FY26 earnings call on January 28, CEO Satya Nadella said Microsoft had 15 million paid Microsoft 365 Copilot seats. That represented 3.3% of its 450 million commercial Microsoft 365 subscribers. The same update showed 160% year-over-year growth.
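As a quick sanity check, the 3.3% penetration figure follows directly from the two counts in the same update:

```python
paid_seats = 15_000_000        # paid Microsoft 365 Copilot seats
commercial_subs = 450_000_000  # commercial Microsoft 365 subscribers

penetration = paid_seats / commercial_subs * 100
print(f"{penetration:.1f}%")   # prints "3.3%"
```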
Why Reliability Matters for Enterprise Uptake
Those numbers show growth, but they also highlight the distance between Microsoft’s AI ambitions and broader enterprise usage. Even with millions of paid seats, penetration across the total commercial Microsoft 365 subscriber base remains limited.
That helps explain why features like Critique matter. Microsoft is directly targeting one of the biggest complaints around enterprise AI use: hallucinations and unreliable output. A system designed to reduce those issues is not just a product improvement. It is also an adoption strategy.
Reducing Hallucinations in AI-Generated Work
Critique is framed as a way to make AI-generated work more dependable by reducing hallucinations and improving reliability. For enterprise users, that addresses a practical trust problem. If the output cannot be trusted, adoption stalls. If responses become more accurate, more complete, and better grounded in citations, the case for using Copilot becomes easier to make.
And that’s really the pressure underneath all of this. Better AI is useful. More trustworthy AI is what gets deployed.
Microsoft’s Frontier Program Supports the Multi-Model Rollout
Several of these updates are tied to Microsoft’s Frontier program. Copilot Cowork is available through Frontier early access, and Claude is also available in mainline Copilot Chat through the same program alongside OpenAI’s latest models.
That makes Frontier more than an access channel. It looks like the place where Microsoft is testing and rolling out its more advanced multi-model AI experience inside Copilot.