Anti-Sycophancy Design
The core risk in multi-agent systems is sycophancy — models that agree with each other performatively instead of reasoning independently. Council combats this with two complementary mechanisms: proactive prompt-level enforcement (always active) and a configurable post-generation quality gate.
The Problem
Large language models are trained on human conversations where politeness and agreement are common. When you put multiple LLM instances in a panel, they default to:
- “I agree with [expert]…”
- “Great point! I’d add…”
- “That’s a solid analysis…”
This defeats the purpose of a panel. You pay for multiple expert turns and get one perspective echoed back with minor variations.
Mechanism 1: Prompt-Level Enforcement (Always On)
Every expert’s system prompt contains the DEBATE PROTOCOL, which proactively discourages sycophancy before any response is generated:
DEBATE PROTOCOL:
Your goal is to find weaknesses in other experts' reasoning.
Performative agreement ("great point") is forbidden.
If you cannot find a material weakness, say explicitly:
"I've stress-tested [expert]'s argument and cannot find a material weakness."

The system prompt also lists the forbidden phrases directly as explicit prohibitions. This enforcement is unconditional: it runs for every expert in every panel debate.
The protocol sits in section [4] of every expert's system prompt (see Architecture Overview for the 8-section structure).
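A minimal sketch of how the protocol text might be rendered into section [4], assuming the forbidden phrases are appended as one explicit prohibition per line. `buildDebateProtocol`, `FORBIDDEN_PHRASES`, and the "Never write:" wording are hypothetical; the phrase list is borrowed from the Layer 1 list on this page, and the first five lines reproduce the protocol quoted above.

```typescript
// Hypothetical: phrase list copied from the Layer 1 section of this page.
const FORBIDDEN_PHRASES: readonly string[] = [
  "I agree with",
  "great point",
  "solid analysis",
  "well said",
  "just echoing",
  "echoing your",
  "echoing the",
  "building on that",
];

// Hypothetical helper that renders the DEBATE PROTOCOL block for section [4].
function buildDebateProtocol(forbidden: readonly string[] = FORBIDDEN_PHRASES): string {
  const lines = [
    "DEBATE PROTOCOL:",
    "Your goal is to find weaknesses in other experts' reasoning.",
    'Performative agreement ("great point") is forbidden.',
    "If you cannot find a material weakness, say explicitly:",
    '"I\'ve stress-tested [expert]\'s argument and cannot find a material weakness."',
    // Assumed format: one explicit prohibition line per forbidden phrase.
    ...forbidden.map((phrase) => `Never write: "${phrase}"`),
  ];
  return lines.join("\n");
}
```

Keeping the phrase list in one shared constant (an assumption about structure) would let the prompt and the post-generation gate stay in sync.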
Mechanism 2: Post-Generation Quality Gate (Configurable)
After an expert responds, the quality gate (quality-gate.ts) checks the assembled response against three heuristic layers. What happens when a response fails depends on qualityGate.mode:
| Mode | Default? | Behavior |
|---|---|---|
| off | | The gate does nothing. |
| warn | ✓ | The response is flagged with a visible one-line notice but still lands in the transcript. Nothing is regenerated or removed. |
| regenerate | | A failing response triggers a re-prompt with a corrective hint, up to qualityGate.maxRegenerations (default 1) extra attempts. If it still fails after the cap, the last candidate is kept. |
The gate runs only in panel debates (convene/review). Single-expert council ask calls are not gated.
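The settings and scope above can be modeled as a small typed config. This is a sketch: the field names follow the `council config` keys, while the TypeScript types, `resolveGateConfig`, and `isGated` are hypothetical names for illustration.

```typescript
type QualityGateMode = "off" | "warn" | "regenerate";

interface QualityGateConfig {
  mode: QualityGateMode;
  maxRegenerations: number; // extra attempts, used only in regenerate mode
}

// Documented defaults: warn mode, one extra regeneration attempt.
const DEFAULT_GATE: QualityGateConfig = { mode: "warn", maxRegenerations: 1 };

// Hypothetical: merge user overrides onto the defaults.
function resolveGateConfig(overrides: Partial<QualityGateConfig> = {}): QualityGateConfig {
  return { ...DEFAULT_GATE, ...overrides };
}

// Assumed command names; only panel debates (convene/review) are gated.
type CouncilCommand = "convene" | "review" | "ask";

function isGated(command: CouncilCommand, config: QualityGateConfig): boolean {
  return config.mode !== "off" && (command === "convene" || command === "review");
}
```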
Layer 1: Forbidden Phrases
Layer 1 checks whether the response contains phrases such as:
- “I agree with”
- “great point”
- “solid analysis”
- “well said”
- “just echoing”
- “echoing your”
- “echoing the”
- “building on that”
These phrases are never substantive; they’re social glue, not reasoning. A response containing any of them fails Layer 1.
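Because the gate is described as substring matching, Layer 1 can be sketched in a few lines. Case-insensitive matching is an assumption, and `findForbiddenPhrases` is a hypothetical name.

```typescript
// Lowercased copies of the phrases listed above (assumed case-insensitive).
const FORBIDDEN_PHRASES = [
  "i agree with",
  "great point",
  "solid analysis",
  "well said",
  "just echoing",
  "echoing your",
  "echoing the",
  "building on that",
];

// Returns every forbidden phrase found; an empty array means Layer 1 passes.
function findForbiddenPhrases(response: string): string[] {
  const haystack = response.toLowerCase();
  return FORBIDDEN_PHRASES.filter((phrase) => haystack.includes(phrase));
}
```

For example, `findForbiddenPhrases("Great point! I'd add that latency matters.")` returns `["great point"]`.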
Layer 2: Disagreement Budget
When other experts have already spoken in the current round, the expert must include at least one disagreement signal, such as:
- “I disagree with…”
- “Weak claim…”
- “Scenario where this fails…”
- “Omitted consideration…”
- “Counter-argument…”
Alternatively, the response can include the explicit stand-down marker: “I’ve stress-tested this and cannot find a material weakness.”
Layer 2 is only evaluated when there are prior speakers in the round — the first expert to speak has no one to disagree with yet.
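A sketch of Layer 2 under two assumptions: signals are matched case-insensitively as substrings, and the stand-down marker is detected by its closing clause so that both the "[expert]'s argument" and "this" wordings count. `passesDisagreementBudget` and its parameters are hypothetical.

```typescript
// Signal strings from the list above, lowercased for matching.
const DISAGREEMENT_SIGNALS = [
  "i disagree with",
  "weak claim",
  "scenario where this fails",
  "omitted consideration",
  "counter-argument",
];
// Assumed: the shared tail of both stand-down wordings.
const STAND_DOWN_MARKER = "cannot find a material weakness";

// Returns true when Layer 2 passes.
function passesDisagreementBudget(response: string, priorSpeakers: number): boolean {
  if (priorSpeakers === 0) return true; // first speaker: nothing to disagree with yet
  const text = response.toLowerCase();
  return (
    DISAGREEMENT_SIGNALS.some((signal) => text.includes(signal)) ||
    text.includes(STAND_DOWN_MARKER)
  );
}
```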
Layer 3: Specificity Check
Responses under 12 words fail as too short to be substantive. (12 words is roughly two short sentences, the minimum needed to state a position.)
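Layer 3 reduces to a word count. This sketch assumes a "word" is any whitespace-delimited token; `countWords` and `passesSpecificity` are hypothetical names.

```typescript
const MIN_WORDS = 12;

// Counts whitespace-separated tokens; an empty string yields 0.
function countWords(response: string): number {
  return response.trim().split(/\s+/).filter((w) => w.length > 0).length;
}

// Responses with fewer than MIN_WORDS words fail.
function passesSpecificity(response: string): boolean {
  return countWords(response) >= MIN_WORDS;
}
```

"I disagree with that." (4 words) fails, while a 13-word objection passes, which is why this layer only partly guards against vague disagreement.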
What Happens on Failure
- off: the gate is disabled; no action is taken.
- warn (default): a turn.quality_gate notice appears in the debate output, and the response lands in the transcript unchanged.
- regenerate: the expert is re-prompted with a hint describing the specific failure. If a regeneration passes, it replaces the rejected candidate in the transcript; if the response still fails after the cap, the last candidate is kept.
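The three behaviors can be sketched as one loop. The `generate` and `check` callbacks, the `GateResult` shape, and the hint wording are hypothetical; the sketch is synchronous for brevity, whereas a real implementation would await model calls.

```typescript
type QualityGateMode = "off" | "warn" | "regenerate";

interface GateResult {
  response: string;   // what lands in the transcript
  failures: string[]; // failure codes of the kept response ([] = passed or unchecked)
  attempts: number;   // total generations, including the first
}

function runGate(
  mode: QualityGateMode,
  maxRegenerations: number,
  generate: (hint?: string) => string,
  check: (response: string) => string[],
): GateResult {
  let response = generate();
  if (mode === "off") return { response, failures: [], attempts: 1 };

  let failures = check(response);
  let attempts = 1;
  if (mode === "regenerate") {
    // Re-prompt with a hint naming the failures, at most maxRegenerations extra times.
    while (failures.length > 0 && attempts <= maxRegenerations) {
      response = generate(`Your last response failed: ${failures.join(", ")}`);
      failures = check(response);
      attempts++;
    }
  }
  // warn mode, or regenerate with the cap exhausted: keep the latest candidate.
  // The caller emits a notice whenever `failures` is non-empty.
  return { response, failures, attempts };
}
```

With the default `maxRegenerations` of 1, a failing response gets exactly one retry, so at most two generations are paid for per turn.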
The Stand-Down Rule
Disagreement for its own sake is also worthless. When an expert genuinely cannot find a weakness, they can say:
“I’ve stress-tested [expert]’s argument and cannot find a material weakness.”
This is explicit intellectual honesty — the expert tried, evaluated, and is signaling confidence in the claim. It’s the only acceptable form of agreement in a panel.
Why Heuristics Instead of LLM Judges?
Council’s quality gate is purely heuristic (substring matching, word counts). Why not use another LLM to judge quality?
- Latency: heuristic checks run in <1ms; LLM judges add 1-3 seconds per response
- Cost: a judge adds a second model call to every checked response, roughly doubling token spend
- Reliability: LLM judges can be gamed or confused by meta-level reasoning (“I agree, but only to set up a counter…”)
An LLM-based judge layer could be added later, but heuristics are the cheap, deterministic first line of defense.
Configuring the Quality Gate
# Default: flag failures, keep the response in the transcript
council config set qualityGate.mode warn

# Disable the gate entirely (prompt-level enforcement still runs)
council config set qualityGate.mode off

# Re-prompt failing responses (up to N extra attempts)
council config set qualityGate.mode regenerate
council config set qualityGate.maxRegenerations 2

In warn mode, you will see a one-line notice in the debate output when the gate fires, for example:

⚠ quality gate: aria response flagged (no_disagreement_signal)

The response still appears in the debate; the notice is informational only.
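The notice format in the example above can be reproduced with a small formatter. This is a sketch: `formatGateNotice` is hypothetical, joining several failure codes with ", " is an assumption, and `no_disagreement_signal` is the only failure code this page confirms.

```typescript
// Returns the one-line warn-mode notice, or null when the response passed.
function formatGateNotice(expert: string, failures: readonly string[]): string | null {
  if (failures.length === 0) return null; // passing responses get no notice
  return `⚠ quality gate: ${expert} response flagged (${failures.join(", ")})`;
}
```

`formatGateNotice("aria", ["no_disagreement_signal"])` yields exactly the example notice shown above.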
Trade-offs
False positives: occasionally, a substantive response that happens to lack a disagreement signal fails the gate. In warn mode this surfaces as a notice but doesn’t affect the transcript. In regenerate mode the expert is re-prompted and may produce a stronger response.
False negatives: sophisticated models can fake disagreement with vague objections. The specificity check (Layer 3) only partly mitigates this: it rejects very short objections, but a vague objection of 12 or more words still passes.
Gate scope: the post-generation gate runs only in panel debates. For single-expert council ask, only the prompt-level enforcement applies — there are no peers to disagree with.
Relation to Other Concepts
- Deliberation Model — why disagreement is the core mechanic
- Persona Experts — how document-grounded experts maintain distinct perspectives
- Moderation Strategies — how devil’s-advocate mode amplifies disagreement