Anti-Sycophancy Design

The core risk in multi-agent systems is sycophancy — models that agree with each other performatively instead of reasoning independently. Council combats this with two complementary mechanisms: proactive prompt-level enforcement (always active) and a configurable post-generation quality gate.

Large language models are trained on human conversations where politeness and agreement are common. When you put multiple LLM instances in a panel, they default to:

  • “I agree with [expert]…”
  • “Great point! I’d add…”
  • “That’s a solid analysis…”

This defeats the purpose of a panel. You pay for multiple expert turns and get one perspective echoed back with minor variations.

Mechanism 1: Prompt-Level Enforcement (Always On)

Every expert’s system prompt contains the DEBATE PROTOCOL, which proactively discourages sycophancy before any response is generated:

DEBATE PROTOCOL:
Your goal is to find weaknesses in other experts' reasoning.
Performative agreement ("great point") is forbidden.
If you cannot find a material weakness, say explicitly:
"I've stress-tested [expert]'s argument and cannot find a material weakness."

The system prompt also lists the forbidden phrases directly as explicit prohibitions. This enforcement is unconditional — it runs for every expert in every panel debate.

Every expert receives this instruction in section [4] of their system prompt (see Architecture Overview for the 8-section structure).
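As a rough illustration of how the protocol text and the explicit prohibitions could be rendered into one prompt section, here is a minimal sketch. The function name `buildDebateProtocol` and the "Do not use these phrases:" wording are assumptions made for this example; the actual prompt builder is not shown on this page.

```typescript
// Hypothetical helper: renders the DEBATE PROTOCOL text followed by an
// explicit list of prohibited phrases. Not Council's actual prompt builder.
function buildDebateProtocol(forbidden: readonly string[]): string {
  const lines: string[] = [
    "DEBATE PROTOCOL:",
    "Your goal is to find weaknesses in other experts' reasoning.",
    'Performative agreement ("great point") is forbidden.',
    "If you cannot find a material weakness, say explicitly:",
    `"I've stress-tested [expert]'s argument and cannot find a material weakness."`,
    // Assumed wording for the explicit prohibition list.
    "Do not use these phrases:",
    ...forbidden.map((phrase) => `- "${phrase}"`),
  ];
  return lines.join("\n");
}
```

Calling `buildDebateProtocol(["I agree with", "great point"])` would yield the protocol text followed by one bullet per prohibited phrase.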

Mechanism 2: Post-Generation Quality Gate (Configurable)

After an expert responds, the quality gate (quality-gate.ts) checks the assembled response against three heuristic layers. What happens when a response fails depends on qualityGate.mode:

| Mode | Default? | Behavior |
| --- | --- | --- |
| off | No | The gate does nothing. |
| warn | Yes | The response is flagged with a visible one-line notice but still lands in the transcript. Nothing is regenerated or removed. |
| regenerate | No | A failing response triggers a re-prompt with a corrective hint, up to qualityGate.maxRegenerations (default 1) extra attempts. If it still fails after the cap, the last candidate is kept. |

The gate runs only in panel debates (convene/review). Single-expert council ask calls are not gated.

Layer 1 checks whether the response contains forbidden phrases such as:

  • “I agree with”
  • “great point”
  • “solid analysis”
  • “well said”
  • “just echoing”
  • “echoing your”
  • “echoing the”
  • “building on that”

These phrases are never substantive — they’re social glue, not reasoning. A response containing them fails Layer 1.
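The Layer 1 check can be sketched as a plain substring scan. This is an illustrative sketch, not the contents of quality-gate.ts: the name `findForbiddenPhrases` is hypothetical, and case-insensitive matching is an assumption.

```typescript
// Phrase list taken from the bullets above, stored lowercase.
const LAYER1_FORBIDDEN: readonly string[] = [
  "i agree with",
  "great point",
  "solid analysis",
  "well said",
  "just echoing",
  "echoing your",
  "echoing the",
  "building on that",
];

// Returns every forbidden phrase found in the response.
// An empty array means the response passes Layer 1.
// Assumption: matching ignores case.
function findForbiddenPhrases(response: string): string[] {
  const haystack = response.toLowerCase();
  return LAYER1_FORBIDDEN.filter((phrase) => haystack.includes(phrase));
}
```

Because the check is a substring match, a phrase is caught anywhere in the response, including inside a longer sentence.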

Layer 2 requires a disagreement signal: when prior speakers have already spoken in the current round, the expert must include at least one of the following:

  • “I disagree with…”
  • “Weak claim…”
  • “Scenario where this fails…”
  • “Omitted consideration…”
  • “Counter-argument…”

Or the explicit stand-down marker: “I’ve stress-tested this and cannot find a material weakness.”

Layer 2 is only evaluated when there are prior speakers in the round — the first expert to speak has no one to disagree with yet.
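A minimal sketch of the Layer 2 logic follows. The function name `hasDisagreementSignal`, the `priorSpeakers` parameter, and matching the stand-down marker by its shared suffix ("cannot find a material weakness") are assumptions for illustration; the real implementation may match differently.

```typescript
// Signals taken from the bullets above, stored lowercase.
const LAYER2_SIGNALS: readonly string[] = [
  "i disagree with",
  "weak claim",
  "scenario where this fails",
  "omitted consideration",
  "counter-argument",
];

// Assumption: the stand-down marker is recognized by its common suffix.
const STAND_DOWN_SUFFIX = "cannot find a material weakness";

// Returns true when the response passes Layer 2.
function hasDisagreementSignal(response: string, priorSpeakers: number): boolean {
  // The first speaker in a round has no one to disagree with, so Layer 2 is skipped.
  if (priorSpeakers === 0) return true;
  const text = response.toLowerCase();
  if (text.includes(STAND_DOWN_SUFFIX)) return true;
  return LAYER2_SIGNALS.some((signal) => text.includes(signal));
}
```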

Layer 3 is a minimum-length check: responses under 12 words fail as too short to be substantive. (12 words ≈ two short sentences, the minimum needed to encode a position.)
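The word-count rule could look roughly like this. Splitting on whitespace is an assumption; the page does not say how words are tokenized, and the names `countWords` and `isTooShort` are hypothetical.

```typescript
const MIN_WORDS = 12;

// Counts whitespace-separated tokens; an empty or blank string has zero words.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Returns true when the response has fewer than 12 words.
function isTooShort(response: string): boolean {
  return countWords(response) < MIN_WORDS;
}
```

Under this sketch, a response of exactly 12 words passes, since only responses under the threshold fail.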

When a response fails the gate, the outcome depends on qualityGate.mode:

  • off: the gate is disabled; no action is taken.
  • warn (default): a turn.quality_gate notice appears in the debate output; the response lands in the transcript unchanged.
  • regenerate: the expert is re-prompted with a hint describing the specific failure. A rejected candidate does not land in the transcript if a passing regeneration is produced; if the response still fails after the cap, the last candidate is kept.
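The three modes can be sketched as a small control loop. Everything here is illustrative: `applyGate`, the `check` and `regenerate` callbacks, the hint text, and the `flagged` field are assumptions, and the sketch is synchronous for brevity where a real expert call would be asynchronous.

```typescript
type GateMode = "off" | "warn" | "regenerate";

interface GateResult {
  response: string; // the candidate that lands in the transcript
  flagged: boolean; // true if that candidate still fails the gate
  attempts: number; // regeneration attempts used
}

// `check` returns a failure reason, or null when the response passes.
// `regenerate` re-prompts the expert with a corrective hint (hypothetical).
function applyGate(
  first: string,
  mode: GateMode,
  maxRegenerations: number,
  check: (response: string) => string | null,
  regenerate: (hint: string) => string,
): GateResult {
  // off: the gate does nothing.
  if (mode === "off") return { response: first, flagged: false, attempts: 0 };

  let candidate = first;
  let reason = check(candidate);
  let attempts = 0;

  // regenerate: retry until the response passes or the cap is reached.
  // warn: skip the loop, so the original response is kept and only flagged.
  if (mode === "regenerate") {
    while (reason !== null && attempts < maxRegenerations) {
      candidate = regenerate(`Previous response failed: ${reason}`);
      attempts += 1;
      reason = check(candidate);
    }
  }

  // After the cap, the last candidate is kept even if it still fails.
  return { response: candidate, flagged: reason !== null, attempts };
}
```

Rejected candidates are simply overwritten, which mirrors the rule that they do not land in the transcript once a later attempt replaces them.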

Disagreement for its own sake is also worthless. When an expert genuinely cannot find a weakness, they can say:

“I’ve stress-tested [expert]’s argument and cannot find a material weakness.”

This is explicit intellectual honesty — the expert tried, evaluated, and is signaling confidence in the claim. It’s the only acceptable form of agreement in a panel.

Council’s quality gate is purely heuristic (substring matching, word counts). Why not use another LLM to judge quality?

  1. Latency: heuristic checks run in <1ms; LLM judges add 1-3 seconds per response
  2. Cost: every regeneration would double token spend
  3. Reliability: LLM judges can be gamed or confused by meta-level reasoning (“I agree, but only to set up a counter…”)

An LLM-based judge layer could be added later, but heuristics are the cheap, deterministic first line of defense.

# Default: flag failures, keep the response in the transcript
council config set qualityGate.mode warn
# Disable the gate entirely (prompt-level enforcement still runs)
council config set qualityGate.mode off
# Re-prompt failing responses (up to N extra attempts)
council config set qualityGate.mode regenerate
council config set qualityGate.maxRegenerations 2
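For readers who think in types, the two settings above could be modeled as follows. This is a sketch under assumptions: the names `QualityGateConfig` and `resolveQualityGateConfig` are hypothetical, and the validation rules (rejecting unknown modes and negative caps) are illustrative rather than documented behavior. Only the defaults (warn, 1) come from this page.

```typescript
type QualityGateMode = "off" | "warn" | "regenerate";

interface QualityGateConfig {
  mode: QualityGateMode;
  maxRegenerations: number;
}

// Defaults as documented on this page.
const QUALITY_GATE_DEFAULTS: QualityGateConfig = { mode: "warn", maxRegenerations: 1 };

function isQualityGateMode(value: string): value is QualityGateMode {
  return value === "off" || value === "warn" || value === "regenerate";
}

// Fills in defaults and rejects values the gate would not understand.
function resolveQualityGateConfig(
  raw: { mode?: string; maxRegenerations?: number },
): QualityGateConfig {
  const mode = raw.mode ?? QUALITY_GATE_DEFAULTS.mode;
  if (!isQualityGateMode(mode)) {
    throw new Error(`Unknown qualityGate.mode: ${mode}`);
  }
  const max = raw.maxRegenerations ?? QUALITY_GATE_DEFAULTS.maxRegenerations;
  if (!Number.isInteger(max) || max < 0) {
    throw new Error(`Invalid qualityGate.maxRegenerations: ${max}`);
  }
  return { mode, maxRegenerations: max };
}
```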

In warn mode you will see a one-line notice in the debate output when the gate fires — for example:

⚠ quality gate: aria response flagged (no_disagreement_signal)

The response still appears in the debate. The notice is informational.

False positives: occasionally, a substantive response that happens to lack a disagreement signal fails the gate. In warn mode this surfaces as a notice but doesn’t affect the transcript. In regenerate mode the expert is re-prompted and may produce a stronger response.

False negatives: sophisticated models can fake disagreement with vague objections. The Layer 3 check (minimum length) mitigates this but doesn’t eliminate it.

Gate scope: the post-generation gate runs only in panel debates. For single-expert council ask, only the prompt-level enforcement applies — there are no peers to disagree with.