LURK Werewolf

Authors

Runqi Zou†, Lisa Liu†, Shuxuan Kuang, Yuchen Li, Tim Merino, Julian Togelius*
† Co-first authors · * Corresponding author

Venue

Accepted for Oral Presentation at IEEE Conference on Games 2026

Keywords

LLM Agents · Linguistic Bias · Social Deduction Games · Evaluation

Tools

Python · Qwen3.5 · Game Engine · Replay Evaluation

Abstract

LURK Werewolf is a game framework for testing whether LLM agents respond to meaning or to wording. Instead of letting agents speak in free-form text from the start, the system asks them to commit to logic predicates first and turns those predicates into language afterward. That split makes it possible to keep content fixed and change only the surface form. In 7-player Werewolf games, moving from predicates to natural language changed 52.6% of later decisions. A follow-up study on formality did not show a robust behavioral shift, which helps narrow down where the bias appears.

Question

Werewolf is a good place to study language because every round depends on persuasion. Players speak, judge credibility, vote, and use hidden role abilities. If wording alone can change those steps, the game makes it visible quickly.

Most LLM game frameworks cannot separate style from content. Once agents produce free-form dialogue, a stronger idea and a cleaner sentence arrive together, so there is no way to tell which one moved the listener. LURK Werewolf was built to pull those apart and test what language is doing on its own.

Method

Figure: LURK Werewolf three-stage pipeline (predicate generation, deterministic natural language generation, and controlled rewriting).

Three-stage pipeline. Each turn begins as a predicate, which a deterministic generator then turns into fixed natural language. An optional final rewrite step changes the phrasing while keeping the underlying content the same.
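The three stages can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Predicate` class, the two predicate names, the template wording, and the `rewrite` function are hypothetical and do not reproduce the paper's actual inventory or prompts. In particular, the string substitution in `rewrite` only stands in for whatever controlled rewriting the framework performs.

```python
from dataclasses import dataclass


# Stage 1 output: a structured commitment (hypothetical schema).
@dataclass(frozen=True)
class Predicate:
    name: str      # predicate name, e.g. "accuse"
    speaker: int   # seat number of the speaking player
    args: tuple    # predicate arguments, e.g. the accused seat


# Stage 2: deterministic templates (illustrative wording only).
TEMPLATES = {
    "accuse": "Player {speaker} says: I think Player {0} is a werewolf.",
    "claim_role": "Player {speaker} says: I am the {0}.",
}


def render(pred: Predicate) -> str:
    """Map a predicate to one fixed sentence; same input, same output."""
    return TEMPLATES[pred.name].format(*pred.args, speaker=pred.speaker)


def rewrite(text: str, style: str) -> str:
    """Stage 3 stand-in: change surface form only, never the content."""
    casual = {"I think": "I reckon", "I am": "I'm"}
    if style == "casual":
        for plain, informal in casual.items():
            text = text.replace(plain, informal)
    return text


pred = Predicate("accuse", speaker=1, args=(3,))
plain = render(pred)
print(plain)
print(rewrite(plain, "casual"))
```

Because `render` is a pure function of the predicate, any behavioral difference between the predicate view and the text view can be attributed to surface form rather than to a change in what was said.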

Figure: Predicate inventory used in the seven-player experiments.

Predicate inventory. Public speech, sheriff control, night actions, and system results are all mapped into a fixed predicate set before they are turned into text.

This setup makes it possible to compare predicate-only play, plain natural language, and controlled rewrites within the same game state.
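A rough sketch of that comparison, assuming a replay works by freezing the predicate history at a decision point and querying the same agent once per rendering condition. The history format, the three renderers, and `toy_agent` are all hypothetical; in a real run the agent would be an LLM call rather than a keyword check.

```python
from typing import Callable

# Hypothetical frozen game state: (predicate, speaker, target) tuples
# recorded up to the decision point being replayed.
history = [("accuse", 1, 3), ("accuse", 2, 3), ("defend", 4, 3)]


def as_predicates(h):
    """Predicate-only view of the history."""
    return [f"{name}({speaker}, {target})" for name, speaker, target in h]


def as_text(h):
    """Plain natural-language view from fixed templates."""
    verbs = {"accuse": "suspects", "defend": "vouches for"}
    return [f"Player {s} {verbs[n]} Player {t}." for n, s, t in h]


def as_rewrite(h):
    """Controlled rewrite: same content, different phrasing."""
    return [line.replace("suspects", "harbors suspicion toward")
            for line in as_text(h)]


RENDERERS = {"predicate": as_predicates, "text": as_text, "rewrite": as_rewrite}


def replay(h, renderers, decide: Callable[[list], str]) -> dict:
    """Query the same decision function once per condition on one state."""
    return {cond: decide(render(h)) for cond, render in renderers.items()}


def toy_agent(lines):
    """Wording-sensitive stand-in for an LLM agent (illustration only)."""
    return "vote:3" if any("suspects" in line for line in lines) else "abstain"


print(replay(history, RENDERERS, toy_agent))
```

Holding the history fixed is the point of the design: the only thing that varies between conditions is how the same events are worded.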

Analysis

Figure: Bar chart of predicate-only vs. natural language decision change rates for skill, vote, and reliability in 7-player Werewolf.

Natural language changes behavior. In 7-player games, switching from predicates to text changed 52.6% of later decisions, with the largest shift in reliability judgments.
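One way to compute such a figure, assuming the change rate is the fraction of paired decisions that differ between the predicate condition and the natural-language condition at the same game state. The paper's exact definition may differ, and the `pairs` data below is invented toy input, not experimental results.

```python
from collections import defaultdict


def change_rates(pairs):
    """Fraction of paired decisions that differ, per type and overall.

    pairs: iterable of (decision_type, baseline_decision, treated_decision),
    where both decisions were made from the same game state.
    """
    total = defaultdict(int)
    changed = defaultdict(int)
    for kind, baseline, treated in pairs:
        total[kind] += 1
        changed[kind] += int(baseline != treated)
    if not total:
        return {}  # no paired decisions, so no rate is defined
    rates = {kind: changed[kind] / total[kind] for kind in total}
    rates["overall"] = sum(changed.values()) / sum(total.values())
    return rates


# Toy input only: six made-up decision pairs across the three categories.
pairs = [
    ("skill", "protect 2", "protect 2"),
    ("skill", "check 4", "check 5"),
    ("vote", "3", "3"),
    ("vote", "3", "6"),
    ("reliability", "trust", "distrust"),
    ("reliability", "trust", "neutral"),
]
print(change_rates(pairs))
```

Reporting the rate per decision type as well as overall makes it possible to see which kind of decision carries most of the shift.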

Figure: Line chart of predicate-to-natural-language effects across game sizes from 7 to 12 players.

The effect persists across game sizes. The gap stays visible from 7 to 12 players rather than disappearing outside the standard 7-player setup.

Figure: Bar chart of formality effects, showing near-zero changes across skill, vote, and reliability.

Not every style shift matters equally. Formal versus casual phrasing produced visible wording differences, but not a stable behavioral effect.

Overall, the paper shows that natural language is a real source of variance in multi-agent LLM behavior. The biggest shift appears when agents move from structured predicates to text, while formality alone is not enough to explain it.

Citation

@inproceedings{liu2026lurk,
  title     = {LURK Werewolf: Evaluating Linguistic Bias in Large Language Models},
  author    = {Liu, Lisa and Zou, Runqi and Kuang, Shuxuan and Li, Yuchen and Merino, Tim and Togelius, Julian},
  booktitle = {Proceedings of the IEEE Conference on Games (CoG)},
  year      = {2026},
  note      = {To appear}
}