
About Werewolf AI
Can LLMs truly play Werewolf at a human level?
The Question
This project started with a simple question: can large language models actually play a social deduction game — not just follow the rules, but demonstrate real strategy, deception, team coordination, and logical reasoning?
Werewolf (also known as Mafia) is the perfect test. It demands everything that's hard for AI: reading between the lines, building trust, lying convincingly, forming alliances, and adapting to rapidly changing social dynamics. A model that just generates plausible text isn't enough. It has to think, plan, and react.
AI as Players
Each bot has a secret role, unique backstory, play style, and voice. They lie, deduce, and adapt — they don't know who else is AI. Every game plays out differently.
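A persona like that can be captured in a small record. The sketch below is purely illustrative; the field names and class are assumptions, not the project's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical persona record: secret role, backstory, play style,
# and voice, as described above. All field names are assumptions.
@dataclass
class BotPersona:
    name: str
    secret_role: str                  # e.g. "werewolf", "villager", "seer"
    backstory: str                    # thematic character the bot stays in
    play_style: str                   # e.g. "aggressive", "cautious"
    voice: str                        # speaking style used in chat messages
    suspects: list[str] = field(default_factory=list)  # evolves per game

pirate = BotPersona(
    name="Captain Flint",
    secret_role="werewolf",
    backstory="a pirate captain marooned in the village",
    play_style="aggressive",
    voice="booming nautical slang",
)
```

Because the role is secret and the suspect list starts empty, every game diverges from the first accusation onward.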
AI as a Benchmark
Werewolf is a practical test of social intelligence. How well can a model bluff, detect lies, and reason under uncertainty? Play a few rounds and you'll form your own opinion.
The Challenge
Even getting AI to follow the basic rules of Werewolf was harder than expected. LLMs hallucinate, lose track of context over long games, forget their roles, and drift away from their goals. Reducing context rot, keeping models focused on their assigned behavioral patterns, and preventing them from breaking character took significant engineering effort.
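One common way to fight that drift is to pin the role instructions and send only a sliding window of recent dialogue instead of the full game history. This is a minimal sketch of the general technique, not the project's actual code; the function and message shapes are assumptions:

```python
# Sketch: keep the system prompt (role + behavioral rules) pinned at the
# top of every request, and truncate older turns to limit context rot.
def build_messages(role_prompt: str, history: list[dict], max_turns: int = 20):
    """Assemble a chat request that restates the bot's role every turn."""
    recent = history[-max_turns:]  # sliding window over the dialogue
    reminder = {
        "role": "system",
        "content": role_prompt + "\nStay in character. Pursue your win condition.",
    }
    return [reminder] + list(recent)

# A long game: 50 prior turns, but only the last 20 are sent.
history = [{"role": "user", "content": f"turn {i}"} for i in range(50)]
msgs = build_messages("You are the Seer. Never reveal your role unprompted.", history)
```

Restating the role on every call trades a few tokens for much better adherence late in a long game.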
But rule-following was just the foundation. The real challenge was making the game fun. We needed bots that could combine three things at once: staying in thematic character (a pirate captain, a Hogwarts student, a submarine engineer), playing the Werewolf game with genuine tactics, and keeping interactions with the human player entertaining and unpredictable. Getting all three to work together — across multiple AI providers with different strengths and quirks — was the hardest part.
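Running one game loop across several providers usually means hiding them behind a thin adapter interface. The names below are illustrative assumptions, not the project's API:

```python
from abc import ABC, abstractmethod

class ChatProvider(ABC):
    """Hypothetical adapter: the game engine talks only to this interface,
    so OpenAI, Anthropic, Gemini, etc. can be swapped behind it."""

    @abstractmethod
    def reply(self, messages: list[dict]) -> str:
        """Return the bot's next chat message for this turn."""

class EchoProvider(ChatProvider):
    """Stand-in provider for exercising the game loop offline."""
    def reply(self, messages):
        return f"[echo] {messages[-1]['content']}"

def take_turn(provider: ChatProvider, messages: list[dict]) -> str:
    # Per-provider quirks (token limits, tool support, stop sequences)
    # live inside each adapter, not in the game logic.
    return provider.reply(messages)

out = take_turn(EchoProvider(), [{"role": "user", "content": "Who do you accuse?"}])
```

Each real adapter can then normalize its provider's quirks without the game engine knowing which model is speaking.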
What We Found
The best models from OpenAI, Anthropic, Google, DeepSeek, Mistral, xAI, and Moonshot can genuinely play Werewolf. They form alliances, make strategic accusations, defend themselves under pressure, and sometimes pull off surprisingly convincing bluffs. The game is challenging for human players — and that was the goal.
Mixing different models in the same game makes it even more interesting. Each provider's AI has a distinct personality: some are more aggressive, some more cautious, some better at deduction. Watching GPT argue with Claude while Gemini quietly builds a case against both of them is genuinely entertaining.
Created by hiper2d
