Evaluating AI agents on concurrency bug fixes is uniquely challenging. Unlike typical software bugs, concurrency bugs involve non-deterministic behavior that depends on thread interleavings. A test suite alone is an insufficient oracle: a test may pass under most interleavings yet fail only under the specific, hard-to-trigger schedules in which the race actually manifests.
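To make this concrete, consider a minimal, hypothetical example (not drawn from our benchmark): a counter whose increment is a non-atomic read-modify-write. The check at the end passes on nearly every run, because the schedule that loses an update is rarely produced by the JVM scheduler.

```java
/**
 * Hypothetical example (not taken from our benchmark): a counter whose
 * increment is a non-atomic read-modify-write.
 */
public class RacyCounter {
    private int value = 0;

    // Buggy: value++ is really load, add, store -- three separate steps.
    public void increment() {
        value++;
    }

    public int get() {
        return value;
    }

    public static void main(String[] args) throws InterruptedException {
        RacyCounter counter = new RacyCounter();
        Thread t1 = new Thread(counter::increment);
        Thread t2 = new Thread(counter::increment);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Passes on almost every run: the lost update requires both threads
        // to load the old value before either one stores its result.
        if (counter.get() != 2) {
            throw new AssertionError("lost update, counter = " + counter.get());
        }
    }
}
```

In a typical CI run the assertion virtually never fires, so the buggy code looks fixed.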
This makes standard evaluation approaches unreliable. For context, SWE-Bench—the most widely used benchmark for AI coding agents—contains only one example of a race condition among its hundreds of tasks.
We propose using controlled concurrency testing to systematically explore thread interleavings, which gives a much stronger guarantee than an ordinary test run that a patch truly fixes the underlying issue rather than merely passing by chance.
Spaghetti Bench uses Fray, a controlled concurrency testing tool for the JVM, to exhaustively search over possible thread schedules. This allows us to verify that fixes are correct across all interleavings, not just the ones that happen to execute during a particular test run.
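To illustrate what searching over schedules means, the toy explorer below is a conceptual sketch only (it is not Fray's API or algorithm): it enumerates every interleaving of two increment threads, each modeled as an atomic LOAD step followed by an atomic STORE step, and reports the schedules that lose an update.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Conceptual sketch of schedule enumeration (not Fray's API): explore every
 * interleaving of two "increment" threads, each modeled as an atomic LOAD
 * step followed by an atomic STORE step on a shared counter.
 */
public class ScheduleExplorer {

    // Per-thread state: the value it loaded and which step it will run next.
    static final class ThreadState {
        int local = 0;
        int nextStep = 0; // 0 = LOAD pending, 1 = STORE pending, 2 = done
    }

    // The whole explored state: shared counter plus both threads.
    static final class World {
        int counter = 0;
        ThreadState[] threads = { new ThreadState(), new ThreadState() };

        World copy() {
            World c = new World();
            c.counter = counter;
            for (int i = 0; i < threads.length; i++) {
                c.threads[i].local = threads[i].local;
                c.threads[i].nextStep = threads[i].nextStep;
            }
            return c;
        }
    }

    // Run one atomic step of the chosen thread.
    static void step(World w, int tid) {
        ThreadState t = w.threads[tid];
        if (t.nextStep == 0) {
            t.local = w.counter;      // LOAD the shared counter
        } else {
            w.counter = t.local + 1;  // STORE the incremented local copy
        }
        t.nextStep++;
    }

    // Depth-first search over all schedules, collecting those that lose an update.
    static void explore(World w, List<Integer> schedule, List<List<Integer>> buggy) {
        boolean ranSomething = false;
        for (int tid = 0; tid < w.threads.length; tid++) {
            if (w.threads[tid].nextStep < 2) {
                ranSomething = true;
                World next = w.copy();          // branch: each choice gets its own world
                step(next, tid);
                List<Integer> extended = new ArrayList<>(schedule);
                extended.add(tid);
                explore(next, extended, buggy);
            }
        }
        if (!ranSomething && w.counter != 2) {  // both threads done, but an update was lost
            buggy.add(schedule);
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> buggy = new ArrayList<>();
        explore(new World(), new ArrayList<>(), buggy);
        // Prints the thread-id sequences that lose an update,
        // e.g. [0, 1, 0, 1]: both LOADs happen before either STORE.
        System.out.println("Buggy schedules: " + buggy);
    }
}
```

Even this two-step critical section has six interleavings, four of which lose an update; in real executions the first thread usually finishes before the second even starts, so those racy schedules are rarely exercised. A controlled scheduler visits them deterministically rather than relying on luck.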
Our dataset consists of 39 concurrency bug examples drawn from two sources: buggy programs from SCTBench, a standard concurrency bug benchmark, and real-world issues discovered by Fray in the open-source project Apache Kafka.