Spaghetti Bench: Evaluating AI Agents on Concurrency Bug Fixes

Software engineering agents are becoming increasingly prevalent in software development. As these AI-powered tools take on more complex programming tasks, it's crucial to understand their performance across a variety of challenging domains. Benchmarks like SWE-bench [1] have emerged as the standard for measuring coding agent performance on solving real-world software issues, evaluating agents on hundreds of GitHub issues from popular open-source projects.

However, concurrency—a critical aspect of modern software engineering—is notably underrepresented in SWE-bench. Among hundreds of tasks, there is only one example of a race condition. This gap is significant because concurrent programming is fundamental to performance-critical systems, from web servers and databases to distributed systems and mobile applications.

The challenge with evaluating concurrency bug fixes goes beyond just dataset coverage. SWE-bench validates proposed patches by running specific tests: those that should fail on the buggy code must pass after the fix, while existing tests should continue passing. This approach can work for deterministic bugs, but breaks down for concurrency issues. A test for a concurrency bug may pass or fail depending on thread scheduling—meaning a test might pass even when the underlying bug remains unfixed, simply because the problematic thread interleaving didn't occur during that particular test run.
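
To see why repeated runs are weak evidence, consider a minimal illustrative example of our own (not taken from SWE-bench or our dataset): a lost-update bug whose failing interleaving requires both threads to read the counter before either writes it back, which rarely happens in practice.

// Minimal illustrative example (ours, not from the benchmark): a data race whose failing
// interleaving is rare, so this "test" passes on almost every run even though the bug is real.
public class FlakyCounterTest {
    static class Counter {
        private int value = 0;
        void increment() { value = value + 1; }  // read-modify-write, not atomic
        int get() { return value; }
    }

    public static void main(String[] args) throws InterruptedException {
        Counter counter = new Counter();
        Thread t1 = new Thread(counter::increment);
        Thread t2 = new Thread(counter::increment);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Fails only if both threads read 0 before either writes 1 back; on most runs
        // thread-startup overhead serializes the increments and the check passes.
        if (counter.get() != 2) {
            throw new AssertionError("lost update: counter = " + counter.get());
        }
        System.out.println("PASS");
    }
}

Running this a hundred times in a row will almost always print PASS, which is exactly the kind of false confidence discussed below.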

We evaluated leading AI coding agents on concurrency bug fixes, and the results reveal an important limitation: without proper testing tools, even the most capable models produce patches that appear correct but silently fail under specific thread interleavings. This post explores how controlled concurrency testing with Fray [2] exposes these failures and why it's essential for reliable concurrency bug verification.

Setup

We evaluated a range of state-of-the-art models (including Claude Opus 4.5, Claude Sonnet 4.5, GPT-5.2, Gemini 3.0 Pro, and Qwen 3 Coder 480B) using scaffolding from OpenHands, an agentic framework for software engineering tasks.

Each model was tested on 39 Java concurrency bugs from our dataset, drawn from tasks on which Fray was previously evaluated: 28 single-class programs from SCTBench [3] and 11 bugs from the open-source Apache Kafka project. All of these bugs share the property that running their tests thousands of times does not reliably trigger the failure; the race conditions are too subtle and depend on specific thread interleavings. Fray, however, consistently finds each of these bugs within seconds.

We ran each model in two configurations: one without Fray, where the agent relies on its default tooling (reading, editing, and testing code from the shell), and one with Fray available as a tool call, which the agent can invoke to explore thread interleavings while checking its candidate fixes.

In both configurations, we used Fray as the final arbiter to verify whether the agent's proposed solution actually fixed the bug. This means that even in the "without Fray" configuration, agents might produce patches they believe are correct (based on their own testing), but we verify correctness using Fray's interleaving exploration. If the agent failed to produce a patch within 20 minutes, we treated the result as a failed case.

How agents behave by default

Without any additional tooling, AI agents typically verify concurrency fixes by using bash to run the tests repeatedly:

# Re-run the test many times, hoping a failing interleaving shows up by chance
for i in {1..100}; do
    echo "Run $i:"
    java MyProgram
    echo "Exit code: $?"
done

The logic is intuitive: if the test passes 10, 20, or even 100 times in a row, the fix must be correct. Unfortunately, this approach gives a false sense of confidence. The tests might pass simply because the problematic thread interleaving never occurred during those runs. As a result, the agent often stops after its first proposed solution, assuming from the passing runs that the fix is correct.

Example: WorkStealQueue

Let's look at a concrete example from our benchmark: WorkStealQueue. This is a lock-free work-stealing queue—a concurrent data structure where a single owner thread can push and pop items from one end (the tail), while multiple "stealer" threads grab work from the other end (the head). The test creates items, pushes them to the queue, and then has both the owner and stealers process them. Each item should be processed exactly once, verified by checking that field == 1.

The Bug: The original code has a race condition in the pop() method. When the queue has exactly one element, the owner decrements the tail and checks if (head <= tail). If a stealer concurrently increments head, both threads might think they successfully got the item, or the owner might read a stale value from the elems array since it's not volatile. This leads to items being processed multiple times or not at all.
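
To make the pattern easier to see, here is a deliberately simplified model of the race in our own words (the names and structure are illustrative; the benchmark's actual WorkStealQueue uses readV and explicit memory operations, as in the diff below):

// Simplified, illustrative sketch (not the benchmark's code): pop()'s check of head against
// the decremented tail and its read of the slot are two separate steps, so a concurrent
// steal() can advance head in between and both threads can take the same last item.
import java.util.concurrent.atomic.AtomicLong;

class TinyWorkQueue {
    private final Object[] elems = new Object[16];   // not volatile: stale reads are possible
    private final long mask = elems.length - 1;
    private final AtomicLong head = new AtomicLong(0);
    private long tail = 0;                           // written only by the owner thread

    void push(Object item) {                         // owner thread only
        elems[(int) (tail & mask)] = item;
        tail++;
    }

    Object pop() {                                   // owner thread only
        long t = tail - 1;
        tail = t;
        if (head.get() <= t) {                       // check ...
            return elems[(int) (t & mask)];          // ... then act: the race lives here
        }
        tail = t + 1;                                // queue looked empty; restore tail
        return null;
    }

    Object steal() {                                 // stealer threads
        long h = head.get();
        if (h <= tail - 1) {
            Object item = elems[(int) (h & mask)];
            if (head.compareAndSet(h, h + 1)) {
                return item;
            }
        }
        return null;
    }
}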

Claude Sonnet 4.5 attempted this task in both configurations. Without Fray, the agent produced a minimal fix:

// Changed comparison in pop() method
-if (readV(head) <= t) {
+if (readV(head) < t) {
     result[0] = elems[(int) (t & mask)];
     return true;
}

This single-character change (<= to <) tried to fix the boundary condition. However, this doesn't actually solve the problem. The issue isn't just the comparison—it's that the check-then-act sequence is not atomic. Even with <, a stealer can still increment head between the check and the array access, and the owner still reads from a non-volatile array that may have stale values. The agent tested the fix and marked it as complete. However, when we ran final verification:

Iteration 2616: Error found at step 105
AssertionError at WorkStealQueue$ObjType.check

Fray still found a failing interleaving after 2,616 iterations, demonstrating how subtle these race conditions can be. Without Fray to guide it, the agent stopped too early with an incomplete fix.

When agents have access to Fray as a tool call, they can verify fixes across thousands of different thread interleavings and use that feedback to iterate on failures that repeated unit-test runs would never have surfaced.

We can contrast this with the same model on the same task, this time with access to Fray. Over the course of five refinements guided by Fray, the agent produced a comprehensive fix addressing multiple race conditions.

Fray verified the resulting patch across 100,000 iterations with no errors. Given a stronger verification tool, the agent is willing to propose more involved patches rather than settling for the first, and often laziest, one.
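
The agent's actual patch is not reproduced here. As a rough illustration only (our sketch on the simplified model above, not the agent's code), the essential idea is to make the owner/stealer hand-off atomic so that head cannot advance between the check and the read; a lock-based version makes this explicit, though it gives up the lock-free design that a production fix would preserve:

// Illustrative only (not the agent's patch): serialize pop() and steal() on one monitor so
// the check-then-act in each method becomes a single atomic step.
class LockedWorkQueue {
    private final Object[] elems = new Object[16];
    private final long mask = elems.length - 1;
    private final Object handoff = new Object();
    private long head = 0;
    private long tail = 0;

    void push(Object item) {
        synchronized (handoff) {
            elems[(int) (tail & mask)] = item;
            tail++;
        }
    }

    Object pop() {
        synchronized (handoff) {              // check and act are now one atomic step
            if (head < tail) {
                tail--;
                return elems[(int) (tail & mask)];
            }
            return null;
        }
    }

    Object steal() {
        synchronized (handoff) {
            if (head < tail) {
                Object item = elems[(int) (head & mask)];
                head++;
                return item;
            }
            return null;
        }
    }
}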

Example: KAFKA-17402

A real-world bug from Apache Kafka demonstrates how agents can produce plausible-looking fixes that still fail under controlled concurrency testing. The bug occurs in DefaultStateUpdater, which manages state restoration for Kafka Streams tasks across multiple threads. When getTasks() is called, tasks could transiently appear in multiple internal queues during state transitions, potentially causing duplicate tasks to be returned.

GPT-5.2 (without Fray) analyzed the error message showing the test expected 2 tasks but got 3. The agent performed a detailed root cause analysis:

Agent's Diagnosis:

DefaultStateUpdater#getTasks() builds its result by streaming across multiple internal collections (pending adds, updatingTasks, restoredActiveTasks, etc.) and wraps each element in a new ReadOnlyTask before collecting into a Set. During normal concurrent operation, a Task can temporarily be present in more than one collection (e.g., after being added to restoredActiveTasks but before being removed from updatingTasks).

Because each wrapper is a distinct object (ReadOnlyTask has no equals/hashCode), the final Set can contain duplicates, inflating the size (seen as expected 2, got 3).

Fix: De-duplicate the underlying Task objects first, then wrap them.
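
The identity point is easy to reproduce in isolation. Here is a minimal sketch, with a hypothetical Wrapper class standing in for ReadOnlyTask: without equals()/hashCode(), two wrappers around the same underlying object are distinct elements of a HashSet.

// Minimal illustration: wrapper objects with identity semantics inflate a Set even though
// they wrap the same underlying task.
import java.util.HashSet;
import java.util.Set;

public class WrapperIdentityDemo {
    static class Wrapper {
        final Object inner;
        Wrapper(Object inner) { this.inner = inner; }
        // No equals()/hashCode() overrides: identity semantics, like ReadOnlyTask.
    }

    public static void main(String[] args) {
        Object task = new Object();              // the same underlying "task" ...
        Set<Wrapper> set = new HashSet<>();
        set.add(new Wrapper(task));              // ... seen in one internal queue
        set.add(new Wrapper(task));              // ... and, transiently, in another
        System.out.println(set.size());          // prints 2, not 1
    }
}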

Based on this analysis, the agent modified getTasks() to collect tasks into a Set<Task> first (removing transient duplicate visibility of the same task object) and only then map to ReadOnlyTask:

@Override
public Set getTasks() {
-   return executeWithQueuesLocked(() -> getStreamOfTasks()
-       .map(ReadOnlyTask::new)
-       .collect(Collectors.toSet()));
+   return executeWithQueuesLocked(
+       () -> getStreamOfTasks()
+           // Tasks may transiently appear in multiple internal queues during state transitions.
+           // Deduplicate the underlying tasks before wrapping them in ReadOnlyTask to avoid
+           // returning duplicates to callers.
+           .collect(Collectors.toSet())
+           .stream()
+           .map(ReadOnlyTask::new)
+           .collect(Collectors.toSet())
+   );
}

The agent's reasoning seems logical: it correctly identified that tasks appear in multiple collections concurrently and diagnosed the identity issue with the ReadOnlyTask wrapper. The existing Gradle unit tests passed, so the fix was assumed to be correct.

However, when Fray verified this patch:

Iteration 15: Error found at step 8179
AssertionFailedError: expected: <0> but was: <1>

After 15 iterations, Fray found a scenario where the test still failed. The deduplication approach treated the symptom, not the cause.

Compare this to the actual fix that Kafka developers merged:

private void maybeCompleteRestoration(final StreamTask task, ...) {
    ...
    changelogReader.unregister(changelogPartitions);
    addToRestoredTasks(task);
-   updatingTasks.remove(task.id());  // MOVED FROM HERE
    log.info("Stateful active task " + task.id() + " completed restoration");
}

private void addToRestoredTasks(final StreamTask task) {
    restoredActiveTasksLock.lock();
    try {
        restoredActiveTasks.add(task);
+       updatingTasks.remove(task.id());  // TO INSIDE THE LOCK
        log.debug("Active task " + task.id() + " was added to the restored tasks");
    } finally {
        restoredActiveTasksLock.unlock();
    }
}

The real issue was that updatingTasks.remove() was called outside the lock, creating a race condition. The correct fix moves this single line inside the lock so that the state transition is atomic. This is a fundamentally different approach from the agent's deduplication strategy, and with its current limited tooling the agent had no way to know that its fix was insufficient. This suggests agents need better tooling and reasoning to diagnose the root cause of real-world concurrency issues.
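
The underlying pattern generalizes: a state transition that touches two collections must happen under the same lock that readers use to take their snapshot, or a reader can briefly observe the item in both (or neither). Below is a minimal sketch of that pattern, with illustrative names rather than Kafka's actual classes.

// Sketch of the general pattern (names are ours, not Kafka's): move an item between two
// collections atomically with respect to readers by doing both updates under one lock.
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

class TaskRegistry {
    private final Set<String> updating = new HashSet<>();
    private final Set<String> restored = new HashSet<>();
    private final ReentrantLock lock = new ReentrantLock();

    // Correct: adding to restored and removing from updating form one atomic transition,
    // mirroring the merged Kafka fix that moved updatingTasks.remove() inside the lock.
    void markRestored(String taskId) {
        lock.lock();
        try {
            restored.add(taskId);
            updating.remove(taskId);
        } finally {
            lock.unlock();
        }
    }

    // Readers snapshot under the same lock, so a task is never seen in both sets at once.
    Set<String> snapshot() {
        lock.lock();
        try {
            Set<String> all = new HashSet<>(updating);
            all.addAll(restored);
            return all;
        } finally {
            lock.unlock();
        }
    }
}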

Evaluation Summary

We evaluated multiple state-of-the-art models on 39 concurrency bug tasks, both with and without Fray. The results reveal different patterns for SCTBench versus real-world Kafka bugs.

SCTBench Performance (28 tasks)

On the single-class programs from SCTBench, Fray provides significant improvements across all models:

Model               | Pass@1 | Pass@1 (+Fray) | Change
Claude Opus 4.5     | 92.9%  | 99.3%          | +6.4%
Claude Sonnet 4.5   | 93.6%  | 95.7%          | +2.1%
GPT-5.2             | 95.7%  | 100.0%         | +4.3%
Gemini 3.0 Pro      | 67.9%  | 90.7%          | +22.8%
Qwen 3 Coder 480B   | 70.0%  | 75.7%          | +5.7%

The improvements are consistent across all models; with Fray, the strongest models reach near-perfect pass rates, and Gemini 3.0 Pro gains the most (+22.8%). Fray helps catch false positives where agents produce patches that seem correct but fail under systematic interleaving exploration, a pattern the WorkStealQueue example above demonstrates clearly.

Real-World Kafka Performance (11 tasks)

On real-world Apache Kafka bugs, the picture is very different:

Model               | Pass@1 | Pass@1 (+Fray) | Change
Claude Opus 4.5     | 30.9%  | 34.5%          | +3.6%
Claude Sonnet 4.5   | 32.7%  | 36.4%          | +3.7%
GPT-5.2             | 21.8%  | 43.6%          | +21.8%
Gemini 3.0 Pro      | 12.7%  | 14.5%          | +1.8%
Qwen 3 Coder 480B   | 18.2%  | 7.3%           | -10.9%

Performance on real-world issues is low with or without Fray, suggesting that Fray alone is insufficient for these more complex bugs. Real-world concurrency issues in large codebases like Kafka require deeper reasoning about system architecture, state management patterns, and the interaction between multiple components—challenges that go beyond what improved verification tooling can address. This points to a critical gap: while Fray helps agents iterate and verify their fixes, agents still struggle with the initial diagnosis and reasoning required for complex real-world concurrency bugs.

Takeaways

Our findings reveal both the value and limitations of verification tooling for AI coding agents:

1. Stronger Verification Tools Are Necessary to Expand SWE-Agent Evaluations

Unit testing alone is insufficient for evaluating software engineering agents across many domains. In concurrency, the non-deterministic nature of thread interleavings means that standard test suites can pass even when bugs remain unfixed. Our results demonstrate that controlled concurrency testing tools like Fray are essential for reliable verification.

Concurrency is unlikely to be the only domain where standard testing falls short. We expect similar challenges in other areas with inherent non-determinism, such as distributed systems and time-dependent (date/time) logic. To meaningfully expand the scope of SWE-agent evaluations beyond deterministic bug fixes, we may need verification approaches tailored to each problem domain's unique challenges.

2. SWE-Agent Concurrency Reasoning Needs Fundamental Improvements

The low real-world performance with or without Fray reveals a critical gap: agents struggle to find the root causes of complex concurrency bugs. Simply reading through code and running unit tests does not provide the feedback they need. Improving concurrency reasoning will likely require better debugging tooling and feedback mechanisms beyond what's currently available, such as ways for agents to observe the specific thread interleavings that trigger a failure rather than just a pass/fail signal.

Try It Yourself

Spaghetti Bench is open source and available on GitHub, where you can browse the dataset and run the benchmark yourself.

Future Work

We plan to expand Spaghetti Bench by adding more real-world concurrency bugs to the dataset and extending support to other languages with mature concurrency testing tools, such as Rust with Shuttle.

Conclusion

Concurrency bugs represent a unique challenge for AI coding agents because the standard verification approach of running tests repeatedly is unreliable: a buggy patch can pass every run simply because the failing interleaving never occurs. Our evaluation shows that controlled concurrency testing tools like Fray are essential for expanding the scope of software engineering agent evaluation, both by providing reliable verification and by helping agents refine initially insufficient patches.

As software engineering agents become more capable of producing entire software repositories rapidly and cheaply, equipping them with the right specialized tools for validating solutions in various problem domains becomes increasingly important. Concurrency is just one example, but we expect similar patterns in other areas where standard testing is insufficient.

References

  1. Jimenez, Carlos E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv preprint arXiv:2310.06770 (2023).
  2. Li, Ao, et al. "Fray: An Efficient General-Purpose Concurrency Testing Platform for the JVM." Proceedings of the ACM on Programming Languages 9.OOPSLA2 (2025): 4035-4063.
  3. Thomson, Paul, Alastair F. Donaldson, and Adam Betts. "Concurrency testing using schedule bounding: An empirical study." Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming (2014).