Evaluation

The thesis of this work is that interactive debugging facilitates localization with few interactions. Users can explore the proof tree either top-down, from the root, or bottom-up, from failed obligations. An error is always discoverable by a full traversal of the top-down tree; this alone, however, does not make for a quality debugging experience. The heuristics Argus employs to rank failed obligations in the bottom-up view are discussed elsewhere in this work; we want users to encounter the root cause as soon as possible. This work evaluates Argus on two research questions:

RQ1 How many cognitive steps does it take to reach the root cause in the top-down and bottom-up views?

RQ2 Does our ranking heuristic reduce the number of cognitive steps in the bottom-up view?

Procedure

A 2022 Rust Foundation grant tasked a member of the community with improving trait error diagnostics. The solution proposed in this work is discussed later, but the grant awardees assembled a mass of difficult-to-debug, trait-error-laden programs from the community (https://github.com/weiznich/rust-foundation-community-grant): in other words, a ready-made Argus test suite. This suite relies on trait-heavy crates such as the previously mentioned Axum and Diesel; the table below shows the full list of included crates. It is interesting that many of these libraries boast being “simple” or “for humans,” yet their usage obfuscates diagnostics and makes debugging difficult.

Crate          Description
uom            Units of measurement
typed-builder  Compile-time type-checked builder derive
easy-ml        Machine learning library providing matrices, named tensors, linear algebra, and automatic differentiation, aimed at being easy to use
diesel         A safe, extensible ORM and Query Builder for PostgreSQL, SQLite, and MySQL
chumsky        A parser library for humans with powerful error recovery
bevy           A refreshingly simple data-driven game engine and app framework
axum           Web framework that focuses on ergonomics and modularity
entrait        Loosely coupled Rust application design made easy

The notion of a root cause is important for our evaluation. Each test was analyzed, and its root cause was predetermined by the authors. In some cases, tests from the community suite were split and modified to contain exactly one error. After analyzing all the community tests, several were withdrawn from the Argus suite for one of three reasons: the root cause remained ambiguous after test simplification, the behavior of the stable and in-progress trait solvers diverged, or a bug within Argus caused evaluation to fail.

Ambiguous root causes

Discussion of a root cause has until now assumed that a single root cause exists. However, by the nature of some trait errors, it may not be possible to present a single failed obligation to the user. To explain how this can happen, let us turn our attention once again to the Axum web framework. Recall that in our running example the handler function did not implement the trait Handler due to an incorrect parameter type. That error had a single root cause, reflecting the kind of error we can evaluate within the scope of this work.
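As a refresher, here is a minimal sketch of the shape of that working case; the handler name and signature are assumed for illustration and are not the exact code of the running example.

use axum::{routing::post, Router};

// `String` acts both as an extractor (it consumes the request body) and
// as a response type, so `login` satisfies the `Handler` trait.
async fn login(body: String) -> String {
    body
}

fn app() -> Router {
    Router::new().route("/login", post(login))
}

Swapping the parameter for a type that is not an extractor breaks exactly one obligation, which gives the error a clear root cause. By contrast, the example below does not produce a single root-cause failure.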

use axum::{routing::post, Router};
use tokio::net::TcpListener;

struct A;

#[tokio::main]
async fn main() {
    // `A` satisfies none of the handler obligations, so `post(A)`
    // fails to type check.
    let app = Router::new().route("/login", post(A));
    let listener = TcpListener::bind("0.0.0.0:3000")
        .await
        .unwrap();
    axum::serve(listener, app).await.unwrap();
}

In this example the developer attempts to use the struct A as a handler. The struct satisfies none of the handler requirements, and therefore its usage results in an error. Note that this error has no single specific failure: A is not a function, regardless of arity, and A does not implement IntoResponse. (If a value can be turned directly into a response, it is usable as a sort of static handler; a subtle detail that hasn’t been important for previous examples.) It isn’t that a single obligation fails, but rather that they all fail.
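One possible repair, sketched below, makes the static-handler case concrete: give A a conversion into a response. This is an illustrative sketch rather than the intended fix from the test suite, and we assume here that the static-handler case also requires Clone.

use axum::response::{IntoResponse, Response};

// Replaces the bare `struct A;` from the example above.
#[derive(Clone)]
struct A;

impl IntoResponse for A {
    fn into_response(self) -> Response {
        // Respond with a fixed string; any response body would do.
        "A".into_response()
    }
}

With such an impl in place, post(A) could treat A as a static handler that ignores the request entirely.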

Argus reports the resulting list of errors. It is the sum of all these failed obligations that describes the failure: A is incompatible as a handler in every possible way.

From the initial test suite, six test cases were removed due to an ambiguous root cause.

Divergent Errors

Argus requires a nightly version of Rust with the in-progress trait solver enabled. This solver still has known unsoundness and incompleteness issues. Any test case in the suite that resulted in a different trait error, or that produced a result that is incorrect under the Rust semantics, was thrown out. Only two tests were removed from the initial suite for this reason.

Argus Bugs

Understatement of the century: despite what the authors might want to believe, even research software can contain bugs. Two tests from the original community suite were removed due to a bug in the Argus implementation, notably in serialization. Neither bug poses a threat to the validity of the approach; they merely highlight engineering difficulties.

After splitting and filtering the community suite, 12 test cases remained on which we ran the evaluation. The root cause for each test was determined by the authors and written in plain text in the Argus test suite. We use an automated test runner to load each workspace into a Playwright Chromium instance in which Argus is opened. Our tool expands the bottom-up and top-down versions of the proof tree views and textually searches for the root cause. Nodes in the bottom-up view are evaluated by rank, i.e., their position in the list. Nodes in the top-down view are evaluated by the number of “cognitive steps” they sit from the tree root. We define a cognitive step as reading or clicking on a node in the proof tree, and we assume that developers know exactly which tree nodes to expand and read strictly from top to bottom; this measure is therefore a lower bound on the top-down debugging process. The nodes in the top-down view were manually checked by the authors to ensure a correct count of cognitive steps.
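To make the top-down metric concrete, the following is a minimal sketch of one way to formalize the count; the Node type is hypothetical, and this is not the code of the evaluation harness. Reading a node costs one step, and a developer who always expands the correct child still reads the siblings listed above it.

struct Node {
    label: String,
    children: Vec<Node>,
}

/// Best-case number of nodes read before reaching the first node whose
/// label contains `target`, or `None` if the target does not occur.
fn cognitive_steps(node: &Node, target: &str) -> Option<usize> {
    if node.label.contains(target) {
        return Some(1); // reading this node is itself one step
    }
    node.children
        .iter()
        .enumerate()
        .filter_map(|(i, child)| {
            // `i` siblings are read top-to-bottom before the right child.
            cognitive_steps(child, target).map(|s| 1 + i + s)
        })
        .min()
}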

Results

Shown above is the cumulative distribution of the fraction of tests for the bottom-up and top-down metrics. Left: the fraction of tests such that the root cause was at rank \(< K\), where \(K\) is the position in the bottom-up list. Right: the fraction of tests such that the root cause was \(< S\) cognitive steps from the top-down tree root. Bottom: node counts in the proof trees by library.

Above, the maroon line shows the cumulative distribution of the fraction of tests such that the root cause was at rank \(< K\) when the error nodes were shuffled randomly. Shown in blue, for contrast, is the CDF for nodes ranked by the Argus heuristic.
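The quantity plotted on the left can be made concrete with a small sketch; the helper below and its 0-indexed rank convention are assumptions for illustration, not the plotting code used here.

// Fraction of tests whose root cause appears at rank less than `k`
// in the bottom-up list (ranks assumed 0-indexed).
fn fraction_below(ranks: &[usize], k: usize) -> f64 {
    let hits = ranks.iter().filter(|&&r| r < k).count();
    hits as f64 / ranks.len() as f64
}

Sweeping k from 1 to the maximum list length produces the plotted CDF.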

Analysis

RQ1 The comparison between bottom-up and top-down nodes shows a substantive improvement in the number of interactions required before encountering the root cause. This is strengthened by the assumption made about the cognitive steps to reach an error node: we have assumed that the developer always makes the right decision when expanding the tree, making \(S\) a lower bound on the cognitive interactions required. In practice, the proof trees can have a high branching factor, and developers unfamiliar with the library internals could easily expand subtrees in the wrong direction. We point readers to the per-library node counts (bottom): Axum has the fewest nodes per tree on average, 1758, which is still a large amount of data in which to potentially get lost. The bottom-up view is crucial to getting developers the information they need faster.

RQ2 The heuristic ablation data (left) show that randomly sorting the error nodes in the bottom-up list is insufficient. The Argus heuristics do, in fact, matter: an additional 60% of test cases had the root cause shown first to the user.

This is not to say there isn’t room for improvement. Comparing the number of interface interactions in the bottom-up versus the top-down view, we see that in some cases it is easier to traverse the tree. Again, the cognitive-step count is a lower bound, but the data highlight that there is room for improvement in the heuristics. The cumulative distribution by library supports the need for a more sophisticated analysis: tests involving Diesel perform much worse than those involving Axum, Bevy, or nalgebra. Further investigation is required to understand the discrepancy. One intuition may be the high variance in proof tree size; node counts for Diesel test cases range from 250 to 47,000.