The thesis of this work is that interactive debugging facilitates localization with few interactions. Users can explore the proof tree either top-down, from the root, or bottom-up, from failed obligations. An error is always discoverable by a full traversal of the top-down tree; this alone, however, does not make for a quality debugging experience. We evaluate the following research questions:
RQ1 How many cognitive steps does it take to reach the root cause in the top-down and bottom-up views?
RQ2 Does our ranking heuristic reduce the number of cognitive steps in the bottom-up view?
Procedure
A 2022 Rust Foundation grant tasked a member of the community with improving trait error diagnostics. The solution proposed in this work is discussed later. The community test suite draws test cases from the following crates:
| Crate | Description |
|---|---|
| uom | Units of measurement |
| typed-builder | Compile-time type-checked builder derive |
| easy-ml | Machine learning library providing matrices, named tensors, linear algebra and automatic differentiation aimed at being easy to use |
| diesel | A safe, extensible ORM and Query Builder for PostgreSQL, SQLite, and MySQL |
| chumsky | A parser library for humans with powerful error recovery |
| bevy | A refreshingly simple data-driven game engine and app framework |
| axum | Web framework that focuses on ergonomics and modularity |
| entrait | Loosely coupled Rust application design made easy |
The notion of a root cause is important for our evaluation. Each test was analyzed and the root cause was predetermined by the authors. In some cases, tests from the community suite were split and modified to contain exactly one error. After analyzing all the community tests, several were withdrawn from the Argus suite for one of three reasons: the root cause remained ambiguous after test simplification, the behavior of the stable and in-progress trait solvers diverged, or a bug within Argus caused evaluation to fail.
Ambiguous Root Causes
Discussion of a root cause has so far assumed that a single root cause exists. However, by the nature of some trait errors, it may not be possible to present a single failed obligation to the user. To explain how this might happen, let us turn our attention once again to the Axum web framework. Recall that in our running example the handler function did not implement the trait Handler due to an incorrect parameter type. That error had a clear root cause, reflecting the kinds of errors we can evaluate within the scope of this work. The program below, in contrast, produces an error without a single root cause.
```rust
use axum::{body::Bytes, routing::post, Router};
use tokio::net::TcpListener;

struct A;

#[tokio::main]
async fn main() {
    let app = Router::new().route("/login", post(A));
    let listener = TcpListener::bind("0.0.0.0:3000")
        .await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```
Argus reports the resulting error list; no single failed obligation stands out as the root cause.
From the initial test suite six test cases were removed due to an ambiguous root cause.
Divergent Errors
Argus requires a nightly version of Rust with the in-progress trait solver enabled. This solver still has known unsoundness and incompleteness issues. Any test case in the suite that resulted in a different trait error, or produced a result that is incorrect by the definition of Rust's semantics, was discarded. Only two tests were removed from the initial suite for this reason.
Argus Bugs
Despite what the authors would like to believe, even research software can contain bugs. Two tests from the original community suite were removed due to bugs in the Argus implementation, notably in serialization. Neither of these poses a threat to the validity of the approach; they merely highlight engineering difficulties.
After splitting and filtering the community suite, 12 test cases remained on which we ran the evaluation. The root causes for each test were determined by the authors and recorded in plain text in the Argus test suite. An automated test runner loads each workspace into a Playwright Chromium instance in which Argus is opened. The tool expands the bottom-up and top-down versions of the proof tree views and textually searches for the root cause. Nodes in the bottom-up view are evaluated by rank, i.e., their position in the list. Nodes in the top-down view are evaluated by the number of "cognitive steps" they are from the tree root, where a cognitive step is reading or clicking on a node in the proof tree. We assume that developers know exactly which tree nodes to expand and read strictly top-to-bottom; this measure is therefore a lower bound for the top-down debugging process. The nodes in the top-down view were manually checked by the authors to ensure a correct count of cognitive steps.
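The lower-bound step count described above can be sketched as a search along the root-to-target path in a toy proof tree. The `Node` type and the labels below are illustrative assumptions for this sketch, not Argus's actual data structures:

```rust
// Hypothetical proof-tree node; the type and labels are illustrative
// assumptions, not Argus's actual data structures.
struct Node {
    label: &'static str,
    children: Vec<Node>,
}

// Lower bound on cognitive steps: assume the developer reads and
// expands only the nodes on the root-to-target path, so the cost is
// the number of nodes on that path.
fn cognitive_steps(node: &Node, target: &str) -> Option<usize> {
    if node.label == target {
        return Some(1);
    }
    node.children
        .iter()
        .find_map(|child| cognitive_steps(child, target))
        .map(|steps| steps + 1)
}

// A toy tree loosely shaped like a Handler obligation.
fn example_tree() -> Node {
    Node {
        label: "Handler",
        children: vec![
            Node { label: "IntoResponse", children: vec![] },
            Node {
                label: "FromRequest",
                children: vec![Node { label: "Deserialize", children: vec![] }],
            },
        ],
    }
}

fn main() {
    // The target is three nodes deep, so the lower bound is 3 steps.
    println!("{:?}", cognitive_steps(&example_tree(), "Deserialize"));
}
```

Under this assumption, \(S\) equals the length of the path from the root to the failing node, which is why the measured counts are lower bounds on a real debugging session, where wrong subtrees may also be explored.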
Results
Shown above is the cumulative distribution of the fraction of tasks for the bottom-up and top-down metrics. Left: the fraction of tests such that the root cause was at rank \(\lt K\). Remember that \(K\) is the position in the bottom-up list. Right: the fraction of tests such that the root cause was \(\lt S\) cognitive steps from the top-down tree root. Bottom: node count in the proof trees by library.
Above, the maroon line shows the cumulative distribution of the fraction of tests such that the root cause was at rank \(\lt K\) when shuffled randomly. Shown in blue, for contrast, the CDF for nodes ranked by the Argus heuristic.
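The plotted quantity can be made concrete with a small sketch: given the rank of the root cause in each test's bottom-up list, the fraction plotted at a threshold \(K\) is the share of tests whose rank falls below \(K\). The ranks in this sketch are illustrative values, not the measured data:

```rust
// Fraction of tests whose root cause appears at rank < k in the
// bottom-up list; this is the quantity on the y-axis of the CDF.
fn fraction_below(ranks: &[usize], k: usize) -> f64 {
    let hits = ranks.iter().filter(|&&rank| rank < k).count();
    hits as f64 / ranks.len() as f64
}

fn main() {
    // Illustrative ranks only (1 = top of the bottom-up list).
    let ranks = [1, 1, 2, 4, 1, 3, 1, 2];
    for k in 1..=5 {
        println!("K = {k}: {:.2}", fraction_below(&ranks, k));
    }
}
```

Shuffling `ranks` before calling `fraction_below` reproduces the random baseline used in the ablation.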
Analysis
RQ1 The comparison between the bottom-up and top-down views shows a substantive reduction in the number of interactions required before encountering the root cause. This result is strengthened by the assumption made about cognitive steps: we assume the developer always makes the right decision when expanding the tree, making \(S\) a lower bound on the cognitive interactions required. In practice, the proof trees can have a high branching factor, and developers unfamiliar with the library internals could easily expand subtrees in the wrong direction. We point readers to the per-library node counts, which show how large these trees can grow.
RQ2 The heuristic ablation data, comparing the Argus ranking against a random shuffle, suggest that the heuristic places the root cause earlier in the bottom-up list than a random ordering would.
This is not to say there isn't room for improvement. Comparing the number of interface interactions in the bottom-up versus the top-down view, we see that in some cases it is easier to traverse the tree. Again, the cognitive-step count is a lower bound, but the data highlight that the heuristics can be improved. The cumulative distribution by library supports the need for a more sophisticated analysis: tests with Diesel perform much worse than those involving Axum, Bevy, or nalgebra. Further investigation is required to understand the discrepancy; one intuition is the high variance in proof tree size across libraries, visible in the per-library node counts.