Evaluation

The thesis of this work is that interactive debugging facilitates localization with few interactions. Users can explore the proof tree either top-down, from the root, or bottom-up, from failed obligations. An error is always discoverable by a full traversal of the top-down tree; this alone, however, does not make for a quality debugging experience. The heuristics Argus employs to rank failed obligations in the bottom-up view are discussed elsewhere in this work; we want users to encounter the root cause as soon as possible. This work evaluates Argus on two research questions:

RQ1 How many cognitive steps does it take to reach the root cause in the top-down and bottom-up views?

RQ2 Does our ranking heuristic reduce the number of cognitive steps in the bottom-up view?

Procedure

A 2022 Rust Foundation grant tasked a member of the community with improving trait error diagnostics. The solution proposed in this work is discussed later, but the grant awardees assembled a mass of difficult-to-debug, trait-error-laden programs from the community (https://github.com/weiznich/rust-foundation-community-grant): in other words, a ready-made Argus test suite. This suite relies on trait-heavy crates such as the previously mentioned Axum and Diesel; the table below shows the full list of included crates. It is interesting that many of these libraries boast being “simple” or “for humans,” yet their usage obfuscates diagnostics and makes debugging difficult.

Crate          Description
uom            Units of measurement
typed-builder  Compile-time type-checked builder derive
easy-ml        Machine learning library providing matrices, named tensors, linear algebra, and automatic differentiation, aimed at being easy to use
diesel         A safe, extensible ORM and Query Builder for PostgreSQL, SQLite, and MySQL
chumsky        A parser library for humans with powerful error recovery
bevy           A refreshingly simple data-driven game engine and app framework
axum           Web framework that focuses on ergonomics and modularity
entrait        Loosely coupled Rust application design made easy

The notion of a root cause is important for our evaluation. Each test was analyzed, and its root cause was predetermined by the authors. In some cases, tests from the community suite were split and modified to contain exactly one error. After analyzing all the community tests, several were withdrawn from the Argus suite for one of three reasons: the root cause remained ambiguous after test simplification, the behavior of the stable and in-progress trait solvers diverged, or a bug within Argus caused evaluation to fail.

Ambiguous root causes

Discussion of a root cause has until now assumed that a single root cause exists. However, by the nature of some trait errors, it may not be possible to present a single failed obligation to the user. To explain how this can happen, let us turn our attention once again to the Axum web framework. Recall that in our running example the handler function did not implement the trait Handler due to an incorrect parameter type. That error had a single root cause, reflecting the kind of error we can evaluate within the scope of this work.
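As a refresher, here is a minimal sketch of the shape of that working case; the handler name and signature are assumed for illustration and are not the exact code of the running example.

use axum::{routing::post, Router};

// `String` acts both as an extractor (it consumes the request body) and
// as a response type, so `login` satisfies the `Handler` trait.
async fn login(body: String) -> String {
    body
}

fn app() -> Router {
    Router::new().route("/login", post(login))
}

Swapping the parameter for a type that is not an extractor breaks exactly one obligation, which gives the error a clear root cause. By contrast, the example below does not produce a single root-cause failure.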

use axum::{routing::post, Router};
use tokio::net::TcpListener;

struct A;

#[tokio::main]
async fn main() {
    // `A` satisfies none of the handler obligations, so `post(A)`
    // fails to type check.
    let app = Router::new().route("/login", post(A));
    let listener = TcpListener::bind("0.0.0.0:3000")
        .await
        .unwrap();
    axum::serve(listener, app).await.unwrap();
}

In this example the developer attempts to use the struct A as a handler. The struct satisfies none of the handler requirements, and therefore its usage results in an error. Note that this error has no single specific failure: A is not a function, regardless of arity, and A does not implement IntoResponse. (If a value can be turned directly into a response, it is usable as a sort of static handler; a subtle detail that hasn’t been important for previous examples.) It isn’t that a single obligation fails, but rather that they all fail.
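One possible repair, sketched below, makes the static-handler case concrete: give A a conversion into a response. This is an illustrative sketch rather than the intended fix from the test suite, and we assume here that the static-handler case also requires Clone.

use axum::response::{IntoResponse, Response};

// Replaces the bare `struct A;` from the example above.
#[derive(Clone)]
struct A;

impl IntoResponse for A {
    fn into_response(self) -> Response {
        // Respond with a fixed string; any response body would do.
        "A".into_response()
    }
}

With such an impl in place, post(A) could treat A as a static handler that ignores the request entirely.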

Argus reports the resulting list of errors. It is the sum of all these failed obligations that describes the failure: A is incompatible as a handler in every possible way.

From the initial test suite, six test cases were removed due to an ambiguous root cause.

Divergent Errors

Argus requires a nightly version of Rust with the in-progress trait solver enabled. This solver still has known unsoundness and incompleteness issues. Any test case in the suite that resulted in a different trait error, or that produced a result that is incorrect under the Rust semantics, was thrown out. Only two tests were removed from the initial suite for this reason.

Argus Bugs

Understatement of the century: despite what the authors might want to believe, even research software can contain bugs. Two tests from the original community suite were removed due to a bug in the Argus implementation, notably in serialization. Neither bug poses a threat to the validity of the approach; they merely highlight engineering difficulties.

After splitting and filtering the community suite, 12 test cases remained on which we ran the evaluation. The root cause for each test was determined by the authors and written in plain text in the Argus test suite. We use an automated test runner to load each workspace into a Playwright Chromium instance in which Argus is opened. Our tool expands the bottom-up and top-down versions of the proof tree views and textually searches for the root cause. Nodes in the bottom-up view are evaluated by rank, i.e., their position in the list. Nodes in the top-down view are evaluated by the number of “cognitive steps” they sit from the tree root. We define a cognitive step as reading or clicking on a node in the proof tree, and we assume that developers know exactly which tree nodes to expand and read strictly from top to bottom; this measure is therefore a lower bound on the top-down debugging process. The nodes in the top-down view were manually checked by the authors to ensure a correct count of cognitive steps.
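To make the top-down metric concrete, the following is a minimal sketch of one way to formalize the count; the Node type is hypothetical, and this is not the code of the evaluation harness. Reading a node costs one step, and a developer who always expands the correct child still reads the siblings listed above it.

struct Node {
    label: String,
    children: Vec<Node>,
}

/// Best-case number of nodes read before reaching the first node whose
/// label contains `target`, or `None` if the target does not occur.
fn cognitive_steps(node: &Node, target: &str) -> Option<usize> {
    if node.label.contains(target) {
        return Some(1); // reading this node is itself one step
    }
    node.children
        .iter()
        .enumerate()
        .filter_map(|(i, child)| {
            // `i` siblings are read top-to-bottom before the right child.
            cognitive_steps(child, target).map(|s| 1 + i + s)
        })
        .min()
}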

Results

Shown above is the cumulative distribution of the fraction of tests for the bottom-up and top-down metrics. Left: the fraction of tests such that the root cause was at rank \(< K\), where \(K\) is the position in the bottom-up list. Right: the fraction of tests such that the root cause was \(< S\) cognitive steps from the top-down tree root. Bottom: node counts in the proof trees by library.

Above, the maroon line shows the cumulative distribution of the fraction of tests such that the root cause was at rank \(< K\) when the error nodes were shuffled randomly. Shown in blue, for contrast, is the CDF for nodes ranked by the Argus heuristic.
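The quantity plotted on the left can be made concrete with a small sketch; the helper below and its 0-indexed rank convention are assumptions for illustration, not the plotting code used here.

// Fraction of tests whose root cause appears at rank less than `k`
// in the bottom-up list (ranks assumed 0-indexed).
fn fraction_below(ranks: &[usize], k: usize) -> f64 {
    let hits = ranks.iter().filter(|&&r| r < k).count();
    hits as f64 / ranks.len() as f64
}

Sweeping k from 1 to the maximum list length produces the plotted CDF.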

Analysis

RQ1 The comparison between bottom-up and top-down nodes shows a substantive improvement in the number of interactions required before encountering the root cause. This is strengthened by the assumption made about the cognitive steps to reach an error node: we have assumed that the developer always makes the right decision when expanding the tree, making \(S\) a lower bound on the cognitive interactions required. In practice, the proof trees can have a high branching factor, and developers unfamiliar with the library internals could easily expand subtrees in the wrong direction. We point readers to the per-library node counts (bottom): Axum has the fewest nodes per tree on average, 1758, which is still a large amount of data in which to potentially get lost. The bottom-up view is crucial to getting developers the information they need faster.

RQ2 The heuristic ablation data (left) show that randomly sorting the error nodes in the bottom-up list is insufficient. The Argus heuristics do, in fact, matter: an additional 60% of test cases had the root cause shown first to the user.

This is not to say there isn’t room for improvement. Comparing the number of interface interactions in the bottom-up versus the top-down view, we see that in some cases it is easier to traverse the tree. Again, the cognitive-step count is a lower bound, but the data highlight that there is room for improvement in the heuristics. The cumulative distribution by library supports the need for a more sophisticated analysis: tests involving Diesel perform much worse than those involving Axum, Bevy, or nalgebra. Further investigation is required to understand the discrepancy. One intuition may be the high variance in proof tree size; node counts for Diesel test cases range from 250 to 47,000.