OSS maintenance triage, research metrics, and live repository overview

ML Evaluation Workspace

Keep the model story clear: one page for the signal, one for the repos, one for the artifacts.

Overview stays focused on AUROC, Brier, inactivity rate, and calibration. Dataset, Repositories, and Runs hold the deeper inspection views so the workflow stays readable.

Overview

Live quality snapshot

Dataset

Coverage and feature inventory

Repos

Stars, notes, and activity

Runs

Cached artifacts and splits

No cached run

Latest training result without the dashboard sprawl.

Trigger training here, keep the key metrics above the fold, and push data inspection and artifact history into their own pages.

Training base

0 snapshots

0 repositories in the current base

Dataset hash

Pending

No cached artifact yet

Time-aware split

Pending

Split appears after the first completed run
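
A time-aware split usually trains on earlier snapshots and evaluates on later ones, so the held-out slice never leaks future information into training. The following is a minimal sketch of that idea, not the workspace's actual split logic: the `snapshot_date` field, the `time_aware_split` helper, and the cutoff value are all hypothetical.

```python
from datetime import date


def time_aware_split(rows, cutoff):
    """Split rows into (train, held_out) around a cutoff date.

    Assumes each row is a dict with a hypothetical 'snapshot_date' key.
    Rows before the cutoff go to training; the rest are held out.
    """
    train = [r for r in rows if r["snapshot_date"] < cutoff]
    held_out = [r for r in rows if r["snapshot_date"] >= cutoff]
    return train, held_out


# Illustrative rows only; the repository names are placeholders.
rows = [
    {"repo": "a/one", "snapshot_date": date(2023, 1, 1)},
    {"repo": "b/two", "snapshot_date": date(2023, 6, 1)},
    {"repo": "c/three", "snapshot_date": date(2024, 1, 1)},
]
train, held_out = time_aware_split(rows, cutoff=date(2023, 7, 1))
```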

Inspect dataset

Inspect repos

Inspect runs

Latest Artifact

What changed in the current training picture

Model

Waiting for first run

Observed window

Waiting for first completed artifact

Labeled rows

0 labeled / 0 total

Feature count

0 features in the latest artifact

Trigger the first run to cache a live evaluation artifact and populate the dataset and run-history pages.

Subpages reduce clutter

Toasts handle action feedback

Overview stays metric-first

Quality

Pending

Combined held-out score from AUROC skill and Brier skill.

AUROC

Pending

Ranking quality on the held-out evaluation slice.

Brier

Pending

Calibration-sensitive probability error. Lower is better.

Inactive 12m rate

Pending

Positive-label pressure in the current held-out slice.

F1

Pending

Thresholded balance of precision and recall.

Precision

Pending

How often predicted inactivity is correct.

Recall

Pending

How much true inactivity the model is catching.

Log loss

Pending

Penalty for overconfident wrong probabilities.
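
To illustrate why log loss punishes overconfident mistakes, here is a minimal sketch of binary log loss in plain Python. The `log_loss` helper and its `eps` clipping value are illustrative assumptions, not the workspace's own implementation.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy over labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip so that log(0) never occurs for probabilities of exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# A confident wrong prediction costs far more than a hesitant wrong one.
confident_wrong = log_loss([1], [0.01])
hesitant_wrong = log_loss([1], [0.4])
```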

Calibration

No calibration artifact yet

Once a completed run produces evaluation bins, the reliability curve will render here from the cached artifact.

Metric Guide

Read the top-line metrics without leaving the page

Quality

A held-out summary score that combines AUROC skill with Brier skill. It is useful for comparison, not as a standalone proof.
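
The page does not state how the two skills are combined, so the following is only one plausible sketch. It assumes AUROC skill is rescaled so that 0.5 maps to 0 and 1.0 maps to 1, that Brier skill is measured against a base-rate forecast, and that the two are averaged with equal weight. All function names and the equal weighting are hypothetical.

```python
def auroc_skill(auroc):
    """Rescale AUROC so chance level (0.5) is 0 and a perfect ranking is 1."""
    return 2.0 * auroc - 1.0


def brier_skill(brier, base_rate):
    """Brier skill versus always predicting the base rate.

    The reference Brier score of a constant base-rate forecast is p * (1 - p).
    """
    reference = base_rate * (1.0 - base_rate)
    if reference == 0.0:
        raise ValueError("base rate must be strictly between 0 and 1")
    return 1.0 - brier / reference


def quality(auroc, brier, base_rate):
    """Assumed combination: equal-weight mean of the two skill scores."""
    return 0.5 * (auroc_skill(auroc) + brier_skill(brier, base_rate))
```

Under these assumptions a model no better than chance scores 0 and a perfect model scores 1, which is what makes the number useful for comparing runs.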

AUROC

How well the model ranks riskier dependencies ahead of less risky ones across thresholds. Higher is better.
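
AUROC can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting as half. A minimal pairwise sketch of that definition follows; the `auroc` helper is illustrative and is quadratic in the number of rows, so it suits small held-out slices only.

```python
def auroc(y_true, y_score):
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


score = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```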

Brier score

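A calibration-sensitive probability error metric: the mean squared difference between the predicted probability and the observed outcome. Lower is better. It matters for thesis-style reliability claims because it penalizes miscalibrated probabilities, not just ranking errors.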
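
As a minimal sketch, the Brier score for binary labels is the mean squared gap between each predicted probability and the 0/1 outcome. The `brier_score` helper below is illustrative, not the workspace's own code.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


perfect = brier_score([1, 0], [1.0, 0.0])
uninformative = brier_score([1, 0], [0.5, 0.5])
```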

Inactive 12m rate

The share of positive labels in the evaluation slice. It tells you how much true inactivity pressure is present in the held-out set.

F1

The balance between precision and recall once a classification threshold is chosen.

Precision

Of the dependencies flagged as risky, the share that truly are risky.

Recall

Of the truly risky dependencies, how many the model successfully catches.
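
Precision, recall, and their balance (commonly the F1 score, the harmonic mean of the two) all depend on the chosen threshold. A minimal sketch follows, assuming a default threshold of 0.5, which this page does not specify; the helper name is hypothetical.

```python
def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Return (precision, recall, f1) after thresholding probabilities."""
    pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, yhat in zip(y_true, pred) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, pred) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, pred) if y == 1 and yhat == 0)
    # Fall back to 0.0 when a denominator is empty instead of dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```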

Calibration

Whether predicted probabilities match real observed rates. Good calibration makes a score easier to trust in triage.
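
A reliability curve is typically built by grouping predictions into probability bins and comparing the mean predicted probability with the observed positive rate in each bin. The sketch below assumes equal-width bins; the `reliability_bins` helper, its field names, and the bin count are illustrative rather than the format of the cached artifact.

```python
def reliability_bins(y_true, y_prob, n_bins=10):
    """Group predictions into equal-width bins and summarize each non-empty bin."""
    # Each accumulator holds: [count, sum of probabilities, sum of positives].
    acc = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        acc[idx][0] += 1
        acc[idx][1] += p
        acc[idx][2] += y
    return [
        {
            "bin": i,
            "count": count,
            "mean_predicted": prob_sum / count,
            "observed_rate": pos_sum / count,
        }
        for i, (count, prob_sum, pos_sum) in enumerate(acc)
        if count > 0
    ]


bins = reliability_bins([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9], n_bins=2)
```

A well-calibrated model has `mean_predicted` close to `observed_rate` in every bin, which places the points of the curve near the diagonal.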

Metric history

No cached run history yet

Once you have more than one cached training run with metrics, the run-history trend chart will appear here.