Metrics and Quality Assurance for Connectome Proofreading

Instructor Notes

This document is a standalone instructor script. It provides the full mathematical framework, intuitive explanations, worked numerical examples, and practical guidance on designing QA systems. The math is presented at a level accessible to students with basic probability and information theory background; provide additional scaffolding for younger or less mathematical audiences.


1. Why Metrics Matter

1.1 The Problem with Subjective Quality

Without quantitative measures, proofreading quality is a matter of opinion. “It looks pretty good” is not a publishable quality statement. Metrics enable objective comparison between methods and over time, defensible quality claims in publications, and principled decisions about where to allocate proofreading effort.

1.2 No Single Metric Is Sufficient

Each metric captures a different aspect of quality. A segmentation can score well on one metric and poorly on another. Understanding what each metric measures – and what it misses – is essential.


2. Variation of Information (VI)

2.1 Definition

Variation of Information is an information-theoretic measure of the distance between two clusterings (segmentations). Given a predicted segmentation S and a ground-truth segmentation T over the same set of voxels:

VI(S, T) = H(S|T) + H(T|S)

where H(S|T) is the conditional entropy of S given T, and H(T|S) is the conditional entropy of T given S.

2.2 Intuition

VI measures the information lost and gained in moving between the two segmentations. H(S|T) grows when a ground-truth segment is split across multiple predicted segments (split errors); H(T|S) grows when a single predicted segment spans multiple ground-truth segments (merge errors). VI is zero exactly when the two segmentations agree.

2.3 Mathematical Detail

Let N be the total number of voxels. Let p_i = |S_i| / N be the fraction
of voxels in predicted segment i, and q_j = |T_j| / N be the fraction in
ground-truth segment j. Let r_ij = |S_i ∩ T_j| / N be the fraction in both.
H(S|T) = - sum_{i,j} r_ij * log(r_ij / q_j)
H(T|S) = - sum_{i,j} r_ij * log(r_ij / p_i)
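
The two conditional entropies can be computed directly from the overlap matrix. A minimal Python/NumPy sketch, using an illustrative 3×3 overlap matrix (the function name and data layout are illustrative, not from any standard library):

```python
import numpy as np

def variation_of_information(overlap):
    """VI and its components from an overlap matrix.

    overlap[i, j] = |S_i intersect T_j| / N, i.e. the fraction of
    voxels in predicted segment i AND ground-truth segment j
    (all entries sum to 1). Logarithms are base 2, so VI is in bits.
    """
    p = overlap.sum(axis=1)              # p_i: predicted marginals
    q = overlap.sum(axis=0)              # q_j: ground-truth marginals
    i_idx, j_idx = np.nonzero(overlap)   # skip r_ij = 0 terms
    r = overlap[i_idx, j_idx]
    h_split = -np.sum(r * np.log2(r / q[j_idx]))  # H(S|T): split errors
    h_merge = -np.sum(r * np.log2(r / p[i_idx]))  # H(T|S): merge errors
    return h_split + h_merge, h_split, h_merge

# Example: 3 predicted segments (rows) vs 3 ground-truth neurons (columns)
overlap = np.array([[0.40, 0.10, 0.00],
                    [0.00, 0.20, 0.00],
                    [0.00, 0.00, 0.30]])
vi, h_split, h_merge = variation_of_information(overlap)
```

Reporting the split and merge components separately is usually more informative than the single VI number, since it tells you which error type dominates.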

2.4 Properties

2.5 Limitations

2.6 Instructor Tip

Present VI as the “gold standard” metric for segmentation benchmarks (used in CREMI, SNEMI3D challenges) but explain that its biological interpretability is limited. Students should be able to compute it and interpret which component (split vs. merge) is dominant, but should not rely on it alone.


3. Expected Run Length (ERL)

3.1 Definition

Expected Run Length is the average distance (in micrometers) that you can trace along a ground-truth neurite before encountering a topological error (merge or split) in the predicted segmentation. Introduced by Funke et al. (2017).

3.2 Computation

  1. Take the ground-truth skeleton of each neuron.
  2. Sample random paths along the skeleton (e.g., random walks from random starting nodes).
  3. For each path, walk along it and check the predicted segmentation at each node:
    • If the predicted label changes (but the ground-truth neuron has not ended), you have hit a split error. Record the distance traveled.
    • If a different ground-truth neuron shares the same predicted label, you have detected a merge error. Record the distance traveled.
  4. ERL is the average of all recorded distances, weighted by path length.
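
The steps above can be sketched in simplified form. This toy version detects split errors only (merge detection requires checking whether other neurons' skeletons share a predicted label, which is omitted here) and weights runs by their length, so the result approximates the run length seen from a uniformly random starting point. Names and data layout are illustrative assumptions:

```python
def expected_run_length(skeletons):
    """Simplified ERL over linear skeletons, split errors only.

    skeletons: list of neurons, each a list of (edge_length_um,
    predicted_label) pairs in traversal order along the skeleton.
    A run ends wherever the predicted label changes.
    """
    runs = []
    for edges in skeletons:
        run_length = 0.0
        prev_label = None
        for length, label in edges:
            if prev_label is not None and label != prev_label:
                runs.append(run_length)   # split error: close the run
                run_length = 0.0
            run_length += length
            prev_label = label
        runs.append(run_length)           # final run of this neuron
    total = sum(runs)
    # Length-weighted mean: a random point lands in a run with
    # probability proportional to that run's length.
    return sum(r * r for r in runs) / total

# Toy example: one intact neuron, one neuron split into two runs
erl = expected_run_length([[(80.0, "S1")],
                           [(10.0, "S1"), (50.0, "S2")]])
```

Production implementations also terminate runs at merge errors and handle branching skeleton topology.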

3.3 Intuition

ERL answers the practical question: “If I pick a random point on a random neuron and start tracing, how far can I go before the segmentation misleads me?”

3.4 Properties

3.5 Limitations


4. Edge Precision and Recall

4.1 Definition

Treat the connectome as a directed graph where each edge represents a synaptic connection from neuron A to neuron B. Compare the predicted graph to the ground-truth graph:

  • True positive (TP): an edge present in both graphs.
  • False positive (FP): an edge present in the predicted graph but not in the ground truth.
  • False negative (FN): an edge present in the ground truth but missing from the prediction.

Then:

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2 * Precision * Recall / (Precision + Recall)
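
A minimal sketch, treating each connectome as a set of directed (pre, post) neuron pairs (the function name is illustrative):

```python
def edge_precision_recall(pred_edges, true_edges):
    """Precision, recall, and F1 over directed connectivity edges.

    Edges are (pre_neuron_id, post_neuron_id) tuples.
    """
    pred, true = set(pred_edges), set(true_edges)
    tp = len(pred & true)   # edges in both graphs
    fp = len(pred - true)   # predicted edges absent from ground truth
    fn = len(true - pred)   # ground-truth edges missed by prediction
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one spurious edge and one missing edge
p, r, f1 = edge_precision_recall(
    pred_edges=[(1, 2), (2, 3), (1, 4)],
    true_edges=[(1, 2), (2, 3), (3, 4)])
```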

4.2 Intuition

4.3 Relationship to Error Types

  • Merge errors connect unrelated neurons, creating spurious edges (false positives) that lower precision.
  • Split errors detach neurites and their synapses from the parent neuron, deleting true edges (false negatives) and lowering recall.

This direct mapping to error types makes edge metrics highly actionable. Cite Schneider-Mizell et al. (2016) for the framework.

4.4 Limitations


5. Synapse-Centric Precision and Recall

5.1 Definition

Similar to edge metrics but evaluated at the individual synapse level. For each synapse in the ground truth:

  • check whether a corresponding synapse was detected in the prediction, and
  • check whether its pre- and post-synaptic partners fall in the correct predicted segments.

A ground-truth synapse passing both checks is a true positive; a missed or misassigned synapse is a false negative; a predicted synapse with no ground-truth counterpart is a false positive. Precision, recall, and F1 follow as in Section 4.

5.2 Why Synapse-Level Matters

Consider two neurons, A and B, connected by 5 synapses. If a boundary error shifts one synapse from B to a neighboring neuron C:

  • the A-to-B edge still exists (4 of its 5 synapses remain), so edge-level metrics barely change;
  • a spurious A-to-C edge appears;
  • the A-to-B connection strength is now wrong (4 instead of 5).

Synapse-level metrics are more granular and capture errors like these that edge-level metrics miss or underweight.
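
The scenario above can be made concrete. A minimal sketch under the simplifying assumption that synapses are compared by their (pre, post) neuron assignments alone; real pipelines first match predicted to ground-truth synapses spatially, and the function name is illustrative:

```python
from collections import Counter

def synapse_precision_recall(pred_synapses, true_synapses):
    """Precision/recall over synapses, compared by (pre, post)
    neuron assignment and counted with multiplicity."""
    pred = Counter(pred_synapses)
    true = Counter(true_synapses)
    matched = sum(min(n, pred[s]) for s, n in true.items())
    return matched / len(pred_synapses), matched / len(true_synapses)

# 5 true A->B synapses; one is misassigned to neuron C
precision, recall = synapse_precision_recall(
    pred_synapses=[("A", "B")] * 4 + [("A", "C")],
    true_synapses=[("A", "B")] * 5)
```

At the edge level the A-to-B connection survives, but synapse-level recall drops to 0.8 and a spurious A-to-C synapse counts against precision.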

5.3 The Metric Most Relevant to Connectomics

For most connectome analyses – computing connection strengths, identifying motifs, modeling circuit function – synapse-level accuracy is the ultimate measure of quality. If every synapse is correctly assigned, the connectome is correct regardless of any morphological imperfections.


6. Completeness Metrics

6.1 Neuron Completeness

What fraction of neurons in the volume are fully reconstructed (no split errors, no merge errors, correct morphology)?

6.2 Volume Coverage

What fraction of the total volume has been proofread?

6.3 Segment Size Distribution

Compare the size distribution of segments before and after proofreading: many tiny fragments indicate unresolved split errors, while a few abnormally large segments indicate unresolved merge errors. A shift toward fewer extreme outliers (both small and large) indicates effective proofreading.
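
A minimal sketch of such a before/after comparison; the 1,000-voxel fragment threshold is an illustrative choice, not a standard:

```python
import numpy as np

def size_distribution_summary(segment_sizes, tiny_threshold=1000):
    """Summarize a segment-size distribution (sizes in voxels).

    Returns (fraction of segments that are tiny fragments, fraction
    of total volume in the single largest segment). Many tiny
    fragments suggest split errors; an outsized largest segment
    suggests a catastrophic merge.
    """
    sizes = np.asarray(segment_sizes, dtype=float)
    tiny_fraction = float(np.mean(sizes < tiny_threshold))
    largest_fraction = float(sizes.max() / sizes.sum())
    return tiny_fraction, largest_fraction

before = size_distribution_summary([50, 200, 800, 4000, 95_000])
after = size_distribution_summary([30_000, 35_000, 35_000])
```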


7. Dashboard Design

7.1 What a Proofreading QA Dashboard Should Show

A well-designed dashboard enables supervisors and proofreaders to monitor quality in real time. Essential components:

Per-region metrics panel:

Temporal trends panel:

Annotator performance panel:

Cost tracking panel:

7.2 Instructor Tip

Show students an example dashboard (even a mockup) and ask them to interpret it. “Region A has VI_merge = 0.01 but VI_split = 0.08. Region B has VI_merge = 0.06 and VI_split = 0.02. Which region needs more merge fixes? Which needs more split fixes? Where would you allocate proofreading effort?”


8. Worked Example: Computing VI and ERL on a Small Example

8.1 Setup

Consider a tiny volume with 100 voxels and 3 ground-truth neurons:

  • T1: 40 voxels
  • T2: 30 voxels
  • T3: 30 voxels

The predicted segmentation has 3 segments:

  • S1: 50 voxels – all 40 voxels of T1 plus 10 voxels of T2 (a merge error)
  • S2: 20 voxels – the remaining voxels of T2 (T2 is split)
  • S3: 30 voxels – exactly T3

8.2 Computing VI

First, compute the overlap matrix r_ij = |S_i ∩ T_j| / N:

        T1    T2    T3
  S1  0.40  0.10  0.00
  S2  0.00  0.20  0.00
  S3  0.00  0.00  0.30

Marginals: p1=0.50, p2=0.20, p3=0.30; q1=0.40, q2=0.30, q3=0.30.

H(T|S) (merge component):
= -[0.40 * log2(0.40/0.50) + 0.10 * log2(0.10/0.50) + 0.20 * log2(0.20/0.20) + 0.30 * log2(0.30/0.30)]
= -[0.40 * (-0.322) + 0.10 * (-2.322) + 0 + 0]
= 0.129 + 0.232
= 0.361 bits

This is nonzero because S1 contains voxels from both T1 and T2 (a merge).

H(S|T) (split component):
= -[0.40 * log2(0.40/0.40) + 0.10 * log2(0.10/0.30) + 0.20 * log2(0.20/0.30) + 0.30 * log2(0.30/0.30)]
= -[0 + 0.10 * (-1.585) + 0.20 * (-0.585) + 0]
= 0.158 + 0.117
= 0.276 bits

This is nonzero because T2 is split across S1 and S2.

VI = 0.361 + 0.276 = 0.637 bits.

Interpretation: the merge component (0.361) is larger than the split component (0.276), indicating that merge errors are the more serious problem in this example.

8.3 Computing ERL (Simplified)

Suppose the ground-truth skeletons have these path lengths:

  • T1: 80 um
  • T2: 60 um (10 um within S1, 50 um within S2)
  • T3: 50 um

For T1: S1 contains all of T1, and tracing along T1’s skeleton the predicted label is S1 the entire way. Although the merge error means S1 also contains part of T2, those T2 voxels do not lie on T1’s skeleton path, so no error is detected from T1’s perspective. ERL contribution from T1: a single run of 80 um.

For T2: the first 10 um of T2’s skeleton (the portion in S1) has label S1. Then the remaining 50 um (in S2) has label S2. There is a split error at the 10 um mark. So ERL contribution from T2: two runs of 10 um and 50 um. However, the 10 um portion in S1 also constitutes a merge error (S1 contains T1 voxels too), so this run is terminated by both a split and a merge.

For T3: S3 = T3 exactly. ERL contribution = 50 um.

Unweighted average of the four recorded runs: (80 + 10 + 50 + 50) / 4 = 47.5 um. (The standard definition weights runs by their length, and the precise value depends on the sampling method, but the key point is that T2’s fragmentation reduces the average.)

Simplified ERL estimate: approximately 47 um (indicating that on average you can trace ~47 um before hitting an error).

8.4 Instructor Tip

Walk through this computation on a whiteboard. The numbers are small enough to compute by hand. The key takeaway: VI told us merge > split, while ERL told us that the practical tracing impact is moderate (47 um). Both are useful; neither tells the whole story.


9. When Metrics Disagree

9.1 Good VI, Bad ERL

This happens when errors are few but strategically placed – e.g., a single split in the middle of a long axon. VI sees one small error on a volumetric basis (tiny fraction of voxels affected), but ERL sees a neuron cut in half (every trace along that axon hits the split).

Which to trust: If your question is about tracing or morphology, trust ERL. If your question is about overall volumetric accuracy, trust VI.

9.2 Good ERL, Bad VI

This happens when there are many small boundary errors that shift segment borders by a few voxels each. ERL does not detect these because the skeleton stays within the correct segment, but VI accumulates the voxel misassignments across the entire volume.

Which to trust: If your question is about synapse assignment or fine morphology, the boundary errors captured by VI matter. If your question is about connectivity topology, ERL is more relevant.

9.3 Good Voxel Metrics, Bad Edge Metrics

This happens when the segmentation is volumetrically accurate (low VI) and topologically sound (high ERL), but synapse detection or assignment is poor. The segments are correct, but the connections between them are not.

Lesson: Always report both segmentation metrics (VI, ERL) and connectivity metrics (edge F1, synapse precision/recall). They measure different things.


10. Setting Quality Standards

10.1 Published Benchmarks

Dataset                   VI (bits)   ERL (um)   Edge F1   Reference
CREMI challenge (best)    ~0.10       ~150       N/A       CREMI leaderboard
FlyWire (proofread)       ~0.05       >200       ~0.85     Dorkenwald et al. (2024)
MICrONS (proofread)       ~0.08       ~120       ~0.80     MICrONS Consortium (2021)
Hemibrain (proofread)     ~0.06       ~180       ~0.82     Scheffer et al. (2020)

These numbers are approximate and depend on the evaluation region, ground truth quality, and computation details. They provide rough targets for new projects.

10.2 Setting Your Own Targets

Targets should be driven by the scientific question: morphology and long-range tracing studies demand high ERL; connectivity and circuit-motif analyses demand high edge and synapse F1; volumetric analyses demand low VI. Set targets on the metrics your downstream analysis actually depends on, not on a single headline number.


11. References

CREMI: MICCAI Challenge on Circuit Reconstruction from Electron Microscopy Images. https://cremi.org
Dorkenwald, S., et al. (2024). Neuronal wiring diagram of an adult brain. Nature.
Funke, J., et al. (2017). Large scale image segmentation with structured loss based deep learning for connectome reconstruction. arXiv preprint; published in IEEE TPAMI (2018).
MICrONS Consortium (2021). Functional connectomics spanning multiple areas of mouse visual cortex. bioRxiv.
Scheffer, L. K., et al. (2020). A connectome and analysis of the adult Drosophila central brain. eLife.
Schneider-Mizell, C. M., et al. (2016). Quantitative neuroanatomy for connectomics in Drosophila. eLife.


End of instructor script: Metrics and Quality Assurance for Connectome Proofreading