Data Formats and Representations

Overview

A connectomics dataset is not one thing — it is a family of representations at different levels of abstraction. Raw images, segmentation volumes, surface meshes, morphological skeletons, and connectivity graphs each capture different aspects of the same underlying biology. Choosing the right representation for a given task is a core technical skill, because each format has characteristic strengths, blind spots, and computational costs.

The representation hierarchy

Raw EM images (voxels)
    ↓ segmentation
Labeled volumes (voxel → segment ID)
    ↓ surface extraction
Meshes (triangulated surfaces)
    ↓ skeletonization
Skeletons (tree graphs with spatial coordinates)
    ↓ synapse assignment
Connectome graph (neurons as nodes, synapses as edges)

Each arrow is an information-reducing transformation. You gain computational efficiency and analytical clarity, but you lose spatial detail. The key question is: what information do you need for your analysis, and what is the cheapest representation that preserves it?

Volumetric data

What it is

The most fundamental representation: a 3D array of voxel intensities (raw images) or voxel labels (segmentation). Every spatial position has a value.

Formats

Format	Description	Typical use
Neuroglancer precomputed	Chunked, multiscale image pyramid served over HTTP	Web-based browsing (Neuroglancer, Spelunker)
N5	Chunked, compressed, hierarchical format (Java/Python)	Pipeline intermediate storage
Zarr	Python-native chunked array format, cloud-friendly	Analysis, cloud storage (S3, GCS)
HDF5	Hierarchical Data Format, self-describing	Legacy, local analysis
TIFF stacks	Uncompressed or LZW-compressed image stacks	Raw microscope output, small datasets

Key properties

Chunking: Large volumes are divided into chunks (e.g., 128³ or 256³ voxels). Chunks are the unit of I/O — you load one chunk at a time, not the whole volume. Chunk size affects performance: larger chunks = fewer I/O operations but more wasted bandwidth if you only need a small region.
Multi-resolution pyramids: Store the same volume at multiple resolutions (full res, 2× downsampled, 4×, 8×…). Enables efficient browsing — you see the overview at low resolution and zoom into high resolution on demand.
Compression: Typical compression ratios of 2-10× for EM data (depending on algorithm: gzip, lz4, zstd, JPEG for lossy). Segmentation volumes compress much better than raw images (large uniform regions).

When to use volumetric data

Raw image inspection and quality control
Running segmentation or synapse detection models (need voxel-level input)
Proofreading (need to see images + segmentation overlay)
Any analysis requiring spatial context that meshes or skeletons don’t preserve

Limitations

Storage: A 1 mm³ volume at 4 nm resolution is ~10^13 voxels, ~10 TB at 8-bit. With segmentation (32-bit or 64-bit labels), double or quadruple that.
Query efficiency: “Which neurons are within 10 μm of this synapse?” requires scanning voxels unless you also maintain a spatial index.

Surface meshes

What they are

Triangulated surfaces that represent the boundary of each segmented object. Each mesh is a set of vertices (3D points) and faces (triangles connecting vertices).

How they’re generated

Marching cubes algorithm (or variants) applied to the segmentation volume. For each segment, extract the isosurface at the boundary between that segment and its neighbors. The result is a watertight mesh (ideally).

Formats

Format	Description
OBJ	Simple text format, widely supported
PLY	Binary or text, supports vertex attributes (colors)
STL	Binary triangle format, common in 3D printing
Neuroglancer mesh	Chunked, multi-resolution mesh format for web rendering
DRACO	Google’s compressed mesh format, used in Neuroglancer

Key properties

Level of detail (LOD): Store meshes at multiple simplification levels. Full-resolution meshes for a large neuron can have millions of triangles — impractical for real-time rendering. Decimated meshes (10K-100K triangles) are used for overview visualization.
Vertex attributes: Meshes can carry per-vertex data (e.g., distance from soma, local curvature, synapse density) for visualization and analysis.

When to use meshes

3D visualization of neuron morphology
Surface area and volume measurements
Spine detection (local curvature analysis on dendritic surfaces)
Spatial proximity analysis between neurons
Proofreading — 3D mesh view reveals impossible morphology (merge errors) that is hard to see in 2D slices

Limitations

Lose internal structure (organelle distributions, cytoplasmic features)
Mesh topology errors (self-intersections, holes) can arise from noisy segmentation boundaries
Large storage for complex neurons (a single pyramidal cell mesh can be >100 MB at full resolution)

Skeletons

What they are

Tree-graph representations of neuron morphology. Each skeleton is a set of nodes (3D coordinates along the neurite centerline) connected by edges (parent-child relationships). The root is typically the soma, and branches represent dendrites and axons.

How they’re generated

From volumes: Thinning/skeletonization algorithms (e.g., TEASAR — Sato et al. 2000) reduce the volumetric segment to its medial axis.
From meshes: Contract the mesh surface to extract the centerline.
Manual tracing: Historically, skeletons were traced manually in tools like CATMAID.

Formats

Format	Description
SWC	Standard text format for neuron morphologies. Each line: ID, type, x, y, z, radius, parent_ID. Widely supported by morphology tools (NeuroM, Neurolucida, NEURON simulator).
JSON skeleton	Used by Neuroglancer and CloudVolume
CATMAID skeleton	Database-backed skeleton with annotations

Key properties

Compact: A neuron that occupies millions of voxels in volumetric form is represented by thousands of skeleton nodes (~KB vs GB).
Topologically explicit: Branch points, terminal points, and path lengths are directly readable.
Morphometric analysis: Cable length, branch order, Strahler number, bifurcation angles, tortuosity — all computed directly from skeletons.
Radius information: SWC format includes radius at each node, preserving approximate process caliber.

When to use skeletons

Morphological analysis (total cable length, branch complexity, Sholl analysis)
Cell-type classification based on morphology
Path-length measurements between synapses
Input to biophysical simulation (NEURON, Brian)
Efficient error detection (skeleton shows impossible topology)

Limitations

Lose surface geometry: Spine morphology, surface area, local curvature not captured
Lose volume information: Can’t compute volume-based measurements
Skeletonization errors: Thin processes may be skipped, branch points mislocated, spurious branches created from noisy segmentation
Radius approximation: SWC radius is a single value per node (circular cross-section assumption), which doesn’t capture irregular shapes

Connectome graphs

What they are

The highest-level representation: neurons as nodes, synaptic connections as edges. This is the “connectome” — the wiring diagram.

How they’re constructed

Each segmented neuron = one node
Each detected synapse → identify pre-synaptic and post-synaptic segments → create directed edge from pre to post
Aggregate: multiple synapses between the same pair → edge weight = synapse count (or sum of cleft areas)

Formats

Format	Description
Edge list (CSV/TSV)	Simple: pre_id, post_id, weight, synapse_count
Adjacency matrix (NumPy/sparse)	N×N matrix, good for linear algebra
GraphML / GEXF	XML-based, supports node/edge attributes
NetworkX pickle	Python-native, good for analysis
Neo4j / graph database	Queryable graph store for large connectomes

Node attributes

Cell type (morphological or transcriptomic classification)
Soma position (x, y, z)
Morphological features (cable length, spine density, arbor volume)
Functional properties (tuning curves from calcium imaging, if available)

Edge attributes

Synapse count
Total cleft area or PSD area
Synapse type (excitatory/inhibitory)
Spatial locations of individual synapses
Confidence score

When to use graphs

Connectivity analysis (degree distributions, clustering, motifs)
Circuit identification (find all neurons in a pathway)
Comparison across datasets or conditions
Input to network models (spiking simulations, dynamical systems)

Limitations

Lose all spatial information (unless node/edge positions are stored as attributes)
Lose morphological detail — a graph edge between two neurons doesn’t tell you whether the synapse is on a proximal dendrite or a distal spine
Thresholding dependence — decisions about minimum synapse count for an “edge” dramatically affect graph structure
Error amplification — segmentation and synapse detection errors both corrupt the graph

Worked example: choosing a representation

Question: “Do inhibitory interneurons preferentially target the perisomatic region of pyramidal cells in layer 2/3?”

Analysis needs:

Identify inhibitory and excitatory neurons → need cell-type labels (graph node attributes)
Find synapses between inhibitory → pyramidal pairs → need connectome graph edges
Determine synapse location on the pyramidal cell (perisomatic vs distal dendrite) → need synapse spatial coordinates mapped onto the pyramidal cell morphology

Representation choice: This question requires the connectome graph (for connectivity) plus skeletons (for distance-from-soma measurement at each synapse location). Neither the graph alone (no spatial synapse info) nor the volume alone (too expensive for the network-level query) would suffice.

References

Dorkenwald S et al. (2022) “CAVE: Connectome Annotation Versioning Engine.” bioRxiv.
Sato M et al. (2000) “TEASAR: Tree-structure extraction algorithm for accurate and robust skeletons.” Proc. Pacific Conference on Computer Graphics and Applications.
Rubinov M, Sporns O (2010) “Complex network measures of brain connectivity: Uses and interpretations.” NeuroImage 52(3):1059-1069.
Scheffer LK et al. (2020) “A connectome and analysis of the adult Drosophila central brain.” eLife 9:e57443.