Crystalia Data Model
Crystalia is a framework for constructing knowledge graphs that describe large datasets using agent-driven, probabilistic document comparison. This LinkML schema defines the data model at the heart of the framework, capturing how files and datasets are annotated with multi-level descriptors for fast, parallel integrity verification and comparison.
Purpose
Parallel, Distributed Dataset Comparison
Traditional file hashing is sequential: producing a digest requires reading the entire file from start to finish. Crystalia descriptors can be constructed in a parallel, distributed fashion, so very large datasets can be compared without a single sequential read becoming the bottleneck. The framework supports multiple signature methods, from cryptographic hashes to lightweight checksums, each with an explicit robustness classification.
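As a rough illustration of why per-range descriptors parallelize, the sketch below hashes fixed-size byte ranges independently and in parallel. It is not the Crystalia Collector's implementation: the function names, the chunk size, and the use of in-memory bytes with a thread pool are assumptions made for the example. Only the field names (offset, length, coverage, value) come from the schema.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def describe_chunk(data: bytes, offset: int, length: int) -> dict:
    """Hash one byte range; no state is shared with other ranges."""
    chunk = data[offset:offset + length]
    return {
        "offset": offset,
        "length": len(chunk),                # the last chunk may be shorter
        "coverage": len(chunk) / len(data),  # fraction of the item covered
        "value": hashlib.sha256(chunk).hexdigest(),
    }


def describe_parallel(data: bytes, chunk_size: int, workers: int = 4) -> list:
    """Build one descriptor per chunk (hypothetical helper, not a Crystalia API)."""
    offsets = range(0, len(data), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so descriptors come back sorted by offset.
        return list(pool.map(lambda off: describe_chunk(data, off, chunk_size), offsets))


if __name__ == "__main__":
    for d in describe_parallel(b"abcdefghij", chunk_size=4):
        print(d["offset"], d["length"], d["value"][:12])
```

Because each range is hashed on its own, the same work could be split across processes or machines, with each worker reading only its own byte range from storage.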
Knowledge Graph Interoperability
By defining the data model in LinkML, Crystalia generates artifacts for a wide variety of tools: Pydantic models for collectors and reasoners, Protobuf for high-performance pipelines, OWL for semantic reasoning, TypeScript for web applications, and SHACL for RDF validation. One schema drives them all.
Beyond Checksums and Catalogs
Unlike DCAT (dataset metadata catalogs) or BagIt (archival manifest checksums), Crystalia provides multi-level descriptor granularity, parallel computation, and a robustness classification that lets agents reason about the reliability of different comparison strategies.
Schema Overview
The core schema is defined in linkml_crystalia.yaml.
| Element | Count |
|---|---|
| Classes | 6 |
| Slots | 13 |
| Enums | 1 |
The ERD of the schema structure shows the classes and their relationships. For a complete reference, see the Data Dictionary.
Core Entities
| Class | Description |
|---|---|
| Item | An individual item (e.g. a file) that can be described by descriptors |
| Descriptor | A signature for the whole or part of an item, with offset, length, and coverage |
| DescriptorType | Configuration for a kind of descriptor, linking it to a generation method |
| Method | A method used to generate descriptors (e.g. SHA-256, CRC-32) with a robustness rating |
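To make the four entities concrete, here is a minimal sketch of them as plain Python dataclasses. It is not the generated Pydantic module: the slot names id, label, value, offset, length, coverage, hasDescriptor, hasType, usesMethod, and isPartOf appear in this page, while the robustness attribute name, the Robustness enum, and its members are assumptions for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Robustness(Enum):
    """Hypothetical members; the schema's actual enum values are not listed here."""
    CRYPTOGRAPHIC = "cryptographic"
    CHECKSUM = "checksum"


@dataclass
class Method:
    id: str
    label: str
    robustness: Robustness  # assumed slot name


@dataclass
class DescriptorType:
    id: str
    usesMethod: Method


@dataclass
class Descriptor:
    id: str
    hasType: DescriptorType
    value: str
    offset: int
    length: int
    coverage: float


@dataclass
class Item:
    id: str
    label: str
    isPartOf: Optional[str] = None
    hasDescriptor: List[Descriptor] = field(default_factory=list)


if __name__ == "__main__":
    sha256 = Method("crys:method-sha256", "SHA-256", Robustness.CRYPTOGRAPHIC)
    dtype = DescriptorType("crys:dtype-sha256sum", usesMethod=sha256)
    desc = Descriptor("crys:desc-sha256-full-001", dtype, "00", 0, 10, 1.0)
    item = Item("cryd:sample-image-001", "crystal_specimen_2024.tiff", hasDescriptor=[desc])
    print(item.hasDescriptor[0].hasType.usesMethod.robustness.value)
```

The identifier crys:method-sha256 is made up for the example; the other identifiers are taken from the sample on this page.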
Relationships
- Annotation: Items have descriptors (hasDescriptor), and descriptors can annotate other descriptors
- Typing: Each descriptor has a type (hasType) that defines how it was generated
- Method: Each descriptor type uses a specific method (usesMethod) with a robustness classification
- Containment: Items can be part of larger datasets (isPartOf)
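The relationships above can be walked to decide how much a match between two items is worth. The sketch below follows hasType and usesMethod from each descriptor to a robustness rating, then compares descriptors that cover the same byte range with the same type. The robustness labels, their ranking, the lookup tables, and every identifier except crys:dtype-sha256sum are hypothetical; this is one possible comparison strategy, not the one Crystalia's reasoners use.

```python
from typing import Optional

# Hypothetical ranking: a higher number means a match is stronger evidence.
ROBUSTNESS_RANK = {"checksum": 1, "cryptographic": 2}

# Hypothetical registries keyed by id.
METHODS = {
    "crys:method-sha256": {"robustness": "cryptographic"},
    "crys:method-crc32": {"robustness": "checksum"},
}
DESCRIPTOR_TYPES = {
    "crys:dtype-sha256sum": {"usesMethod": "crys:method-sha256"},
    "crys:dtype-crc32": {"usesMethod": "crys:method-crc32"},
}


def robustness_of(descriptor: dict) -> str:
    """Follow hasType -> usesMethod to the method's robustness rating."""
    dtype = DESCRIPTOR_TYPES[descriptor["hasType"]]
    return METHODS[dtype["usesMethod"]]["robustness"]


def _range_key(descriptor: dict) -> tuple:
    """Descriptors are comparable only for the same type and byte range."""
    return (descriptor["hasType"], descriptor["offset"], descriptor["length"])


def compare_items(a: dict, b: dict) -> dict:
    """Collect the robustness of each matching and mismatching descriptor pair."""
    b_index = {_range_key(d): d for d in b["hasDescriptor"]}
    result = {"match": [], "mismatch": []}
    for d in a["hasDescriptor"]:
        other = b_index.get(_range_key(d))
        if other is None:
            continue  # no comparable descriptor on the other item
        bucket = "match" if d["value"] == other["value"] else "mismatch"
        result[bucket].append(robustness_of(d))
    return result


def strongest(levels: list) -> Optional[str]:
    """Return the most robust rating in the list, or None if it is empty."""
    return max(levels, key=ROBUSTNESS_RANK.__getitem__, default=None)
```

A caller might treat any mismatch as proof that the items differ, and use strongest(result["match"]) to judge how far to trust an apparent match.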
Example: SHA-256 Descriptor on a File
id: cryd:sample-image-001
label: crystal_specimen_2024.tiff
isPartOf: cryd:dataset-mineral-survey
hasDescriptor:
  - id: crys:desc-sha256-full-001
    hasType: crys:dtype-sha256sum
    value: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    offset: 0
    length: 1048576
    coverage: 1.0
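A descriptor like the one above can be checked by re-hashing the byte range it names. The following is a minimal sketch that assumes the descriptor has already been loaded into a dict, that its type is SHA-256, and that the item's bytes are in memory. The function names are hypothetical, not part of any Crystalia tool.

```python
import hashlib


def verify_descriptor(data: bytes, descriptor: dict) -> bool:
    """Recompute SHA-256 over [offset, offset + length) and compare to value."""
    start = descriptor["offset"]
    end = start + descriptor["length"]
    if end > len(data):
        return False  # the descriptor refers to bytes the item does not have
    return hashlib.sha256(data[start:end]).hexdigest() == descriptor["value"]


def expected_coverage(data: bytes, descriptor: dict) -> float:
    """Fraction of the item the descriptor covers (1.0 means the whole item)."""
    return descriptor["length"] / len(data)


if __name__ == "__main__":
    payload = bytes(1048576)  # 1 MiB of zero bytes as stand-in content
    descriptor = {
        "offset": 0,
        "length": len(payload),
        "value": hashlib.sha256(payload).hexdigest(),
    }
    print(verify_descriptor(payload, descriptor))              # intact copy
    print(verify_descriptor(b"\x01" + payload[1:], descriptor))  # one byte changed
```

Note that the value shown in the YAML example is the SHA-256 digest of empty input, so it is best read as a placeholder rather than the digest of an actual 1 MiB file.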
For a complete reference, see the Data Dictionary.
Generated Artifacts
The following artifacts are generated from the schema and available in the project/ directory:
| Artifact | File | Description |
|---|---|---|
| OWL | linkml_crystalia.owl.ttl | OWL ontology in Turtle format |
| Pydantic | linkml_crystalia.py | Python Pydantic v2 models |
| Protobuf | linkml_crystalia.proto | Protocol Buffers schema |
| JSON Schema | linkml_crystalia.schema.json | JSON Schema for data validation |
| JSON-LD | linkml_crystalia.context.jsonld | JSON-LD context for linked data |
| SHACL | linkml_crystalia.shacl.ttl | SHACL shapes for RDF validation |
| TypeScript | linkml_crystalia.ts | TypeScript type definitions |
Companion Tools
The Crystalia ecosystem includes collectors, reasoners, and browsers that consume this data model:
- Crystalia Collector: generates descriptors from files and datasets
Additional tools are under active development.
Data Dictionary
The Data Dictionary provides the complete reference for all classes, slots, enumerations, and type definitions in the schema, including class diagrams and ERD visualizations.
Quick Start
# Clone the repository
git clone https://github.com/vladistan/crystalia-data-model.git
cd crystalia-data-model
# Install dependencies
uv sync --group dev
# Generate all artifacts from the schema
just gen-project
# Serve the documentation locally
just serve-docs
# Run tests
just test
Citation
If you use Crystalia in your research, please cite:
V. Korolev and A. Joshi, "Crystalia: Flexible and Efficient Method for Large Dataset Lineage Tracking," 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 2024, pp. 3431-3440, doi: 10.1109/bigdata62323.2024.10826067.
@inproceedings{korolev2024crystalia,
  author = {Korolev, Vladimir and Joshi, Anupam},
  title = {Crystalia: Flexible and Efficient Method for Large Dataset Lineage Tracking},
  booktitle = {2024 IEEE International Conference on Big Data (BigData)},
  year = {2024},
  pages = {3431--3440},
  doi = {10.1109/bigdata62323.2024.10826067},
  address = {Washington, DC, USA}
}
Source
- Repository: github.com/vladistan/crystalia-data-model
- Documentation: crystalia.link
- Schema: src/crystalia_data_model/schema/linkml_crystalia.yaml