Crystalia Data Model

Crystalia is a framework for constructing knowledge graphs that describe large datasets using agent-driven, probabilistic document comparison. This LinkML schema defines the data model at the heart of the framework, capturing how files and datasets are annotated with multi-level descriptors for fast, parallel integrity verification and comparison.

Purpose

Parallel, Distributed Dataset Comparison

Traditional file hashing is sequential: producing a single digest requires reading the entire file from start to end. Crystalia descriptors can instead be constructed in parallel across distributed workers, so very large datasets can be compared without that sequential bottleneck. The framework supports multiple signature methods, from cryptographic hashes to lightweight checksums, each with an explicit robustness classification.
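
As a rough illustration of why range-level descriptors parallelize where a whole-file digest does not, the sketch below hashes fixed-size byte ranges of a file independently in a thread pool. The function names (describe_chunk, describe_file), the 1 MiB default chunk size, and the reading of coverage as the fraction of the file a range spans are assumptions made for this example; they are not taken from the Crystalia implementation.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def describe_chunk(path, offset, length, total_size):
    """Hash one byte range of the file and return a descriptor-like dict."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read(length)
    return {
        "value": hashlib.sha256(data).hexdigest(),
        "offset": offset,
        "length": len(data),
        # Assumption: coverage is the fraction of the item this range spans.
        "coverage": len(data) / total_size,
    }


def describe_file(path, chunk_size=1 << 20, workers=4):
    """Build one descriptor per fixed-size range; ranges are hashed independently."""
    total_size = os.path.getsize(path)
    if total_size == 0:
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(
                describe_chunk,
                path,
                offset,
                min(chunk_size, total_size - offset),
                total_size,
            )
            for offset in range(0, total_size, chunk_size)
        ]
        # Results are collected in submission order, so they stay sorted by offset.
        return [future.result() for future in futures]
```

Because each range is hashed on its own, the same work could be spread across processes or machines, and two copies of a file can be compared range by range instead of digest by digest.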

Knowledge Graph Interoperability

By defining the data model in LinkML, Crystalia generates artifacts for a wide variety of tools: Pydantic models for collectors and reasoners, Protobuf for high-performance pipelines, OWL for semantic reasoning, TypeScript for web applications, and SHACL for RDF validation. One schema drives them all.

Beyond Checksums and Catalogs

Unlike DCAT (dataset metadata catalogs) or BagIt (archival manifest checksums), Crystalia provides multi-level descriptor granularity, parallel computation, and a robustness classification that lets agents reason about the reliability of different comparison strategies.
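
One way to picture how a robustness classification could inform a comparison: a mismatch under any shared method shows that two items differ, while a match is only as trustworthy as the strongest method that agrees. The sketch below encodes that rule. The level names (weak, moderate, strong), the dictionary layout, and the functions signatures and compare_signatures are all hypothetical; the schema's actual robustness enumeration is not reproduced here.

```python
import hashlib
import zlib

# Hypothetical robustness levels, ordered from least to most reliable.
ROBUSTNESS_RANK = {"weak": 0, "moderate": 1, "strong": 2}


def signatures(data: bytes) -> dict:
    """Compute two signatures for the same bytes, tagged with an assumed robustness."""
    return {
        "crc32": {"robustness": "weak", "value": format(zlib.crc32(data), "08x")},
        "sha256": {"robustness": "strong", "value": hashlib.sha256(data).hexdigest()},
    }


def compare_signatures(left: dict, right: dict):
    """Return (verdict, detail) for two {method: {"robustness", "value"}} mappings.

    - ("unknown", None) when no method is shared
    - ("different", method) when any shared method disagrees
    - ("match", robustness) with the robustness of the strongest agreeing method
    """
    shared = sorted(set(left) & set(right))
    if not shared:
        return ("unknown", None)
    for method in shared:
        if left[method]["value"] != right[method]["value"]:
            return ("different", method)
    strongest = max(shared, key=lambda m: ROBUSTNESS_RANK[left[m]["robustness"]])
    return ("match", left[strongest]["robustness"])


full = signatures(b"abc")
crc_only = {"crc32": full["crc32"]}
print(compare_signatures(full, signatures(b"abc")))  # ('match', 'strong')
print(compare_signatures(crc_only, full))            # ('match', 'weak')
```

A caller that only has a lightweight checksum in common with the other side still gets an answer, but one labeled with how far it can be trusted.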

Schema Overview

The core schema is defined in linkml_crystalia.yaml.

Element  Count
Classes  6
Slots    13
Enums    1

The entity-relationship diagram (ERD) of the schema shows its classes and their relationships. For a complete reference, see the Data Dictionary.

[Figure: ERD of the schema structure]

Core Entities

Class           Description
Item            An individual item (e.g. a file) that can be described by descriptors
Descriptor      A signature for the whole or part of an item, with offset, length, and coverage
DescriptorType  Configuration for a kind of descriptor, linking it to a generation method
Method          A method used to generate descriptors (e.g. SHA-256, CRC-32) with a robustness rating

Relationships

  • Annotation: Items have descriptors (hasDescriptor), and descriptors can annotate other descriptors
  • Typing: Each descriptor has a type (hasType) that defines how it was generated
  • Method: Each descriptor type uses a specific method (usesMethod) with a robustness classification
  • Containment: Items can be part of larger datasets (isPartOf)
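
The relationships above can be pictured with a few plain Python dataclasses. This is a hand-written sketch, not the generated Pydantic module: the attribute names mirror the slot names mentioned in the list, while the robustness strings, the example identifiers, the reuse of hasDescriptor for descriptor-on-descriptor annotation, and the helper methods_used are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Method:
    id: str
    robustness: str  # placeholder string standing in for the schema's classification


@dataclass
class DescriptorType:
    id: str
    usesMethod: Method


@dataclass
class Descriptor:
    id: str
    hasType: DescriptorType
    value: str
    # Assumption: descriptors annotating this descriptor are held in hasDescriptor.
    hasDescriptor: List["Descriptor"] = field(default_factory=list)


@dataclass
class Item:
    id: str
    isPartOf: Optional[str] = None
    hasDescriptor: List[Descriptor] = field(default_factory=list)


def methods_used(item: Item) -> List[str]:
    """Follow Item -> Descriptor -> DescriptorType -> Method, including nested descriptors."""
    found = set()
    stack = list(item.hasDescriptor)
    while stack:
        descriptor = stack.pop()
        found.add(descriptor.hasType.usesMethod.id)
        stack.extend(descriptor.hasDescriptor)
    return sorted(found)


# Hypothetical identifiers, chosen only to make the example readable.
sha256 = DescriptorType("ex:dtype-sha256", Method("ex:method-sha256", "strong"))
crc32 = DescriptorType("ex:dtype-crc32", Method("ex:method-crc32", "weak"))
inner = Descriptor("ex:desc-2", crc32, "cbf43926")
outer = Descriptor("ex:desc-1", sha256, "00" * 32, hasDescriptor=[inner])
item = Item("ex:item-1", isPartOf="ex:dataset-1", hasDescriptor=[outer])
print(methods_used(item))  # ['ex:method-crc32', 'ex:method-sha256']
```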

Example: SHA-256 Descriptor on a File

id: cryd:sample-image-001
label: crystal_specimen_2024.tiff
isPartOf: cryd:dataset-mineral-survey
hasDescriptor:
  - id: crys:desc-sha256-full-001
    hasType: crys:dtype-sha256sum
    value: "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    offset: 0
    length: 1048576
    coverage: 1.0
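
To show how the offset, length, and value fields of such a descriptor might be checked against actual bytes, here is a minimal sketch. verify_sha256_descriptor is a hypothetical helper, not part of Crystalia, and it assumes that offset and length are byte counts and that value is a hex-encoded SHA-256 digest.

```python
import hashlib


def verify_sha256_descriptor(data: bytes, descriptor: dict) -> bool:
    """Re-hash the byte range a descriptor points at and compare it to the stored value."""
    start = descriptor["offset"]
    end = start + descriptor["length"]
    if start < 0 or end > len(data):
        return False  # the descriptor points outside the available bytes
    digest = hashlib.sha256(data[start:end]).hexdigest()
    return digest == descriptor["value"].lower()


payload = b"crystal" * 4
descriptor = {
    "value": hashlib.sha256(payload).hexdigest(),
    "offset": 0,
    "length": len(payload),
    "coverage": 1.0,
}
print(verify_sha256_descriptor(payload, descriptor))                # True
print(verify_sha256_descriptor(payload[:-1] + b"X", descriptor))    # False
```

Note that the value shown in the YAML example is the SHA-256 digest of empty input, so it is best read as a placeholder rather than the digest of a real 1,048,576-byte file.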

Generated Artifacts

The following artifacts are generated from the schema and available in the project/ directory:

Artifact     File                             Description
OWL          linkml_crystalia.owl.ttl         OWL ontology in Turtle format
Pydantic     linkml_crystalia.py              Python Pydantic v2 models
Protobuf     linkml_crystalia.proto           Protocol Buffers schema
JSON Schema  linkml_crystalia.schema.json     JSON Schema for data validation
JSON-LD      linkml_crystalia.context.jsonld  JSON-LD context for linked data
SHACL        linkml_crystalia.shacl.ttl       SHACL shapes for RDF validation
TypeScript   linkml_crystalia.ts              TypeScript type definitions

Companion Tools

The Crystalia ecosystem includes collectors, reasoners, and browsers that consume this data model. Additional tools are under active development.

Data Dictionary

The Data Dictionary provides the complete reference for all classes, slots, enumerations, and type definitions in the schema, including class diagrams and ERD visualizations.

Quick Start

# Clone the repository
git clone https://github.com/vladistan/crystalia-data-model.git
cd crystalia-data-model

# Install dependencies
uv sync --group dev

# Generate all artifacts from the schema
just gen-project

# Serve the documentation locally
just serve-docs

# Run tests
just test

Citation

If you use Crystalia in your research, please cite:

V. Korolev and A. Joshi, "Crystalia: Flexible and Efficient Method for Large Dataset Lineage Tracking," 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 2024, pp. 3431-3440, doi: 10.1109/bigdata62323.2024.10826067.

@inproceedings{korolev2024crystalia,
  author    = {Korolev, Vladimir and Joshi, Anupam},
  title     = {Crystalia: Flexible and Efficient Method for Large Dataset Lineage Tracking},
  booktitle = {2024 IEEE International Conference on Big Data (BigData)},
  year      = {2024},
  pages     = {3431--3440},
  doi       = {10.1109/bigdata62323.2024.10826067},
  address   = {Washington, DC, USA}
}
