INTRODUCING
Status: In Design
The Light Keeper is a structured information-management system currently in design, intended to ingest, normalize, index, and retrieve testimony-based media with strict source fidelity. Its purpose is to consolidate a widely distributed dataset into a unified, query-ready archive while preserving the exact wording, chronology, and metadata of every source.
The initial test corpus is the publicly available Bledsoe family material. This dataset was selected because of its technical characteristics:
• it spans nearly two decades of recorded material
• it exists across multiple independent formats (books, interviews, podcasts, livestreams, long-form written posts)
• it continues to grow, making it ideal for testing incremental ingestion
• it contains detailed, long-form accounts well suited to structured retrieval
These properties make the corpus an effective real-world stress test for a system designed to manage multi-format, long-duration testimony data.
The system is being architected as a deterministic pipeline of six stages:
Source Acquisition Layer
The system will collect heterogeneous inputs including books, interviews, podcasts, livestreams, and long-form written accounts. Each item will become a raw “Source Object” with a unique identifier, timestamps, and provenance metadata.
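As a minimal sketch of what a Source Object might look like, the Python below derives the identifier from a hash of the raw bytes, so re-acquiring an identical file yields the same ID. The field names, the `src-` prefix, and the example values are assumptions made for illustration; the design does not yet specify them.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceObject:
    """Raw record for one acquired item (all field names are illustrative)."""
    source_id: str    # stable identifier derived from the raw bytes
    medium: str       # e.g. "book", "interview", "podcast"
    title: str
    published: str    # ISO 8601 date of the original release
    origin: str       # provenance: where the item was obtained
    acquired_at: str  # ISO 8601 timestamp of acquisition
    sha256: str       # checksum of the raw bytes, for later verification


def make_source_object(raw: bytes, medium: str, title: str,
                       published: str, origin: str,
                       acquired_at: str) -> SourceObject:
    digest = hashlib.sha256(raw).hexdigest()
    # A hash-derived ID means identical bytes always map to the same
    # identifier, which makes repeated ingestion runs easy to de-duplicate.
    return SourceObject(
        source_id=f"src-{digest[:16]}",
        medium=medium,
        title=title,
        published=published,
        origin=origin,
        acquired_at=acquired_at,
        sha256=digest,
    )


# Placeholder values only; nothing here comes from the real corpus.
example = make_source_object(
    raw=b"placeholder audio bytes",
    medium="podcast",
    title="Example episode (placeholder)",
    published="2020-01-01",
    origin="https://example.org/episode-1",
    acquired_at="2024-01-01T00:00:00+00:00",
)
```

Freezing the dataclass keeps the raw record read-only after creation, which matches the emphasis on source fidelity.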
Transcription & Extraction Module
Audio and video inputs will be transcribed through automated speech recognition with optional human verification. Transcripts will be segmented into discrete “Statements” that preserve paragraph structure, speaker identity, timestamps, and contextual relationships.
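One possible segmentation rule, sketched below in Python, merges consecutive transcript cues from the same speaker into a single Statement while keeping the start and end times. The cue format `(start, end, speaker, text)`, the merge rule, and the sample lines are assumptions for illustration, not the planned behaviour of the module.

```python
from dataclasses import dataclass
from typing import List, Tuple

Cue = Tuple[float, float, str, str]  # (start_sec, end_sec, speaker, text)


@dataclass
class Statement:
    speaker: str
    start: float
    end: float
    text: str


def segment(cues: List[Cue]) -> List[Statement]:
    """Merge consecutive cues from the same speaker into one Statement."""
    statements: List[Statement] = []
    for start, end, speaker, text in cues:
        last = statements[-1] if statements else None
        if last is not None and last.speaker == speaker:
            # Same speaker continues: extend the time span and append the
            # words unchanged, separated by a single space.
            last.end = end
            last.text = f"{last.text} {text}"
        else:
            statements.append(Statement(speaker, start, end, text))
    return statements


# Placeholder cues; not taken from any real recording.
cues = [
    (0.0, 4.2, "HOST", "Welcome back to the show."),
    (4.2, 9.8, "GUEST", "Thanks for having me."),
    (9.8, 15.1, "GUEST", "I'd like to start with the timeline."),
]
statements = segment(cues)
```

A real implementation would likely also split on long pauses or paragraph breaks; this sketch only shows how speaker identity and timestamps can survive the merge.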
Normalization & Structural Encoding
All content will be normalized into a strict JSON schema. Each Statement will contain:
• exact original wording
• source reference (medium, date, timestamp)
• speaker metadata
• adjacency links (preceding/following statements)
• optional semantic tags generated through model-assisted extraction
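No summarization or interpretation will be applied to the statement text at any stage; optional semantic tags are stored alongside the original wording as metadata and never replace it.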
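The Statement fields listed above could be encoded roughly as follows. This is a hedged sketch: the key names (`id`, `text`, `source`, `speaker`, `prev`, `next`, `tags`) and the ID format are assumptions, since the actual schema has not been published.

```python
import json
from typing import Dict, List


def build_statements(source: Dict[str, str], speaker: str,
                     segments: List[Dict[str, str]]) -> List[dict]:
    """Turn ordered transcript segments into linked Statement records."""
    records = []
    for i, seg in enumerate(segments):
        records.append({
            "id": f"{source['id']}-{i:04d}",
            "text": seg["text"],  # exact original wording, never edited
            "source": {
                "medium": source["medium"],
                "date": source["date"],
                "timestamp": seg["timestamp"],
            },
            "speaker": {"name": speaker},
            "prev": None,         # adjacency links, filled in below
            "next": None,
            "tags": [],           # optional semantic tags
        })
    for i, rec in enumerate(records):
        if i > 0:
            rec["prev"] = records[i - 1]["id"]
        if i < len(records) - 1:
            rec["next"] = records[i + 1]["id"]
    return records


# Placeholder content only.
demo_source = {"id": "src-demo", "medium": "interview", "date": "2020-01-01"}
demo_segments = [
    {"timestamp": "00:00:05", "text": "First placeholder statement."},
    {"timestamp": "00:00:19", "text": "Second placeholder statement."},
]
records = build_statements(demo_source, "Speaker A", demo_segments)
encoded = json.dumps(records, indent=2, ensure_ascii=False)
```

Storing adjacency as IDs rather than nested objects keeps each record flat and lets a reader walk forward or backward through the original context.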
Indexing & Vault Storage
Normalized statements will be stored in a version-controlled vault. Indexes will support:
• full-text retrieval
• chronological ordering
• source filtering
• topic/event cross-reference
• optional embedding-based similarity search
The vault will keep the original wording immutable while allowing metadata to be extended over time.
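The split between immutable wording and extensible metadata can be illustrated with a small in-memory store. The `Vault` class and its method names below are hypothetical, and version control and indexing are deliberately left out; the sketch only shows write-once text, a checksum for verification, and appendable tags.

```python
import hashlib
from typing import Dict


class Vault:
    """Toy write-once store: text is fixed at insert, tags may grow."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def put(self, statement_id: str, text: str) -> None:
        if statement_id in self._records:
            # Refuse any second write so stored wording can never change.
            raise ValueError(f"{statement_id} already stored; wording is write-once")
        self._records[statement_id] = {
            "text": text,
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "tags": [],
        }

    def add_tag(self, statement_id: str, tag: str) -> None:
        tags = self._records[statement_id]["tags"]
        if tag not in tags:
            tags.append(tag)

    def get(self, statement_id: str) -> dict:
        rec = self._records[statement_id]
        # Return a copy so callers cannot mutate the stored record.
        return {"text": rec["text"], "sha256": rec["sha256"], "tags": list(rec["tags"])}

    def verify(self, statement_id: str) -> bool:
        rec = self._records[statement_id]
        return hashlib.sha256(rec["text"].encode("utf-8")).hexdigest() == rec["sha256"]


vault = Vault()
vault.put("src-demo-0000", "First placeholder statement.")
vault.add_tag("src-demo-0000", "topic:example")
```

Calling `vault.put("src-demo-0000", ...)` a second time raises `ValueError`, while further `add_tag` calls succeed.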
Query Engine
The retrieval engine will return only primary-source material. All responses will:
• preserve original wording
• avoid interpretation or rewriting
• include citations and precise source locations
No blended narratives or inferred meaning will be generated.
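A retrieval function consistent with these rules might simply filter stored Statements and return them untouched, in chronological order, with their source locations attached. The sketch below uses a case-insensitive substring match as a stand-in for a real full-text index; the record layout, the `search` function, and the sample data are assumptions for illustration.

```python
from typing import Dict, List, Optional

# Placeholder records; the wording is invented and not from any source.
STATEMENTS: List[Dict[str, str]] = [
    {"id": "b-0003", "text": "Placeholder remark about the second event.",
     "medium": "podcast", "date": "2021-06-10", "timestamp": "00:41:07"},
    {"id": "a-0001", "text": "Placeholder remark about the first event.",
     "medium": "book", "date": "2019-02-01", "timestamp": "p. 12"},
    {"id": "b-0009", "text": "An unrelated placeholder sentence.",
     "medium": "podcast", "date": "2021-06-10", "timestamp": "00:55:30"},
]


def search(statements: List[Dict[str, str]], term: str,
           medium: Optional[str] = None) -> List[Dict[str, str]]:
    """Return matching statements verbatim, oldest first."""
    needle = term.casefold()
    hits = [
        s for s in statements
        if needle in s["text"].casefold()
        and (medium is None or s["medium"] == medium)
    ]
    # ISO 8601 dates sort correctly as plain strings.
    return sorted(hits, key=lambda s: (s["date"], s["timestamp"]))


results = search(STATEMENTS, "event")
podcast_only = search(STATEMENTS, "event", medium="podcast")
```

Because the function only selects and orders records, nothing in the result is rewritten or merged; each hit carries its own citation fields.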
Output Specification
Query outputs will include:
• an ordered set of direct quotations
• citation metadata
• optional semantic tags for downstream analytics and visualization
The system is designed to enforce neutrality and maintain source integrity.
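To make the output specification concrete, the sketch below assembles an ordered list of direct quotations with citation metadata and, on request, semantic tags. The payload keys (`quotations`, `position`, `quote`, `citation`, `tags`) and the `build_output` helper are hypothetical; the final format may differ.

```python
import json
from typing import Dict, List


def build_output(hits: List[Dict], include_tags: bool = False) -> Dict:
    """Wrap already-ordered hits in a quotation-plus-citation payload."""
    quotations = []
    for position, hit in enumerate(hits, start=1):
        entry = {
            "position": position,
            "quote": hit["text"],  # copied verbatim from the stored record
            "citation": {
                "statement_id": hit["id"],
                "medium": hit["medium"],
                "date": hit["date"],
                "location": hit["timestamp"],
            },
        }
        if include_tags:
            entry["tags"] = list(hit.get("tags", []))
        quotations.append(entry)
    return {"quotations": quotations}


# Placeholder hits; the wording is invented for the example.
hits = [
    {"id": "a-0001", "text": "Placeholder remark about the first event.",
     "medium": "book", "date": "2019-02-01", "timestamp": "p. 12",
     "tags": ["topic:example"]},
    {"id": "b-0003", "text": "Placeholder remark about the second event.",
     "medium": "podcast", "date": "2021-06-10", "timestamp": "00:41:07"},
]
payload = build_output(hits, include_tags=True)
payload_json = json.dumps(payload, indent=2, ensure_ascii=False)
```

Keeping tags behind a flag reflects their optional status: a consumer that only needs quotations and citations can ignore them entirely.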
The Light Keeper aims to provide researchers, archivists, and analysts with a reliable way to work with testimony-based datasets that span long timeframes and diverse media types. As the design progresses, the focus remains on ensuring transparent retrieval, consistent metadata handling, and clean, verifiable data structures for future research applications.