Chronos V2 - LLM event doc architecture
Here’s a holistic, production-ready architecture that delivers “complete semantic understanding” of your per-minute photos (plus location and mobile-use flag), and supports natural-language querying with rich answers (text + related images + metadata).
A. Why Pure Sub-Event Segmentation Falls Short
Memento 4.0’s two-stage pipeline (CLIP similarity → time-gap merge) works well for coarse lifelogging, but:
- It can split long activities (dinner + TV) into multiple events or miss short but semantically important interactions (a colleague dropping by).
- It treats all modality changes equally—even trivial camera-covers produce sub-events.
To get complete semantic understanding, we need to move beyond fixed segmentation and treat the stream as a continuous, multi-modal corpus that we can query at any granularity.
B. End-to-End Semantic-First Architecture
```mermaid
flowchart LR
    A[Raw Ingestion]
    B[Feature Extraction & Captioning]
    C[Continuous Indexing]
    D[Query & RAG]
    E[Frontend UI]
    A --> B --> C --> D --> E
```
1. Raw Ingestion
- Every minute: photo, GPS, is_using_mobile flag.
- Stream each record into a broker (Kafka or a SpacetimeDB changefeed); a minimal producer sketch follows.
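A minimal sketch of the per-minute producer, assuming kafka-python and an illustrative topic name `chronos.minutes` (neither is prescribed by the architecture); the record fields mirror the payload described above.

```python
# Per-minute ingestion producer sketch (kafka-python; topic name is illustrative).
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_minute(photo_path: str, lat: float, lon: float, is_using_mobile: bool) -> None:
    """Publish one minute's capture as a single record."""
    record = {
        "ts": int(time.time()),        # capture timestamp (epoch seconds)
        "photo_path": photo_path,      # pointer to the stored frame, not the bytes
        "lat": lat,
        "lon": lon,
        "is_using_mobile": is_using_mobile,
    }
    producer.send("chronos.minutes", value=record)

publish_minute("/data/frames/2024-06-01T09-41.jpg", 37.7749, -122.4194, False)
producer.flush()
```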
2. Feature Extraction & Captioning
- Occlusion & Quality Filter: drop or flag truly “garbage” frames.
- Image Captioning: BLIP-2 or MolmoE-1B on every frame.
- Multi-Modal Embeddings:
  - Visual (CLIP)
  - Text (embed captions)
  - Spatial (lat/lon → normalized XY)
  - Usage (binary flag)
- Rolling Context Window (e.g. ±5 frames): optionally re-caption “bad” frames via an LLM prompt plus surrounding context, ensuring seamless narratives across occlusions. A captioning and embedding sketch follows this list.
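A sketch of the captioning and embedding step, assuming Hugging Face transformers for BLIP-2 and CLIP and sentence-transformers for the caption text embedding; the model names and the simple min-max lat/lon normalization are illustrative choices, not fixed requirements.

```python
# Per-frame captioning + multi-modal embedding sketch (model choices are illustrative).
import numpy as np
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration, CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer

blip_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def extract_features(photo_path: str, lat: float, lon: float, is_using_mobile: bool) -> dict:
    image = Image.open(photo_path).convert("RGB")

    # Caption the frame with BLIP-2
    caption_inputs = blip_processor(images=image, return_tensors="pt")
    caption_ids = blip_model.generate(**caption_inputs, max_new_tokens=30)
    caption = blip_processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip()

    # Visual embedding from the CLIP image tower
    clip_inputs = clip_processor(images=image, return_tensors="pt")
    visual_emb = clip_model.get_image_features(**clip_inputs).detach().numpy()[0]

    # Text embedding of the caption
    text_emb = text_encoder.encode(caption)

    # Spatial + usage features (simple min-max normalization, illustrative only)
    spatial = np.array([(lat + 90.0) / 180.0, (lon + 180.0) / 360.0])
    usage = np.array([1.0 if is_using_mobile else 0.0])

    return {"caption": caption, "visual": visual_emb, "text": text_emb,
            "spatial": spatial, "usage": usage}
```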
3. Continuous Indexing (No Fixed Sub-Events)
Rather than rigid sub-event chunks, treat each minute (or small N-minute window) as a “document” in your vector store, augmented with its metadata.
- Vector Store (Pinecone, FAISS): index each time-window embedding.
- Metadata Store (SpacetimeDB or TimescaleDB): store timestamp, location, usage flag, key-frame pointers.
- Continuous Aggregates: materialized views for fast time- or location-based filtering.
This departs from Section 4.1’s bottom-up clustering and instead relies on flexible retrieval over fine-grained chunks.
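A minimal sketch of the per-minute indexing step, using FAISS as the vector store and SQLite standing in for the metadata store (SpacetimeDB/TimescaleDB in a real deployment); the table schema and the choice of the CLIP image embedding as the retrieval key are assumptions for illustration.

```python
# Per-minute indexing sketch: FAISS for vectors, SQLite as a stand-in metadata store.
import sqlite3
import numpy as np
import faiss

DIM = 512  # CLIP ViT-B/32 image embedding size
index = faiss.IndexIDMap(faiss.IndexFlatIP(DIM))

meta = sqlite3.connect("chronos_meta.db")
meta.execute("""CREATE TABLE IF NOT EXISTS minutes (
    ts INTEGER PRIMARY KEY, lat REAL, lon REAL,
    is_using_mobile INTEGER, photo_path TEXT, caption TEXT)""")

def index_minute(ts: int, visual_emb: np.ndarray, lat: float, lon: float,
                 is_using_mobile: bool, photo_path: str, caption: str) -> None:
    # Normalize so inner product behaves like cosine similarity
    vec = visual_emb.astype("float32")
    vec /= np.linalg.norm(vec) + 1e-9
    index.add_with_ids(vec.reshape(1, -1), np.array([ts], dtype=np.int64))

    # Timestamp doubles as the vector ID, linking the two stores
    meta.execute("INSERT OR REPLACE INTO minutes VALUES (?, ?, ?, ?, ?, ?)",
                 (ts, lat, lon, int(is_using_mobile), photo_path, caption))
    meta.commit()
```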
4. Query & RAG Layer
- Natural-language query arrives.
- Metadata filter: e.g. `WHERE ts BETWEEN … AND location WITHIN … AND is_using_mobile = X`.
- Semantic filter: embed the query with the same encoder → ANN search → top-k candidate windows.
- RAG Agents:
  - Summary Agent: stitches contiguous windows’ captions into a coherent answer narrative.
  - Factual Agent: directly extracts timestamps, locations, or usage flags from metadata.
  - Image Retriever: returns the key-frames (or a small gallery) corresponding to the chosen windows.

This unifies Sections 4.3–4.4: we no longer need “sub-event → event” clustering to get coherent answers.
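A sketch of the retrieval-plus-RAG answer path, reusing the `index`, `meta`, `clip_model`, and `clip_processor` objects from the earlier sketches and assuming the OpenAI chat API as the summary agent; the model name and prompt are placeholders.

```python
# Query + RAG sketch: metadata filter -> ANN search -> caption stitching via an LLM.
# Reuses `index`, `meta`, `clip_model`, `clip_processor` from the sketches above.
import numpy as np
from openai import OpenAI

llm = OpenAI()

def answer_query(query: str, ts_from: int, ts_to: int, k: int = 8) -> dict:
    # 1. Metadata filter: restrict candidates to the requested time window
    rows = meta.execute(
        "SELECT ts FROM minutes WHERE ts BETWEEN ? AND ?", (ts_from, ts_to)
    ).fetchall()
    allowed = {r[0] for r in rows}

    # 2. Semantic filter: embed the query with the CLIP text tower, then ANN-search
    q_inputs = clip_processor(text=[query], return_tensors="pt", padding=True)
    q_vec = clip_model.get_text_features(**q_inputs).detach().numpy().astype("float32")
    q_vec /= np.linalg.norm(q_vec) + 1e-9
    _, ids = index.search(q_vec, k * 4)            # over-fetch, then apply the filter
    hits = [int(i) for i in ids[0] if int(i) in allowed][:k]
    if not hits:
        return {"answer": "No matching minutes found.", "key_frames": []}

    # 3. Summary agent: stitch the retrieved captions into a narrative answer
    placeholders = ",".join("?" * len(hits))
    ctx = meta.execute(
        f"SELECT ts, caption, photo_path FROM minutes WHERE ts IN ({placeholders}) ORDER BY ts",
        hits,
    ).fetchall()
    context_text = "\n".join(f"[{ts}] {caption}" for ts, caption, _ in ctx)
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer from the lifelog captions below."},
            {"role": "user", "content": f"{context_text}\n\nQuestion: {query}"},
        ],
    )
    return {"answer": reply.choices[0].message.content,
            "key_frames": [path for _, _, path in ctx]}
```

Over-fetching from the ANN index and then applying the metadata filter keeps the sketch simple; a production setup would push the filter into the vector store where that is supported.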
5. Frontend UI
- Chat interface: shows the LLM’s answer text.
- Gallery panel: displays the returned key-frames with hoverable metadata (timestamp, GPS, mobile-use).
- Timeline slider & map: lets the user restrict the query scope interactively; a sketch of the response payload these panels consume follows this list.
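An illustrative shape for that payload; the field names are assumptions rather than a fixed API contract.

```python
# Hypothetical response payload consumed by the chat, gallery, and map/timeline panels.
from dataclasses import dataclass
from typing import List

@dataclass
class KeyFrame:
    ts: int                      # capture timestamp shown on hover
    photo_url: str               # image rendered in the gallery panel
    lat: float                   # plotted on the map
    lon: float
    is_using_mobile: bool        # mobile-use flag shown in the hover metadata

@dataclass
class QueryResponse:
    answer: str                  # narrative text for the chat interface
    key_frames: List[KeyFrame]   # gallery items and map/timeline markers
```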
C. Why This Is Better
- Granular Retrieval: you never “miss” a one-minute interaction, and you can always group adjacent minutes dynamically in your answer.
- Semantic Coherence via RAG: the LLM glues together whatever windows it needs, rather than depending on brittle precomputed event boundaries.
- Simplicity & Real-Time: you only need to ingest, embed, and index each minute’s data—no batch clustering or hyperparameter tuning for sub-events.
- Rich Answers: easily surface both narrative and the exact photos / metadata underpinning it.
Bottom line: skip hard-coded sub-event creation altogether. Instead, index every minute as a semantic doc, and let your RAG/query layer weave them into the right “event” on demand—delivering full semantic understanding, precise image recall, and real-time responsiveness.