Chronos V2 - LLM event doc architecture


Here’s a holistic, production-ready architecture that delivers “complete semantic understanding” of your per-minute photos (plus location and mobile-use flag), and supports natural-language querying with rich answers (text + related images + metadata).


A. Why Pure Sub-Event Segmentation Falls Short

Memento 4.0’s two-stage pipeline (CLIP similarity → time-gap merge) works well for coarse lifelogging, but:

  • It can split long activities (dinner + TV) into multiple events or miss short but semantically important interactions (a colleague dropping by).
  • It treats all modality changes equally, so even trivial camera occlusions spawn their own sub-events.

To get complete semantic understanding, we need to move beyond fixed segmentation and treat the stream as a continuous, multi-modal corpus that we can query at any granularity.


B. End-to-End Semantic-First Architecture

flowchart LR
    A[Raw Ingestion]
    B[Feature Extraction & Captioning]
    C[Continuous Indexing]
    D[Query & RAG]
    E[Frontend UI]
    A --> B --> C --> D --> E

1. Raw Ingestion

  • Every minute: photo, GPS, is_using_mobile flag.
  • Stream each record into a broker (e.g. Kafka or a SpacetimeDB changefeed); a minimal producer sketch follows below.
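
As a concrete illustration, here is a minimal per-minute producer, assuming a Kafka topic named chronos.raw_minutes and the confluent-kafka client; the topic name and field names are illustrative, not fixed by the design above.

```python
# Per-minute ingestion sketch (assumptions: a Kafka topic "chronos.raw_minutes",
# the confluent-kafka client, and these illustrative field names).
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_minute(photo_path: str, lat: float, lon: float, is_using_mobile: bool) -> None:
    """Serialize one minute's capture and push it onto the raw-ingestion topic."""
    record = {
        "ts": int(time.time()),          # capture time (epoch seconds)
        "photo_path": photo_path,        # pointer to the stored frame, not the bytes
        "lat": lat,
        "lon": lon,
        "is_using_mobile": is_using_mobile,
    }
    producer.produce("chronos.raw_minutes", value=json.dumps(record).encode("utf-8"))
    producer.poll(0)                     # serve delivery callbacks without blocking

# e.g. publish_minute("/data/frames/2024-06-01T12-03.jpg", 37.7749, -122.4194, False)
producer.flush()
```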

2. Feature Extraction & Captioning

  • Occlusion & Quality Filter: drop or flag truly “garbage” frames.

  • Image Captioning: run BLIP-2 or MolmoE-1B on every frame.

  • Multi-Modal Embeddings:

    • Visual (CLIP)
    • Text (embed captions)
    • Spatial (lat/lon → normalized XY)
    • Usage (binary flag)
  • Rolling Context Window (e.g. ±5 frames): optionally re-caption “bad” frames via an LLM prompt plus surrounding context, ensuring seamless narratives across occlusions.
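
A sketch of the fused per-minute embedding, assuming CLIP ViT-B/32 via Hugging Face transformers for the visual part and a sentence-transformers model for the caption text; fusion by concatenation and the specific normalization choices are assumptions, not requirements of the pipeline.

```python
# Fused per-minute embedding sketch. Assumptions: CLIP via Hugging Face
# transformers, a sentence-transformers model for caption text, and simple
# concatenation as the fusion strategy (illustrative, not the only option).
import numpy as np
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_minute(photo_path: str, caption: str, lat: float, lon: float,
                 is_using_mobile: bool) -> np.ndarray:
    # Visual embedding (CLIP image tower), L2-normalized.
    with torch.no_grad():
        pixel = clip_proc(images=Image.open(photo_path), return_tensors="pt")
        img_vec = clip.get_image_features(**pixel)[0].numpy()
    img_vec /= np.linalg.norm(img_vec)

    # Text embedding of the generated caption.
    txt_vec = text_encoder.encode(caption, normalize_embeddings=True)

    # Spatial features: lat/lon squashed to roughly [0, 1].
    spatial = np.array([(lat + 90.0) / 180.0, (lon + 180.0) / 360.0])

    # Usage flag as a single binary dimension.
    usage = np.array([1.0 if is_using_mobile else 0.0])

    return np.concatenate([img_vec, txt_vec, spatial, usage]).astype("float32")
```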

3. Continuous Indexing (No Fixed Sub-Events)

Rather than rigid sub-event chunks, treat each minute (or a small N-minute window) as a “document” in your vector store, augmented with its metadata.

  1. Vector Store (Pinecone, FAISS): index each time-window embedding.
  2. Metadata Store (SpacetimeDB or TimescaleDB): store timestamp, location, usage flag, key-frame pointers.
  3. Continuous Aggregates: materialized views for fast time- or location-based filtering.

This departs from Section 4.1’s bottom-up clustering and instead relies on flexible retrieval over fine-grained chunks.
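
A sketch of the minute-as-document index, using FAISS for the vectors and SQLite as a stand-in for the metadata store (SpacetimeDB/TimescaleDB above); the dimensionality follows the embedding sketch earlier and the schema is illustrative.

```python
# Continuous indexing sketch: each minute becomes one vector-store entry plus
# one metadata row. FAISS + SQLite stand in for Pinecone/FAISS and
# SpacetimeDB/TimescaleDB; the 899-dim size matches the embedding sketch above
# (512 CLIP + 384 MiniLM + 2 spatial + 1 usage) and is illustrative.
import sqlite3

import faiss
import numpy as np

DIM = 512 + 384 + 2 + 1
index = faiss.IndexIDMap(faiss.IndexFlatIP(DIM))   # inner product over normalized parts

db = sqlite3.connect("chronos_meta.db")
db.execute("""CREATE TABLE IF NOT EXISTS minutes (
    id INTEGER PRIMARY KEY,   -- shared with the FAISS id
    ts INTEGER, lat REAL, lon REAL,
    is_using_mobile INTEGER, photo_path TEXT, caption TEXT)""")

def index_minute(doc_id: int, vec: np.ndarray, meta: dict) -> None:
    """Add one minute-document to both stores under the same id."""
    index.add_with_ids(vec.reshape(1, -1), np.array([doc_id], dtype="int64"))
    db.execute("INSERT OR REPLACE INTO minutes VALUES (?,?,?,?,?,?,?)",
               (doc_id, meta["ts"], meta["lat"], meta["lon"],
                int(meta["is_using_mobile"]), meta["photo_path"], meta["caption"]))
    db.commit()
```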

4. Query & RAG Layer

  1. Natural-language query arrives.

  2. Metadata filter: e.g. WHERE ts BETWEEN … AND location WITHIN … AND is_using_mobile = X.

  3. Semantic filter: embed the query with the same encoder → ANN search → top-k candidate windows.

  4. RAG Agents:

    • Summary Agent: stitches contiguous windows’ captions into a coherent answer narrative.
    • Factual Agent: directly extracts timestamps, locations, or usage flags from metadata.
    • Image Retriever: returns the key-frames (or a small gallery) corresponding to the chosen windows.

This unifies Sections 4.3–4.4: we no longer need “sub-event → event” clustering to get coherent answers.
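
A sketch of the two-stage retrieval and prompt assembly, reusing the index and metadata table from the indexing sketch above; embed_query is a hypothetical helper that maps the natural-language query into the same fused embedding space, and the over-fetch factor, top-k value, and prompt wording are all illustrative.

```python
# Query + RAG sketch: metadata filter first, ANN search over the survivors,
# then stitch the winning windows' captions into a summary prompt.
# `embed_query` is a hypothetical helper that maps the NL query into the same
# fused space (e.g. CLIP text tower + caption encoder, zeros elsewhere);
# `db` and `index` follow the indexing sketch above.
import numpy as np

def answer(query_text: str, t_start: int, t_end: int, top_k: int = 8) -> dict:
    # 1. Metadata filter: restrict to the requested time window.
    rows = db.execute(
        "SELECT id, ts, photo_path, caption FROM minutes WHERE ts BETWEEN ? AND ?",
        (t_start, t_end)).fetchall()
    allowed = {r[0]: r for r in rows}

    # 2. Semantic filter: ANN search, then keep only ids that passed step 1.
    q_vec = embed_query(query_text).reshape(1, -1).astype("float32")
    _, ids = index.search(q_vec, top_k * 4)          # over-fetch, then filter
    hits = [allowed[i] for i in ids[0] if i in allowed][:top_k]

    # 3. RAG agents: summary prompt from captions, key-frames from metadata.
    context = "\n".join(f"[{ts}] {caption}" for _, ts, _, caption in hits)
    prompt = (f"Using only these minute-by-minute observations:\n{context}\n\n"
              f"Answer the question: {query_text}")
    return {"prompt": prompt,                         # send to the LLM of your choice
            "key_frames": [path for _, _, path, _ in hits]}
```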

5. Frontend UI

  • Chat interface: shows the LLM’s answer text.
  • Gallery panel: displays the returned key-frames with hoverable metadata (timestamp, GPS, mobile-use).
  • Timeline slider & map: lets the user restrict query scope interactively.
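
Purely as an illustration of the “rich answer” contract between the RAG layer and this UI, a possible response payload is sketched below; the field names are assumptions chosen to match the chat, gallery, and timeline panels.

```python
# Illustrative answer payload the RAG layer could return to the UI; field
# names are assumptions chosen to match the chat / gallery / timeline panels.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FrameCard:
    photo_path: str        # image shown in the gallery panel
    ts: int                # hover metadata: timestamp
    lat: float
    lon: float
    is_using_mobile: bool

@dataclass
class QueryAnswer:
    text: str                                               # LLM narrative for the chat pane
    frames: List[FrameCard] = field(default_factory=list)   # gallery items
    time_range: tuple = (None, None)                         # echoes the slider/map scope used
```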

C. Why This Is Better

  • Granular Retrieval: you never “miss” a one-minute interaction, and you can always group adjacent minutes dynamically in your answer.
  • Semantic Coherence via RAG: the LLM glues together whatever windows it needs, rather than depending on brittle precomputed event boundaries.
  • Simplicity & Real-Time: you only need to ingest, embed, and index each minute’s data—no batch clustering or hyperparameter tuning for sub-events.
  • Rich Answers: easily surface both narrative and the exact photos / metadata underpinning it.

Bottom line: skip hard-coded sub-event creation altogether. Instead, index every minute as a semantic doc, and let your RAG/query layer weave them into the right “event” on demand—delivering full semantic understanding, precise image recall, and real-time responsiveness.
