Chronos V2 - LLM event doc architecture
Here’s a holistic, production-ready architecture that delivers “complete semantic understanding” of your per-minute photos (plus location and mobile-use flag), and supports natural-language querying with rich answers (text + related images + metadata).
A. Why Pure Sub-Event Segmentation Falls Short
Memento 4.0’s two-stage pipeline (CLIP similarity → time-gap merge) works well for coarse lifelogging, but:
- It can split long activities (dinner + TV) into multiple events or miss short but semantically important interactions (a colleague dropping by).
- It treats all modality changes equally—even trivial camera-covers produce sub-events.
To get complete semantic understanding, we need to move beyond fixed segmentation and treat the stream as a continuous, multi-modal corpus that we can query at any granularity.
B. End-to-End Semantic-First Architecture
```mermaid
flowchart LR
    A[Raw Ingestion]
    B[Feature Extraction & Captioning]
    C[Continuous Indexing]
    D[Query & RAG]
    E[Frontend UI]
    A --> B --> C --> D --> E
```
1. Raw Ingestion
- Every minute: photo, GPS, is_using_mobile flag.
- Stream each record into a broker (Kafka or a SpacetimeDB changefeed); a minimal producer sketch follows.
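A minimal sketch of the per-minute producer, assuming kafka-python and an illustrative topic name `chronos.minutes` (neither is prescribed by the architecture); the record fields mirror the payload described above.

```python
# Per-minute ingestion producer sketch (kafka-python; topic name is illustrative).
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_minute(photo_path: str, lat: float, lon: float, is_using_mobile: bool) -> None:
    """Publish one minute's capture as a single record."""
    record = {
        "ts": int(time.time()),        # capture timestamp (epoch seconds)
        "photo_path": photo_path,      # pointer to the stored frame, not the bytes
        "lat": lat,
        "lon": lon,
        "is_using_mobile": is_using_mobile,
    }
    producer.send("chronos.minutes", value=record)

publish_minute("/data/frames/2024-06-01T09-41.jpg", 37.7749, -122.4194, False)
producer.flush()
```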
2. Feature Extraction & Captioning
- Occlusion & Quality Filter: drop or flag truly “garbage” frames.
- Image Captioning: BLIP-2 or MolmoE-1B on every frame.
- Multi-Modal Embeddings:
  - Visual (CLIP)
  - Text (embed captions)
  - Spatial (lat/lon → normalized XY)
  - Usage (binary flag)
- Rolling Context Window (e.g. ±5 frames): optionally re-caption “bad” frames via an LLM prompt plus surrounding context, ensuring seamless narratives across occlusions. A captioning and embedding sketch follows this list.
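A sketch of the captioning and embedding step, assuming Hugging Face transformers for BLIP-2 and CLIP and sentence-transformers for the caption text embedding; the model names and the simple min-max lat/lon normalization are illustrative choices, not fixed requirements.

```python
# Per-frame captioning + multi-modal embedding sketch (model choices are illustrative).
import numpy as np
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration, CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer

blip_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def extract_features(photo_path: str, lat: float, lon: float, is_using_mobile: bool) -> dict:
    image = Image.open(photo_path).convert("RGB")

    # Caption the frame with BLIP-2
    caption_inputs = blip_processor(images=image, return_tensors="pt")
    caption_ids = blip_model.generate(**caption_inputs, max_new_tokens=30)
    caption = blip_processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip()

    # Visual embedding from the CLIP image tower
    clip_inputs = clip_processor(images=image, return_tensors="pt")
    visual_emb = clip_model.get_image_features(**clip_inputs).detach().numpy()[0]

    # Text embedding of the caption
    text_emb = text_encoder.encode(caption)

    # Spatial + usage features (simple min-max normalization, illustrative only)
    spatial = np.array([(lat + 90.0) / 180.0, (lon + 180.0) / 360.0])
    usage = np.array([1.0 if is_using_mobile else 0.0])

    return {"caption": caption, "visual": visual_emb, "text": text_emb,
            "spatial": spatial, "usage": usage}
```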
3. Continuous Indexing (No Fixed Sub-Events)
Rather than rigid sub-event chunks, treat each minute (or small N-minute window) as a “document” in your vector store, augmented with its metadata.
- Vector Store (Pinecone, FAISS): index each time-window embedding.
- Metadata Store (SpacetimeDB or TimescaleDB): store timestamp, location, usage flag, key-frame pointers.
- Continuous Aggregates: materialized views for fast time- or location-based filtering.
This departs from Section 4.1’s bottom-up clustering and instead relies on flexible retrieval over fine-grained chunks.
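A minimal sketch of the per-minute indexing step, using FAISS as the vector store and SQLite standing in for the metadata store (SpacetimeDB/TimescaleDB in a real deployment); the table schema and the choice of the CLIP image embedding as the retrieval key are assumptions for illustration.

```python
# Per-minute indexing sketch: FAISS for vectors, SQLite as a stand-in metadata store.
import sqlite3
import numpy as np
import faiss

DIM = 512  # CLIP ViT-B/32 image embedding size
index = faiss.IndexIDMap(faiss.IndexFlatIP(DIM))

meta = sqlite3.connect("chronos_meta.db")
meta.execute("""CREATE TABLE IF NOT EXISTS minutes (
    ts INTEGER PRIMARY KEY, lat REAL, lon REAL,
    is_using_mobile INTEGER, photo_path TEXT, caption TEXT)""")

def index_minute(ts: int, visual_emb: np.ndarray, lat: float, lon: float,
                 is_using_mobile: bool, photo_path: str, caption: str) -> None:
    # Normalize so inner product behaves like cosine similarity
    vec = visual_emb.astype("float32")
    vec /= np.linalg.norm(vec) + 1e-9
    index.add_with_ids(vec.reshape(1, -1), np.array([ts], dtype=np.int64))

    # Timestamp doubles as the vector ID, linking the two stores
    meta.execute("INSERT OR REPLACE INTO minutes VALUES (?, ?, ?, ?, ?, ?)",
                 (ts, lat, lon, int(is_using_mobile), photo_path, caption))
    meta.commit()
```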
4. Query & RAG Layer
- Natural-language query arrives.
- Metadata filter: e.g. `WHERE ts BETWEEN … AND location WITHIN … AND is_using_mobile = X`.
- Semantic filter: embed the query with the same encoder → ANN search → top-k candidate windows.
- RAG Agents:
  - Summary Agent: stitches contiguous windows’ captions into a coherent answer narrative.
  - Factual Agent: directly extracts timestamps, locations, or usage flags from metadata.
  - Image Retriever: returns the key-frames (or a small gallery) corresponding to the chosen windows.

This unifies Sections 4.3–4.4: we no longer need “sub-event → event” clustering to get coherent answers.
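A sketch of the retrieval-plus-RAG answer path, reusing the `index`, `meta`, `clip_model`, and `clip_processor` objects from the earlier sketches and assuming the OpenAI chat API as the summary agent; the model name and prompt are placeholders.

```python
# Query + RAG sketch: metadata filter -> ANN search -> caption stitching via an LLM.
# Reuses `index`, `meta`, `clip_model`, `clip_processor` from the sketches above.
import numpy as np
from openai import OpenAI

llm = OpenAI()

def answer_query(query: str, ts_from: int, ts_to: int, k: int = 8) -> dict:
    # 1. Metadata filter: restrict candidates to the requested time window
    rows = meta.execute(
        "SELECT ts FROM minutes WHERE ts BETWEEN ? AND ?", (ts_from, ts_to)
    ).fetchall()
    allowed = {r[0] for r in rows}

    # 2. Semantic filter: embed the query with the CLIP text tower, then ANN-search
    q_inputs = clip_processor(text=[query], return_tensors="pt", padding=True)
    q_vec = clip_model.get_text_features(**q_inputs).detach().numpy().astype("float32")
    q_vec /= np.linalg.norm(q_vec) + 1e-9
    _, ids = index.search(q_vec, k * 4)            # over-fetch, then apply the filter
    hits = [int(i) for i in ids[0] if int(i) in allowed][:k]
    if not hits:
        return {"answer": "No matching minutes found.", "key_frames": []}

    # 3. Summary agent: stitch the retrieved captions into a narrative answer
    placeholders = ",".join("?" * len(hits))
    ctx = meta.execute(
        f"SELECT ts, caption, photo_path FROM minutes WHERE ts IN ({placeholders}) ORDER BY ts",
        hits,
    ).fetchall()
    context_text = "\n".join(f"[{ts}] {caption}" for ts, caption, _ in ctx)
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer from the lifelog captions below."},
            {"role": "user", "content": f"{context_text}\n\nQuestion: {query}"},
        ],
    )
    return {"answer": reply.choices[0].message.content,
            "key_frames": [path for _, _, path in ctx]}
```

Over-fetching from the ANN index and then applying the metadata filter keeps the sketch simple; a production setup would push the filter into the vector store where that is supported.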
5. Frontend UI
- Chat interface: shows the LLM’s answer text.
- Gallery panel: displays the returned key-frames with hoverable metadata (timestamp, GPS, mobile-use).
- Timeline slider & map: lets the user restrict the query scope interactively; a sketch of the response payload these panels consume follows this list.
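An illustrative shape for that payload; the field names are assumptions rather than a fixed API contract.

```python
# Hypothetical response payload consumed by the chat, gallery, and map/timeline panels.
from dataclasses import dataclass
from typing import List

@dataclass
class KeyFrame:
    ts: int                      # capture timestamp shown on hover
    photo_url: str               # image rendered in the gallery panel
    lat: float                   # plotted on the map
    lon: float
    is_using_mobile: bool        # mobile-use flag shown in the hover metadata

@dataclass
class QueryResponse:
    answer: str                  # narrative text for the chat interface
    key_frames: List[KeyFrame]   # gallery items and map/timeline markers
```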
C. Why This Is Better
- Granular Retrieval: you never “miss” a one-minute interaction, and you can always group adjacent minutes dynamically in your answer.
- Semantic Coherence via RAG: the LLM glues together whatever windows it needs, rather than depending on brittle precomputed event boundaries.
- Simplicity & Real-Time: you only need to ingest, embed, and index each minute’s data—no batch clustering or hyperparameter tuning for sub-events.
- Rich Answers: easily surface both narrative and the exact photos / metadata underpinning it.
Bottom line: skip hard-coded sub-event creation altogether. Instead, index every minute as a semantic doc, and let your RAG/query layer weave them into the right “event” on demand—delivering full semantic understanding, precise image recall, and real-time responsiveness.