Monce Selfservice v0.1.0

Memory-augmented extraction for Outlook / email workflows.


Memory-Augmented Extraction: A Concierge Sandwich Around a VLM Pipeline

Charles Dana, Monce SAS — April 2026

Abstract

Selfservice is a thin memory layer that sits around an existing VLM extraction pipeline. Every extraction mutates a per-user memory store; every subsequent extraction reads it back as context. The architecture is intentionally a mix: the extraction engine is monolith (/summit/extract), the memory model is concierge-style (a per-user memories.json), and the distillation step is Haiku 4.5. What's new is the reflex loop that chains them.

1. The Three-Layer Sandwich

The sandwich has three layers rather than two: recall, extract, and distill. Input is a file plus an email context (subject + body). Output is structured data plus distilled insights.

Recall: before extraction, selfservice scores the user's memory against the current email subject and returns the most relevant prior memories. These are surfaced in the response as prior_memories so the add-in can display them.
Extract: selfservice forwards file + context to monolith/summit/extract. Monolith runs its 7-stage VLM pipeline unchanged; selfservice adds no stages of its own, so extraction latency is essentially monolith's.
Distill: when auto_memory=true, a Haiku call receives the compacted extraction result, the email context, and the relevant prior memories. It returns 1–3 bullets of at most 110 characters each, focused on surprising or decision-relevant patterns. Bullets are written back as memory entries tagged insight.
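The output contract of the Distill step (1–3 bullets, each at most 110 characters, with a deterministic fallback when the model replies in prose) can be sketched as follows. This is a minimal illustration, not the Selfservice implementation: the names parse_bullets and fallback_bullet, the wording of the fallback bullet, and the assumption that the model replies with a JSON array of strings are all hypothetical.

```python
import json

MAX_BULLETS = 3    # "1-3 bullets" from the Distill description
MAX_CHARS = 110    # per-bullet character limit from the Distill description


def fallback_bullet(subject: str) -> list[str]:
    """Deterministic bullet used when the model reply is unusable (hypothetical wording)."""
    return [f"Extraction logged for: {subject}"[:MAX_CHARS]]


def parse_bullets(raw_reply: str, subject: str) -> list[str]:
    """Parse a model reply that is ASSUMED to be a JSON array of strings."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        # The model answered in prose instead of JSON.
        return fallback_bullet(subject)
    if not isinstance(data, list):
        return fallback_bullet(subject)
    # Keep non-empty strings only, truncate each one, then cap the count.
    bullets = [b.strip()[:MAX_CHARS] for b in data if isinstance(b, str) and b.strip()]
    return bullets[:MAX_BULLETS] or fallback_bullet(subject)
```

Truncating over-long bullets rather than rejecting the whole reply is a design choice made for this sketch; a stricter variant could fall back whenever any bullet exceeds the limit.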

2. Why Memory Is Per-User (Not Per-Factory)

Concierge tracks factory-level patterns (top clients per tenant, glass type frequency, synonym recommendations). Selfservice tracks how an individual operator behaves: the shorthand they type in emails, the factories they route to, the corrections they make on the same supplier every week. This information is useless at factory-level aggregation and highly valuable when threaded through a single operator's sequence of extractions.

Storage is isolated under data/users/{user_id}/. There is no cross-user chat, no cross-user recall, no cross-user digest. The user_id is an 8-character opaque token persisted client-side in the Outlook add-in's roaming settings.
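A minimal sketch of how per-user isolation under data/users/{user_id}/ might be enforced at the path level. The token alphabet (lowercase letters and digits), the helper names new_user_id and memory_path, and the placement of memories.json directly inside the user directory are assumptions made for illustration; the source only states that the token is 8 characters and opaque.

```python
import re
import secrets
import string
from pathlib import Path

DATA_ROOT = Path("data/users")
_ALPHABET = string.ascii_lowercase + string.digits   # ASSUMED token alphabet
_USER_ID_RE = re.compile(r"[a-z0-9]{8}")


def new_user_id() -> str:
    """Generate an 8-character opaque token (hypothetical helper)."""
    return "".join(secrets.choice(_ALPHABET) for _ in range(8))


def memory_path(user_id: str, root: Path = DATA_ROOT) -> Path:
    """Resolve a user's memory file, rejecting anything that is not a valid token."""
    if not _USER_ID_RE.fullmatch(user_id):
        # Blocks path traversal such as "../other" before touching the disk.
        raise ValueError("user_id must be 8 lowercase alphanumeric characters")
    return root / user_id / "memories.json"
```

Validating the token before building the path means a malformed or hostile user_id can never resolve to another user's directory.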

3. The Reflex Loop

When Outlook(user_id=..., auto_memory=True) is instantiated, every call to extract_email() fires:

recall(subject)
  → extract(file, context=body)
  → distill(result, prior=recall_output)
  → remember(bullets)

The Haiku pass is cheap (~$0.0003/call) and short (~1–2 s), which makes it acceptable to leave the reflex always on. When auto_memory=false, recall and logging still happen; only the Haiku distillation is skipped.
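The chain above can be sketched as a single function. This is an illustration under stated assumptions, not the Outlook class itself: reflex_extract, its parameters, the in-memory list used as the store, and the word-overlap scoring are hypothetical, the extract and distill steps are injected as callables standing in for the monolith and Haiku calls, and logging is omitted.

```python
from typing import Callable


def reflex_extract(
    subject: str,
    body: str,
    file_bytes: bytes,
    memory: list[str],
    extract: Callable[[bytes, str], dict],
    distill: Callable[[dict, list[str]], list[str]],
    auto_memory: bool = True,
    top_k: int = 3,
) -> dict:
    # Recall: rank stored memories by word overlap with the subject (ASSUMED scoring).
    words = set(subject.lower().split())
    scored = [(len(words & set(m.lower().split())), m) for m in memory]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    prior = [m for score, m in ranked if score > 0][:top_k]

    # Extract: the body is passed as context, as in extract(file, context=body).
    result = extract(file_bytes, body)

    # Distill + remember: skipped entirely when auto_memory is off.
    insights: list[str] = []
    if auto_memory:
        insights = distill(result, prior)
        memory.extend(insights)

    return {"result": result, "prior_memories": prior, "insights": insights}
```

Recall runs whether or not auto_memory is set, matching the behaviour described above; only the distill and remember steps sit behind the flag.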

4. Chat Over Memory

/v1/chat runs Sonnet 4.6, grounded strictly on the requesting user's memories, extraction history, and recent conversations. The system prompt pins the model to that user's scope. Questions about topics outside the user's data yield "no record" rather than a hallucinated answer.

5. What This Replaces

Monolith's own add-in posted to /extract directly and had no memory. Concierge tracked the tenant but not the operator. Selfservice fills the gap between the two. The add-in repoints its base URL and inherits memory for free.

6. Limitations and Open Questions

The insight distillation relies on Haiku's JSON discipline: when the model returns prose instead of JSON, _fallback_insights produces a simpler deterministic bullet.
Memory is file-backed JSON. At >1000 memories per user, the cost of keyword scoring becomes noticeable (on the order of milliseconds). The roadmap is to add a digest layer (pre-aggregated summaries, concierge-style) before considering a vector store.
Forgetting is substring-based, which can over-delete. A /v1/forget_by_id endpoint is planned.
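The over-deletion risk of substring-based forgetting, and how an identifier-based delete avoids it, can be illustrated with a small sketch. The record shape (dicts with id and text keys), the function names, and the sample memories are assumptions made for this example, not the Selfservice storage format.

```python
def forget_by_substring(memories: list[dict], needle: str) -> list[dict]:
    """Drop every memory whose text contains the needle (case-insensitive)."""
    needle = needle.lower()
    return [m for m in memories if needle not in m["text"].lower()]


def forget_by_id(memories: list[dict], memory_id: str) -> list[dict]:
    """Drop exactly one memory, identified by its id (hypothetical field)."""
    return [m for m in memories if m["id"] != memory_id]


memories = [
    {"id": "m1", "text": "Supplier Glas uses 4 mm float by default"},
    {"id": "m2", "text": "Glass type is often omitted in the subject"},
]

# Intending to forget only the supplier "Glas" also removes the "Glass type" note.
after_substring = forget_by_substring(memories, "glas")

# An id-based delete removes only the targeted entry.
after_id = forget_by_id(memories, "m1")
```

Here after_substring is empty because "glas" is a substring of "Glass", while after_id still holds the m2 entry.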