One /v1/extract call decomposes into three Bedrock-chargeable
pieces. Monolith's VLM pipeline is by far the dominant cost; of selfservice's
additions (recall, log, distillation), only distillation bills Bedrock, and it
adds under 5% overhead.

| Component | Model | Tokens (typical) | Cost |
|---|---|---|---|
| Monolith VLM pipeline (7 stages) | Haiku 4.5 + Sonnet 4.6 | 4k–12k in / 2k–4k out | ~$0.008 – $0.020 |
| Insight distillation (auto_memory=true) | Haiku 4.5 | ~600 in / ~150 out | ~$0.0003 |
| Memory recall (keyword-scored) | none (deterministic) | 0 | $0 |
| Event log + memory append | none (disk JSON) | 0 | $0 |
| Total per extraction | | | ~$0.008 – $0.020 |

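
The per-row costs above are token counts multiplied by per-token rates. A minimal sketch of that arithmetic follows. The `Rate` type, the `call_cost` helper, and especially the rate values are assumptions for illustration: the rates are placeholders picked because they reproduce the table's ~$0.0003 distillation figure, not numbers quoted from a Bedrock price list, so substitute the current rates for your region before relying on the output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """USD per million tokens (hypothetical container, not a Bedrock API type)."""
    usd_per_mtok_in: float
    usd_per_mtok_out: float


def call_cost(tokens_in: int, tokens_out: int, rate: Rate) -> float:
    """Cost in USD of one model call: tokens times the per-million-token rate."""
    total = tokens_in * rate.usd_per_mtok_in + tokens_out * rate.usd_per_mtok_out
    return total / 1_000_000


# ASSUMPTION: placeholder rates, chosen only to match the table's distillation row.
ASSUMED_DISTILL_RATE = Rate(usd_per_mtok_in=0.25, usd_per_mtok_out=1.25)

# Typical distillation call from the table: ~600 tokens in, ~150 tokens out.
distill_cost = call_cost(600, 150, ASSUMED_DISTILL_RATE)
print(f"distillation: ${distill_cost:.4f}")  # prints "distillation: $0.0003"
```

The same helper can be applied per stage to the monolith pipeline once the split of tokens between the two models is known; the table does not give that split, so it is left out here.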

| Component | Typical latency |
|---|---|
| Recall (keyword-scored memory search) | < 50 ms |
| Monolith extraction (1-page PDF) | 8 – 12 s |
| Monolith extraction (dense multi-page) | 20 – 60 s |
| Insight distillation (Haiku) | 1 – 2 s |
| Memory write (disk) | < 20 ms |


| Resource | Tier | Monthly |
|---|---|---|
| EC2 t3.small (2 vCPU, 2 GB) | on-demand, eu-west-3 | ~$15 |
| EBS gp3 root volume | 20 GB | ~$2 |
| Route53 hosted zone entry | 1 A record | $0 (zone already billed) |
| Let's Encrypt SSL | free | $0 |
| Data transfer | low (files proxy-only) | < $1 |
| Baseline infrastructure | | ~$18 / month |

The t3.small is comfortably over-provisioned: FastAPI under gunicorn with 2
workers does only thin I/O, which leaves most CPU and memory idle. The
bottleneck is always upstream Bedrock latency, not the box. A single instance
sustains >100 concurrent /v1/extract calls before connection-pool pressure
appears.

Insight distillation adds one Haiku call per auto_memory=true call. If volume
scales, batch or debounce it (distill every N-th extraction rather than every
one).

At $0.012 / extraction average and ~$18/month fixed, break-even vs per-user SaaS is at ~1500 extractions/month across all users. For an addin shipping to a 10-operator glass ops team doing 15 extractions/day each (150/day in total), that threshold is reached in about 10 working days, i.e. by the end of week 2.
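
The "distill every N-th extraction" idea mentioned above could be sketched as a simple counter gate. This is a minimal illustration under stated assumptions, not the service's actual implementation: `DistillationGate`, `should_distill`, and `every_n` are hypothetical names, and the counter lives in process memory.

```python
class DistillationGate:
    """Hypothetical helper: lets one call in every `every_n` through."""

    def __init__(self, every_n: int) -> None:
        if every_n < 1:
            raise ValueError("every_n must be >= 1")
        self.every_n = every_n
        self._seen = 0  # extractions observed so far in this process

    def should_distill(self) -> bool:
        """Count one extraction; return True only on every N-th one."""
        self._seen += 1
        return self._seen % self.every_n == 0


# Example: with every_n=5, ten extractions trigger two distillation calls.
gate = DistillationGate(every_n=5)
decisions = [gate.should_distill() for _ in range(10)]
distill_calls = sum(decisions)
print(distill_calls)  # prints 2
```

Because the counter is per process, a deployment with two gunicorn workers would keep two independent counts, and the class as written is not thread-safe. A shared counter (for example, one persisted next to the event log) would be needed if an exact global every-N-th cadence matters.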