A clean, observable pipeline. Four stages, fully inspectable.
CHUNKZA replaces the black box of ad-hoc chunking with a pipeline you can see, diff, and replay. Here's exactly what happens to your documents.
Ingest
Bring your corpora from any supported source.
Connect a source — a folder, a bucket, a Notion workspace, a Confluence space — or upload files directly. CHUNKZA normalizes PDF, DOCX, PPTX, Markdown, HTML, and plain text into a single structural representation, preserving headings, tables, lists, and captions.
- PDF, DOCX, PPTX, Markdown, HTML, and plain text, plus Notion and Confluence sources
- OCR pass for scanned documents
- Source URI and provenance preserved
- Incremental sync for live sources
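To make incremental sync concrete, here is a minimal sketch of one common approach: hash each document's content and compare it with the hash recorded at the last sync. This is an illustration only, not CHUNKZA's implementation. The names `content_hash` and `plan_sync` and the URI-keyed dictionaries are assumptions made for the example.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a stable fingerprint of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(previous: dict[str, str], current: dict[str, bytes]) -> dict:
    """Decide which documents need re-ingesting.

    previous: source URI -> content hash recorded at the last sync (hypothetical store)
    current:  source URI -> raw bytes fetched from the source right now
    """
    current_hashes = {uri: content_hash(data) for uri, data in current.items()}
    added = sorted(uri for uri in current_hashes if uri not in previous)
    changed = sorted(
        uri for uri in current_hashes
        if uri in previous and previous[uri] != current_hashes[uri]
    )
    deleted = sorted(uri for uri in previous if uri not in current_hashes)
    return {
        "added": added,
        "changed": changed,
        "deleted": deleted,
        "hashes": current_hashes,  # persist these for the next sync
    }
```

Only the URIs listed under `added` and `changed` would need to be parsed and chunked again. Because every decision is keyed by source URI, provenance carries through to the sync plan.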
Parse & split
Chunk by structure, then by meaning.
Layout-aware segmentation identifies structural boundaries first. A semantic boundary model then refines the splits inside long passages, predicting where topics shift. Parent and child chunks are linked automatically, with metadata injected at every level.
- Layout-aware structural segmentation
- Semantic boundary detection on long passages
- Parent-child linking with shared metadata
- Per-section policy overrides
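The two-pass idea can be sketched in a few lines. The example below splits a Markdown document at headings first, then breaks each section into child chunks linked to their parent. It is a simplified illustration under stated assumptions: a greedy paragraph packer with a word budget stands in for the semantic boundary model, and the `Chunk` class, function names, and `max_words` parameter are hypothetical rather than CHUNKZA's API.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str]  # None marks a parent (section-level) chunk
    metadata: dict = field(default_factory=dict)


def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Structural pass: return (heading, body) pairs; empty sections are skipped."""
    sections, heading, lines = [], "(preamble)", []
    for line in markdown.splitlines():
        match = re.match(r"#{1,6}\s+(.+)", line)
        if match:
            if "".join(lines).strip():
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if "".join(lines).strip():
        sections.append((heading, "\n".join(lines).strip()))
    return sections


def split_long_passage(body: str, max_words: int) -> list[str]:
    """Stand-in for the semantic pass: pack whole paragraphs under a word budget.

    A single paragraph longer than the budget is kept intact, not cut mid-text.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    pieces, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            pieces.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        pieces.append("\n\n".join(current))
    return pieces


def chunk_document(markdown: str, source_uri: str, max_words: int = 50) -> list[Chunk]:
    """Emit one parent chunk per section, followed by its linked child chunks."""
    chunks = []
    for s_idx, (heading, body) in enumerate(split_by_headings(markdown)):
        parent_id = f"sec-{s_idx}"
        meta = {"source_uri": source_uri, "heading": heading}  # shared metadata
        chunks.append(Chunk(parent_id, body, None, dict(meta)))
        for c_idx, piece in enumerate(split_long_passage(body, max_words)):
            chunks.append(Chunk(f"{parent_id}-{c_idx}", piece, parent_id, dict(meta)))
    return chunks
```

Each child carries its parent's ID and a copy of the section metadata, so a retriever can match on the small chunk and still expand to the full section for context.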
Visualize
Inspect every boundary before you ship.
Open the diagnostic panel to preview chunk boundaries in context, inspect metadata on each chunk, and project embeddings into 2D to spot clusters, outliers, and duplicates. Diff any two strategies side by side and watch which boundaries move.
- Live boundary preview with token budgets
- Embedding distribution map
- Strategy diff with recall impact
- Metadata and schema validation
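A strategy diff comes down to comparing where two chunkings cut the same text. The sketch below shows one way to compute which boundaries stay, disappear, or appear. It assumes each strategy's chunks concatenate back to the original text with no overlap; the function names are hypothetical, and it does not reproduce the diagnostic panel's own logic.

```python
def boundary_offsets(chunks: list[str]) -> set[int]:
    """Character offsets of the cuts between consecutive chunks.

    Assumes the chunks concatenate back to the source text with no overlap.
    """
    offsets, position = set(), 0
    for chunk in chunks[:-1]:  # the end of the last chunk is not a cut
        position += len(chunk)
        offsets.add(position)
    return offsets


def diff_boundaries(strategy_a: list[str], strategy_b: list[str]) -> dict[str, list[int]]:
    """Classify boundaries when moving from strategy A to strategy B."""
    a, b = boundary_offsets(strategy_a), boundary_offsets(strategy_b)
    return {
        "kept": sorted(a & b),     # cuts both strategies agree on
        "removed": sorted(a - b),  # cuts only strategy A makes
        "added": sorted(b - a),    # cuts only strategy B makes
    }
```

For example, comparing `["aaaa", "bbbb", "cccc"]` with `["aaaa", "bbbbcc", "cc"]` reports the cut at offset 4 as kept, the cut at 8 as removed, and a new cut at 10.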
Retrieve
Export, replay, and measure.
Push the chunked corpus to your vector store in one command. Replay real queries against any chunking version to see which chunks surfaced, in what order, and with what score. Iterate on the policy, re-export, and track how recall@k changes from one version to the next.
- One-command export to Pinecone, Weaviate, Qdrant, pgvector
- Retrieval replay with score breakdowns
- Recall@k and context-token dashboards
- Versioned chunking policies, fully reproducible
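For readers unfamiliar with the metric, recall@k is the fraction of the relevant chunks for a query that appear among the top k retrieved results. The sketch below computes it for a single query and averages it over a set of replayed queries. The function names and the input shape (ranked chunk IDs paired with a labeled set of relevant IDs) are assumptions for illustration, not CHUNKZA's export format.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0  # convention chosen here: no labeled relevant chunks scores zero
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over replayed queries.

    Each run pairs the ranked chunk IDs returned for a query with the
    set of chunk IDs labeled relevant for that query.
    """
    if not runs:
        return 0.0
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores)
```

Running the same labeled queries against two chunking versions and comparing the mean values gives a like-for-like view of how a policy change affects retrieval.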
See it on your own documents
Bring a sample corpus. We'll run it through the pipeline and show you the diagnostic panel live.