Features

Everything you need to fix retrieval at the source.

CHUNKZA is a complete workspace for the data layer of RAG — from ingestion to retrieval replay, every stage is inspectable.

Diagnostic panel

One panel for the whole chunking lifecycle.

Stop debugging retrieval in production. Inspect every boundary, every metadata tag, every embedding cluster — before you ship a single chunk to your vector store.

  • Live boundary preview with token budgets
  • Per-chunk metadata and source provenance
  • Embedding clusters, duplicates, and outlier flags
  • Retrieval replay with score breakdowns
chunkza · diagnosticlive
v1.4

Source · paper.pdf

Chunks · 5

Parent · §2.1parent

318 tokens

Child · semanticchild

92 tokens

Child · semanticchild

74 tokens

Child · overlapoverlap

41 tokens

Layout · tablelayout

56 tokens

Recall@5

91.4%+18.2

Boundary acc.

96.7%+11.4

Latency

42ms-3.1x
Ingest & parse

Bring any document. Keep its structure.

Multi-format ingestion

PDF, DOCX, PPTX, Markdown, HTML, Notion exports, Confluence, and raw text — all parsed into a unified structural representation.

Layout-aware segmentation

Headings, sub-headings, lists, tables, captions, and code blocks become first-class boundaries. No more slicing a table in half.

Schema-aware tables

Table headers propagate to every row chunk, so numeric and structured retrieval stays coherent and queryable.

Split & link

Chunk by meaning, not by character count.

Semantic boundary detection

A trained model predicts natural topic shifts inside long passages. Chunks end where meaning ends.

Parent-child chunking

Retrieve on tight, cheap child chunks. Return the rich parent context to the model. Recall and precision in one pipeline.

Overlap & window control

Configurable overlap, sliding windows, and max-context budgets — all policy-driven and per-section overridable.

Visualize & diagnose

Make the invisible visible.

Chunk boundary preview

See exactly where every strategy cuts, in context, with token counts and metadata attached to each chunk.

Embedding distribution map

Project chunk embeddings into 2D space. Spot clusters, outliers, and silent duplicates that drag down retrieval.

Strategy diff

Compare any two chunking versions side by side. See which boundaries moved and what it does to recall.

Retrieve & iterate

Close the loop with real queries.

Retrieval replay

Replay real queries against any chunking version. Watch which chunks surfaced, in what order, with what score.

Metadata injection

Auto-tag every chunk with section path, source URI, page, and custom schema fields for query-aware retrieval.

Export anywhere

Push chunked corpora to Pinecone, Weaviate, Qdrant, pgvector, or your own store with one command.

Start chunking smarter today

Replace crude character counts with layout-aware, semantically-boundary-aware chunking. See your retrieval quality rise from the source.