five/rag/README.md
CharlesKWON b8a19bd350 docs(rag): note bluge full-text index (built via KWONDoc source)
Corpus indexed into KWONDoc's bluge index (~/.kwondoc/search-index,
category five-rag) so bluge_search surfaces it; README documents the
re-index command (cmd/ragindex online upsert, doesn't wipe other docs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 16:35:00 +09:00

# Five RAG — knowledge corpus for LLM agents writing Five
A compact, retrieval-ready knowledge base that lets an LLM read and write **Five**
(xBase/Harbour → Go) code correctly without prior training on it. This is the practical
form of "give the model the grammar via RAG": grammar + RTL surface + real idioms +
the long-tail gotchas.
## Why this exists
Five is token-dense, so the corpus needed to *teach* a model is small and cheap to inject
— a dense language is cheaper to RAG than a verbose one. Grammar/RTL retrieval closes most
of the gap; the accumulating **gotchas** file closes the semantic long tail.
## Contents
| File | What it covers |
|------|----------------|
| `01-overview.md` | What Five is, design priorities, the two runtimes, compile model |
| `02-syntax.md` | Declarations, literals, operators, control flow, code blocks |
| `03-rtl-catalog.md` | Runtime-library functions (strings, array, hash, JSON, date, regex, charset, …) |
| `04-idioms.md` | Web/worker patterns: HTTP endpoint, routing, Postgres, job queue, LLM, build/deploy |
| `05-gotchas.md` | Non-obvious traps + fixes (the highest-signal file) |
| `06-security.md` | Web security patterns: authz, sessions, password hashing, XSS, CSP, uploads |
| `INDEX.md` | Retrieval manifest (doc → keywords + one-line) |
Every file has YAML frontmatter (`doc`, `title`, `keywords`, `summary`) for ranking.
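A minimal sketch of how a consumer might read that frontmatter and use it for ranking. It assumes a flat `key: value` layout and a bracketed, comma-separated `keywords` list; the real files may use a different YAML shape, in which case a proper YAML parser would be the safer choice. The names `splitFrontmatter` and `score` and the sample document are hypothetical. The ×3 keyword weight mirrors the weighting described for `search.sh` under "How to consume", but this is not a port of that script.

```go
package main

import (
	"fmt"
	"strings"
)

// Frontmatter holds the four fields the README lists.
// Assumption: values are single-line "key: value" pairs.
type Frontmatter struct {
	Doc, Title, Summary string
	Keywords            []string
}

// sample is a made-up document used only to exercise the sketch.
const sample = "---\ndoc: 06-security\ntitle: Web security patterns\n" +
	"keywords: [session, token, csprng]\nsummary: Authz, sessions, hashing.\n---\n" +
	"## Sessions\nGenerate each session token from a CSPRNG.\n"

// splitFrontmatter separates a leading "---" ... "---" block from the body.
// It returns ok=false (and the input unchanged) when no complete block is found.
func splitFrontmatter(src string) (Frontmatter, string, bool) {
	var fm Frontmatter
	lines := strings.Split(src, "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		return Frontmatter{}, src, false
	}
	for i := 1; i < len(lines); i++ {
		line := strings.TrimSpace(lines[i])
		if line == "---" {
			return fm, strings.Join(lines[i+1:], "\n"), true
		}
		key, val, found := strings.Cut(line, ":")
		if !found {
			continue
		}
		val = strings.TrimSpace(val)
		switch strings.TrimSpace(key) {
		case "doc":
			fm.Doc = val
		case "title":
			fm.Title = val
		case "summary":
			fm.Summary = val
		case "keywords":
			// Assumed form: [a, b, c] or a, b, c
			for _, k := range strings.Split(strings.Trim(val, "[]"), ",") {
				if k = strings.ToLower(strings.TrimSpace(k)); k != "" {
					fm.Keywords = append(fm.Keywords, k)
				}
			}
		}
	}
	return Frontmatter{}, src, false // opening "---" without a closing one
}

// score adds 3 for each query term that equals a frontmatter keyword and 1 for
// each case-insensitive occurrence of the term in the body.
func score(fm Frontmatter, body string, terms []string) int {
	lowerBody := strings.ToLower(body)
	total := 0
	for _, t := range terms {
		t = strings.ToLower(strings.TrimSpace(t))
		if t == "" {
			continue
		}
		for _, k := range fm.Keywords {
			if k == t {
				total += 3
			}
		}
		total += strings.Count(lowerBody, t)
	}
	return total
}

func main() {
	fm, body, ok := splitFrontmatter(sample)
	if !ok {
		fmt.Println("no frontmatter found")
		return
	}
	fmt.Println(fm.Doc, fm.Keywords, score(fm, body, []string{"session", "token", "csprng"}))
}
```

Body matches here are plain substring counts, so `session` also matches `sessions`; a stricter ranker would tokenize first.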
## How to consume
- **Direct context injection (simplest):** for a small/medium task, paste the relevant
doc(s). For broad work, `01`+`02`+`05` fit easily; pull `03`/`04` sections as needed.
- **Keyword retrieval (built-in):** run `./search.sh <terms>` — a dependency-free
ripgrep/grep ranker over the corpus (frontmatter `keywords` weighted ×3 + body),
printing ranked docs with the matching `##` section headers. No index to build.
e.g. `./search.sh session token csprng` → `06-security.md §2`.
- **bluge full-text index (KWONDoc):** this corpus is indexed into KWONDoc's bluge
index (`~/.kwondoc/search-index`, category `five-rag`) so `bluge_search` finds it.
Re-index after edits: `cd ~/kwondoc && go run ./cmd/ragindex <abs path to rag> five-rag`
(online upsert — keyed by file path, does not wipe other docs).
- `INDEX.md` is the hand-curated routing table; an embeddings index can ingest the same `.md` files.
- **Embedding RAG:** chunk by `##` headers (each section is self-contained). Frontmatter
`summary` makes a good chunk preamble.
Suggested system-prompt pointer: *"When writing Five (.prg) code, consult the Five RAG at
`fivedev/five/rag/` — especially `05-gotchas.md` — and prefer patterns from `04-idioms.md`."*
## Maintenance
- Keep `03-rtl-catalog.md` honest against `hbrtl/register.go`: the function names registered
there are authoritative, and the occasional catalog signature may drift from them.
- **Append every new trap to `05-gotchas.md`.** That file is the compounding asset.
- Grammar truth: `compiler/{lexer,parser,ast}`. Idiom truth: the `solmade` app.