five/rag/README.md
CharlesKWON 59d7e490b4 docs(rag): quality-gate idiom + dependency-free search.sh
- 04-idioms: document the lint.sh + smoke_test.sh gates and their wiring
  (build.sh gate, pre-commit hook, deploy-time smoke).
- search.sh: ripgrep/grep keyword ranker over the corpus (keywords ×3 +
  body), prints ranked docs + matching section headers — makes the RAG
  searchable with no index to build. README updated.
- Note: KWONDoc bluge MCP/CLI was unavailable here (MCP not connected;
  CLI license-gated), so search.sh delivers the "searchable" goal now; a
  bluge/embeddings index can ingest the same .md files later.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 16:26:54 +09:00

# Five RAG — knowledge corpus for LLM agents writing Five
A compact, retrieval-ready knowledge base that lets an LLM read and write **Five**
(xBase/Harbour → Go) code correctly without prior training on it. This is the practical
form of "give the model the grammar via RAG": grammar + RTL surface + real idioms +
the long-tail gotchas.
## Why this exists
Five is token-dense, so the corpus needed to *teach* a model is small and cheap to inject
— a dense language is cheaper to RAG than a verbose one. Grammar/RTL retrieval closes most
of the gap; the accumulating **gotchas** file closes the semantic long tail.
## Contents
| File | What it covers |
|------|----------------|
| `01-overview.md` | What Five is, design priorities, the two runtimes, compile model |
| `02-syntax.md` | Declarations, literals, operators, control flow, code blocks |
| `03-rtl-catalog.md` | Runtime-library functions (strings, array, hash, JSON, date, regex, charset, …) |
| `04-idioms.md` | Web/worker patterns: HTTP endpoint, routing, Postgres, job queue, LLM, build/deploy |
| `05-gotchas.md` | Non-obvious traps + fixes (the highest-signal file) |
| `06-security.md` | Web security patterns: authz, sessions, password hashing, XSS, CSP, uploads |
| `INDEX.md` | Retrieval manifest (doc → keywords + one-line) |
Every file has YAML frontmatter (`doc`, `title`, `keywords`, `summary`) for ranking.
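
As an illustration of how a consumer might read that frontmatter, here is a minimal standard-library sketch. It assumes the block is delimited by `---` lines and holds flat `key: value` pairs, with `keywords` written either as an inline list (`[a, b]`) or as a bare comma-separated string. Those layout details, the `parse_frontmatter` name, and the sample values are assumptions for illustration, not taken from the corpus files.

```python
# Hypothetical sample; the real frontmatter values in the corpus may differ.
SAMPLE = """---
doc: 06-security
title: Web security patterns
keywords: [session, token, csprng]
summary: Authz, sessions, password hashing, XSS, CSP, uploads.
---
## Sessions
Body text.
"""


def parse_frontmatter(text):
    """Return (metadata dict, body) for a document with '---' delimited frontmatter."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter at all
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is the Markdown body.
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not "key: value"
        key, value = key.strip(), value.strip()
        if key == "keywords":
            # Assumed shape: inline list "[a, b]" or bare "a, b".
            value = [k.strip() for k in value.strip("[]").split(",") if k.strip()]
        meta[key] = value
    return {}, text  # closing delimiter missing: treat as plain Markdown


if __name__ == "__main__":
    meta, body = parse_frontmatter(SAMPLE)
    print(meta["doc"], meta["keywords"])
```

A full YAML parser would be the safer choice if the frontmatter ever uses multi-line values or quoting; this sketch only covers the flat case.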
## How to consume
- **Direct context injection (simplest):** for a small/medium task, paste the relevant
doc(s). For broad work, `01`+`02`+`05` fit easily; pull `03`/`04` sections as needed.
- **Keyword retrieval (built-in):** run `./search.sh <terms>` — a dependency-free
ripgrep/grep ranker over the corpus (frontmatter `keywords` weighted ×3 + body),
printing ranked docs with the matching `##` section headers. No index to build.
e.g. `./search.sh session token csprng` → `06-security.md §2`. `INDEX.md` is the
hand-curated routing table; a bluge/embeddings index can ingest the same `.md` files.
- **Embedding RAG:** chunk by `##` headers (each section is self-contained). Frontmatter
`summary` makes a good chunk preamble.
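
The header-based chunking suggested for embedding RAG could look roughly like the sketch below. It splits a document body on `## ` lines, ignores `## ` lines inside fenced code blocks, and prefixes each chunk with the frontmatter `summary`. The function name `chunk_by_h2` and the chunk dictionary shape are hypothetical, and text before the first `##` header (title and intro) is deliberately skipped here.

```python
def _make_chunk(header, lines, summary):
    """Assemble one chunk: optional summary preamble, then the section itself."""
    content = "\n".join(lines).strip()
    section = f"## {header}\n{content}"
    text = f"{summary}\n\n{section}" if summary else section
    return {"section": header, "text": text}


def chunk_by_h2(body, summary=""):
    """Split Markdown into one chunk per '## ' section (hypothetical helper)."""
    chunks = []
    header, lines = None, []
    in_fence = False
    for line in body.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # track fenced code so '## ' inside it is kept
        if line.startswith("## ") and not in_fence:
            if header is not None:
                chunks.append(_make_chunk(header, lines, summary))
            header, lines = line[3:].strip(), []
        elif header is not None:
            lines.append(line)
        # Lines before the first '## ' header are skipped in this sketch.
    if header is not None:
        chunks.append(_make_chunk(header, lines, summary))
    return chunks


if __name__ == "__main__":
    doc = "# Title\nintro\n## A\nalpha\n## B\nbeta\n"
    for chunk in chunk_by_h2(doc, summary="One-line summary."):
        print(chunk["section"], len(chunk["text"]))
```

Keeping the summary in every chunk costs a few tokens per chunk but gives the embedding some document-level context for sections that are short on their own.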
Suggested system-prompt pointer: *"When writing Five (.prg) code, consult the Five RAG at
`fivedev/five/rag/` — especially `05-gotchas.md` — and prefer patterns from `04-idioms.md`."*
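
To make the keyword weighting described for `search.sh` concrete (frontmatter `keywords` ×3 plus body), here is a small sketch of that scoring idea. It is not a port of `search.sh`: the exact counting rules (substring matches, case-insensitive, one body hit per occurrence), the function names, and the sample documents are all assumptions for illustration.

```python
def score(terms, keywords, body):
    """Score one document: 3 per matching keyword, 1 per body occurrence, per term."""
    kws = [k.lower() for k in keywords]
    text = body.lower()
    total = 0
    for term in terms:
        t = term.lower()
        if not t:
            continue  # ignore empty query terms
        total += 3 * sum(1 for k in kws if t in k)  # frontmatter keywords weighted x3
        total += text.count(t)                      # plain body occurrences
    return total


def rank(terms, docs):
    """docs maps name -> (keywords, body); returns [(name, score)], best first."""
    scored = [(name, score(terms, kws, body)) for name, (kws, body) in docs.items()]
    hits = [item for item in scored if item[1] > 0]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical miniature corpus; real keywords and bodies will differ.
    docs = {
        "06-security.md": (["session", "token", "csprng"],
                           "Generate the session token with a CSPRNG."),
        "02-syntax.md": (["literals", "operators"],
                         "A token-dense syntax."),
    }
    for name, points in rank(["session", "token", "csprng"], docs):
        print(points, name)
```

Weighting curated keywords above body text lets a hand-maintained vocabulary steer the ranking, while body matches still surface documents whose keywords were never updated.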
## Maintenance
- Keep `03-rtl-catalog.md` in sync with `hbrtl/register.go` (the function names there are
authoritative; signatures of rarely used functions may drift).
- **Append every new trap to `05-gotchas.md`.** That file is the compounding asset.
- Grammar truth: `compiler/{lexer,parser,ast}`. Idiom truth: the `solmade` app.