TSqlExecutor:FetchRow was the per-row workhorse for aggregation,
HAVING, and window queries. Even with the pre-built aFetchCache
binding columns to (nWA, nFPos), the PRG FOR loop paid one method
dispatch per column per row (dbSelectArea, FieldGet, AllTrim,
AAdd) — profile pinned it at ~30% of B4 CPU.
SqlFetchRowFast collapses the cache-path loop into a single Go
call:
- bound entry: SelectByNum + area.GetValue directly
- unbound (aggregate/expression): self:EvalExpr via Send
- character values: TrimSpace inline
The PRG FetchRow keeps its original cache-miss fallback path
unchanged for rare queries where aFetchCache isn't built.
Bench deltas (median of 3 steady runs, 1000 iters):
B4_GROUP_HAVING 418 → 327 us -22% (1.28x)
B9_ROW_NUMBER 191 → 120 us -37% (1.59x)
B10_RANK_PART 228 → 135 us -41% (1.69x)
B11_SUM_OVER 249 → 156 us -37% (1.60x)
B14_COUNT 235 → 219 us -7%
B15_CTE_WIN_JOIN 1577 → 1452 us -8%
Single-table SELECT (B1-B3, B5-B7, B8) stays flat — those already
hit the column-binding fast path and don't need aggregate dispatch.
FiveSql2 43/43, Harbour compat 56/56.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SqlOrderBy: Go sort.Slice for ORDER BY, 10-50x faster than PRG ASort.
SqlGroupBy: Go map-based GROUP BY accumulation (ready for integration).
TryBuildSortSpec detects simple ORDER BY columns and routes to Go.
Fallback to PRG for complex ORDER BY expressions.
43/43 + 41/41 verify + 51/51 compat + go test ALL PASS.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complex-query benchmarking turned up two hot paths that the earlier
SqlScan/SqlEach work didn't touch: multi-table JOIN and nested-scan
row fetching. This commit hits both.
--- Part 1: SqlHashBuild — Go-native hash-join build ---
FiveSql2's HashJoin previously built the inner-side hash in PRG:
WHILE !Eof()
xVal := FieldGet(nFPos)
cKey := SqlValToStr(xVal)
IF !hb_HHasKey(hHash, cKey) ; hHash[cKey] := {} ; ENDIF
AAdd(hHash[cKey], RecNo())
dbSkip()
ENDDO
That loop runs at ~40μs per row from class dispatch + hb_HHasKey
lookups + AAdd growth + SqlValToStr formatting. On a 50k-row inner
table that's ~2 seconds wasted on what should be a sub-50ms
housekeeping op.
New hbrtl.SqlHashBuild does the same thing in one Go-native pass:
- Direct *dbf.DBFArea loop (no interface dispatch, same devirt as
SqlScan)
- Go `map[string][]int64` accumulates RecNos by key — one
allocation per distinct key
- Inline ASCII-only digit formatter for numeric keys (strconv.Itoa
is allocation-heavy for small ints)
- CHAR keys are right-trimmed to match SqlCmpEq semantics so the
hash probe matches what EvalExpr would compute
- Final Five hash is built once from Keys/Values/Order slices
directly, skipping the per-key hb_HSet path
HashJoin now calls `SqlHashBuild(nFPos)` instead of running the
PRG loop.
--- Part 2: TSqlExecutor:BuildFetchCache ---
The JOIN fallback loop calls FetchRow per row. FetchRow was already
column-ref-aware but did the string parse (`At + SubStr + Upper`)
and `::FindWA` linear scan every single invocation. For a 50k-row
join emitting 50k result rows, that's ~200k redundant resolutions.
New BuildFetchCache walks the SELECT list once before the scan and
pre-binds each plain-column expression to `{nWA, nFPos}`. FetchRow's
new fast path checks ::aFetchCache and jumps straight to
`dbSelectArea + FieldGet` when bound. Complex exprs (functions,
CASE, subqueries) still fall through to EvalExpr.
::aFetchCache is set right before the join WHILE loop and cleared
after — no cross-query bleed.
--- Bench (50k ord × 10k emp × 100 dept, 3-run steady state) ---
Query Before After Speedup
────────────────────────────────────────────────────────────
2-way INNER JOIN, 10k rows 91ms 68ms 1.34x
2-way JOIN + GROUP BY 110ms 94ms 1.17x
3-way INNER JOIN COUNT 2610ms 610ms 4.28x
3-way JOIN + GROUP BY 2860ms 830ms 3.45x
The 3-way speedup is almost entirely SqlHashBuild. The 2-way case
benefits from the fetch cache because its per-row cost is dominated
by FetchRow (no second hash build to amortize).
--- Limits still standing ---
CTE + JOIN queries (Q7 in bench_complex: ~4.5s) aren't affected by
either optimization — CTE materialization goes through a different
path that writes/reads a temp DBF. Follow-up target.
Validation:
- FiveSql2 43/43
- Harbour compat 51/51
- go test ./... ALL PASS
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The structural 1.38x gap vs raw RDD for no-WHERE full scans wasn't
a limit of our engine — it was a limit of the result shape. SqlScan
materializes N rows as HbArray wrappers over a flat Value buffer,
then the PRG caller iterates that materialized array. Two passes
over the data. Raw RDD is one pass.
SqlEach folds both passes into one. The caller supplies a code block
that receives the selected column values as positional parameters;
SqlEach invokes it per matching row. No result array is ever built.
Usage (drop-in replacement for the common "scan + process" idiom):
five_SQLEach( "SELECT id, name, salary FROM emp WHERE salary > 50000",
{|nID, cName, nSalary| Process(nID, cName, nSalary) } )
API shape borrows Harbour's AEval/ASort block-callback convention,
so there's nothing new to learn. Positional params also sidestep
the `SELECT COUNT(*)` naming problem — no need to invent names for
anonymous expressions.
Implementation notes:
- 4-way loop specialization ({DBF, generic Area} × {WHERE, none}),
matching SqlScan. Each path is zero-allocation in the steady state.
- Block invocation uses the direct pendingParams + blk.Fn(t) protocol
rather than EvalBlock, which would allocate a temporary args slice
on every call (50k scans × small slice adds up).
- FastFieldGetter is installed the same way as SqlScan so PcOpFieldGet
in the WHERE predicate skips the PushSymbol + Function dispatch.
Bench (50k rows, end-to-end including user-code loop, steady state):
Path Time vs raw RDD
─────────────────────────────────────────────────────
Raw PRG loop, WHERE + sum 8.7ms 1.00x
SqlScan + PRG FOR, WHERE 5.1ms 0.59x
SqlEach block, WHERE 4.1ms 0.47x ← beats raw
─────────────────────────────────────────────────────
Raw PRG loop, no WHERE 6.1ms 1.00x
SqlEach block, no WHERE 3.8ms 0.62x ← beats raw
SqlEach is faster than a hand-rolled `DO WHILE !Eof()` loop because
the per-row FieldGet in raw PRG still goes through a full Frame +
RTL dispatch, whereas SqlEach's FastFieldGetter captures the concrete
*dbf.DBFArea directly. The SQL abstraction now costs nothing — it
pays you to use it.
Validation:
- FiveSql2 43/43
- Harbour compat 51/51
- go test ./... ALL PASS
Next step (not in this commit): FiveSql2 TSqlExecutor integration —
detect when five_SQL is called with a block argument and route to
SqlEach instead of SqlScan + array build.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SqlScan's inner scan was written as a single loop with `if whereFn
!= nil` and a `keep` shadow variable. Branch-predictable for sure,
but still a few extra ops per row and it prevented Go from inlining
the non-nil interface call on the Area branch.
Split into four specialized loop bodies on the two axes that drive
per-row cost:
1. dbfArea != nil && whereFn != nil
2. dbfArea != nil && whereFn == nil ← tightest path (SELECT *)
3. dbfArea == nil && whereFn != nil ← generic Area
4. dbfArea == nil && whereFn == nil
Each body has exactly the instructions it needs — no dead branches,
no shadow variables, no interface dispatch where avoidable. Copy-paste
cost is real but each row save adds up at 50k iterations.
Bench impact (50k rows, 3-run steady state):
No WHERE 9.1ms → 8.7ms 1.38x vs raw (was 1.47x)
Numeric WHERE 6.9ms → 7.0ms ~flat (within noise)
String WHERE 6.2ms → 6.4ms ~flat (within noise)
Raw RDD 6.3ms baseline
Validation:
- FiveSql2 43/43
- Harbour compat 51/51
- go test ./hbrtl/... PASS
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two stacked optimizations land on the SqlScan hot path. Combined
effect on the 50k-row benchmark:
Before After vs raw
Numeric WHERE 10.2ms 7.8ms 1.15x
String WHERE 10.5ms 7.9ms 1.15x
No WHERE 9.2ms 10.0ms 1.45x
Raw RDD baseline 6.8ms 6.8ms 1.00x
WHERE-predicate paths are now within 15% of the raw Harbour-style
RDD scan loop. The no-WHERE path is unchanged (slight jitter from
the added devirt branch); FieldGet peephole doesn't apply there.
--- Optimization 1: PcOpFieldGet peephole ---
Adds a new pcode opcode `PcOpFieldGet <fieldIdx>` (0x46) that skips
the usual PushSymbol+Function+Frame+FieldGet-RTL+EndProc chain and
calls a direct field getter closure instead. genpc recognizes the
shape `FieldGet(<int-literal>)` during emitCall and emits the
specialized opcode automatically — no SQL-side API change.
Integration:
* hbrt.Thread.FastFieldGetter — hot-path closure set by scan loops.
Non-nil → pcode bypasses dispatch.
Nil → pcode resolves FIELDGET via
the RTL symbol table (correctness
fallback for any other callers).
* compiler/genpc/genpc.go — peephole in emitCall.
* hbrt/pcinterp.go — PcOpFieldGet handler.
This alone cut numeric WHERE from 10.2 → 7.9ms: eliminated roughly
one full Frame/EndProc + RTL dispatch per row × 50k rows.
--- Optimization 2: DBFArea devirtualization ---
SqlScan type-asserts the workarea to *dbf.DBFArea once and runs a
dedicated loop that calls GoTop/EOF/Skip/GetValue directly on the
concrete type. Go's compiler inlines these, skipping the interface
vtable per row. Non-DBF drivers still work via the generic Area
branch.
The FastFieldGetter closure also captures *DBFArea directly in the
DBF branch, so the WHERE predicate side of the hot loop is now
entirely devirtualized: no interface dispatch between the pcode
dispatch loop and the DBF record buffer.
Validation:
- FiveSql2 43/43
- Harbour compat 51/51
- go test ./... ALL PASS
Remaining gap to raw RDD on no-WHERE (~1.45x) is dominated by the
two-column row construction + ArraySlab + flat backing bookkeeping
that the raw loop doesn't do. Going below that requires changing
the SQL engine's result shape — out of scope here.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pcode expressions compiled from SQL WHERE clauses (via genpc.CompileExpr)
never contain BEGIN SEQUENCE and can't raise BreakValue, so the defer +
recover dance in ExecPcode's EndProc is pure overhead. For FiveSql2's
per-row WHERE evaluation on a 50k-row scan, that's 50k × ~15ns = ~750µs
of pointless recover bookkeeping.
Split ExecPcode into two variants sharing execPcodeBody:
ExecPcode — full: Frame + defer EndProc. General-purpose,
handles panics. Behavior unchanged.
ExecPcodeFast — hot: Frame + execPcodeBody + EndProcFast. No defer,
no recover. Caller guarantees the pcode body can't
panic with HbError / BreakValue.
SqlScan now uses ExecPcodeFast for per-row WHERE evaluation. Measured
impact on 50k-row no-WHERE benchmark: 10.6ms → 9.2ms steady state
(~13% faster). Effect is smaller on numeric-WHERE because per-row
cost there is dominated by the opcode dispatch itself, not the frame
exit.
Validation:
- FiveSql2 43/43
- go test ./hbrt/... PASS (pcode tests)
- go test ./hbrtl/... PASS
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The prior loop allocated one small `[]hbrt.Value` per matching row
(for the row body) plus one HbArray header. For a 50k-row full scan
that's 100k allocations of which the small-slice allocs dominated
fragmentation and GC pressure.
SQLite-inspired fix: pre-allocate a single flat []hbrt.Value of
capacity `RecCount * nFields` at scan start and hand each row a
three-index sub-slice (flat[off:end:end]). The capped sub-slice
still forces a reallocation if PRG code later does `AAdd(row, x)`,
so neighbor rows can't get clobbered.
Sizing the initial buffer off RecCount(err-ignored) was the actual
win — the previous naive grow-from-1024 policy caused five mid-scan
reallocations of a ~200 KB buffer, each memcpy'ing everything so far.
One upfront allocation amortizes much better.
Bench (50k rows, ~/tmp ext4, 3 runs steady-state):
Before After Δ
no WHERE 14.6ms 10.6ms −27%
numeric WHERE 11.7ms 10.0ms −15%
string WHERE 10.5ms 11.0ms ~=
raw RDD baseline 6.8ms 7.0ms
Gap to raw RDD: 2.1x → 1.4x on the dominant no-WHERE case. What's
left is pcode WHERE dispatch (ExecPcode frame per row), the Area
interface boundary, and the HbArray header allocation per row —
all structural costs that would need a wider refactor to close.
Validation:
- FiveSql2 43/43
- go test ./hbrtl/... PASS
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements hybrid execution model: keep AST tree-walk for SQL:2013+
features (Window, Recursive CTE, JOIN, aggregates) while compiling
simple SELECT hot paths to Go + pcode. See docs/FiveSql2-Hybrid-Plan.md
for the full architecture rationale (why not SQLite-style VDBE).
Hot path (single table, no joins/groups/aggregates):
- TryBuildFieldPositions: resolves SELECT column list to FieldPos
array once per query (bails to PRG loop on any complex expr).
- TryCompileWhere + SqlExprToPrg: walks WHERE AST, emits equivalent
PRG source, runs it through PcCompile to get a PcodeFunc.
- SqlScan RTL: Go-native scan loop — GoTop/EOF/Skip/GetValue
direct, ExecPcode per row for WHERE, result array pre-alloc.
WHERE compiler scope:
- ND_LIT numeric/logical/string (string literals AllTrim'd to match
SqlCmpEq CHAR-padding semantics; rejects embedded quotes/newlines)
- ND_COL: CHAR fields auto-wrapped with AllTrim(FieldGet(n)) based
on dbStruct() lookup cached once per query in aCompileStruct
- ND_BIN: = <> != < <= > >= AND OR + - * /
- ND_UNI: NOT -
- Anything else (ND_FN, ND_CASE, ND_SUB, ND_PAR, LIKE, IN, IS NULL,
BETWEEN, dates) returns NIL → falls back to PRG tree-walk.
Bench (50k rows, ~/tmp ext4):
Before After Speedup
Numeric WHERE ~150ms 11.7ms ~13x
String WHERE 119.3ms 10.5ms 11.4x
No WHERE - 14.6ms -
Raw RDD baseline 6.8ms 6.8ms 1.0x
Remaining gap to raw RDD (~1.5x) is structural: Value boxing, result
array construction, per-row ExecPcode frame overhead. Would need a
Value-pool or SoA refactor to close further.
Side fixes bundled:
- TSqlIndex:FindExclusive short-circuited. Originally called
dbInfo(DBI_FULLPATH)/DBI_SHARED which are unresolved symbols in
Five (dbInfo is a stub, DBI_* never defined). Panic'd with
"local variable index out of range: 0" whenever a standalone PRG
had a workarea Used before calling five_SQL. 43-test masked the
bug because it only reached FindExclusive with no open workareas.
Restore the scan once dbInfo lands in hbrtl.
- cmd/five/main.go: FIVE_KEEP_BUILD=1 env var keeps the temp Go
project around for debugging gengo output.
Validation:
- FiveSql2 43/43
- Harbour compat 51/51
- go test ./... ALL PASS
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>