fix(pgserver): Layer 5 — per-path mmap-gen registry + getWA torn-read

Closes the Go-panic class of multi-session concurrency bugs and
introduces an explicit cross-area mmap invalidation channel.

1. getWA waCache torn-read (root cause of panics)

   hbrtl/rdd.go cached the most recent `interface{} → *WAM` type
   assertion in a process-global struct of two `interface{}`-shaped
   fields. Each pgserver connection's NewThread gets its own WAM, so
   the cache missed on every call and immediately re-wrote two shared,
   unsynchronised fields. Go's `interface{}` is two words; concurrent
   write + read produced torn pointer values, with the result that
   goroutine A could observe goroutine B's WAM as its own.

   That mis-attribution surfaced as:

   - `concurrent map writes` panic at WorkAreaManager.Close
     (workarea.go:95): two goroutines genuinely modifying the SAME
     wam.aliases map.
   - `concurrent map writes` panic at DBFArea.FieldPosCache
     (dbf.go:439): two goroutines lazy-initing the SAME fieldPosMap.

   Drop the cache. The type assertion is ~ns; not worth a
   process-global shared slot. If perf matters again, replace with a
   sync.Map keyed by thread pointer, not a single struct.

2. Per-path mmap generation registry (hbrdd/dbf/area_registry.go)

   Each unique on-disk DBF path gets an atomic uint64 generation
   counter. *DBFArea instances:

   - On Open: pathGen = pathGenFor(path); pathGenSeen = current.
   - On Append (shared) / flushRecord: bumpPathGen(path);
     pathGenSeen = current.
   - On loadRecord: if pathGenSeen < live counter, bypass mmap fast
     path for THIS load (use ReadAt) and re-sync seen.

   Without this, a peer DBFArea's PutValue mutating a record we'd
   mmap-cached returned stale pre-mutation bytes from our snapshot.
   The existing length-bound check covered file-grow
   (`offset > mmap len`) but not byte-level mutation within the
   snapshot range. The registry covers both.

   Cheap: read = one atomic.LoadUint64, hit rate is ~100% in the
   single-writer-many-readers steady state.

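The Open / Append / loadRecord rules in item 2 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the contents of area_registry.go: `pathGenFor` and `bumpPathGen` reuse the names from the message, but their signatures here are guesses, and `pathGens`, `area`, `openArea`, `write`, and `canUseMmap` are hypothetical stand-ins for the real registry and DBFArea.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// pathGens maps an on-disk path to its *uint64 generation counter.
// Hypothetical stand-in for the registry described in the message.
var pathGens sync.Map

// pathGenFor returns the shared counter for path, creating it on first use.
// (Assumed signature; the real one is not shown in this commit excerpt.)
func pathGenFor(path string) *uint64 {
	g, _ := pathGens.LoadOrStore(path, new(uint64))
	return g.(*uint64)
}

// bumpPathGen advances the counter for path and returns the new value.
func bumpPathGen(path string) uint64 {
	return atomic.AddUint64(pathGenFor(path), 1)
}

// area is a hypothetical, stripped-down DBFArea.
type area struct {
	path string
	gen  *uint64 // shared per-path counter
	seen uint64  // last generation this area observed
}

// openArea models Open: attach to the counter and record its current value.
func openArea(path string) *area {
	g := pathGenFor(path)
	return &area{path: path, gen: g, seen: atomic.LoadUint64(g)}
}

// write models Append / flushRecord: bump the counter and re-sync seen.
func (a *area) write() {
	a.seen = bumpPathGen(a.path)
}

// canUseMmap models the loadRecord check. It returns false for exactly one
// load after a peer write (the caller would fall back to ReadAt), then
// re-syncs seen so later loads take the fast path again.
func (a *area) canUseMmap() bool {
	live := atomic.LoadUint64(a.gen)
	if a.seen < live {
		a.seen = live
		return false
	}
	return true
}

func main() {
	reader := openArea("/data/customers.dbf")
	writer := openArea("/data/customers.dbf")

	fmt.Println(reader.canUseMmap()) // true: nothing has changed since Open
	writer.write()
	fmt.Println(reader.canUseMmap()) // false: peer wrote, bypass mmap once
	fmt.Println(reader.canUseMmap()) // true: seen was re-synced
}
```

In this sketch the read-side cost is a single `atomic.LoadUint64`, which matches the "cheap" claim above; each `area` is assumed to be used by one goroutine at a time, so `seen` itself is not atomic.
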
Verification
------------

Same 3 / 5 / 10-worker pgx-driven concurrency stress harness:

  pre-Layer-1 baseline:       ~60% pass + occasional panic
  +Layer 1+2:                 80%    / 50%    / panic
  +Layer 3a (max-merge):      80%    / 50%    / panic
  +Layer 4a (per-session 3):  90%    / 80%    / 50%
  +Layer 4b (Go atomics):     75-90% / 50-80% / panic (still)
  +THIS (getWA + mmap-gen):   73%    / 67%    / 33%  (ZERO PANICS)

The shift "many partial fails, no panics" is what matters for
production: a connection seeing stale data is recoverable (rerun the
query); a Go-level process crash is not.

Remaining correctness flake comes from the in-flight appendBuf
interaction when peer Append fires between this connection's Append
and flushRecord. That's tractable with a per-connection flush
ordering rule, deferred to Layer 6.

All six release gates green:

  go test ./...              ✓
  FiveSql2 SQL:1999  43/43   ✓
  Harbour compat     56/56   ✓
  std.ch             17/17   ✓
  FRB                 7/7    ✓
  pgserver integration 6/6   ✓

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 21:43:04 +09:00
parent 5e4a1c5d72
commit 151b628f6c
3 changed files with 135 additions and 17 deletions
--- a/hbrtl/rdd.go
+++ b/hbrtl/rdd.go
@@ -109,24 +109,25 @@ func rtlFieldGet(t *hbrt.Thread) {
 	t.RetValue()
 }

-// getWA returns the WorkAreaManager with cached type assertion.
-var waCache = struct {
-	iface interface{}
-	wam   *hbrdd.WorkAreaManager
-}{}
-
+// getWA resolves the WorkAreaManager attached to this thread.
+//
+// The previous version cached the last-seen interface→*WAM pair in
+// a process-global struct to skip the type assertion. That cache
+// was the worst-of-both-worlds under multi-pgserver-connection
+// load: each connection's thread has its own WAM, so the cache
+// missed on every call and immediately re-wrote two shared
+// `interface{}` fields. Go's interface is a two-word value, so a
+// concurrent write+read produced torn pointers — different
+// goroutines saw the WRONG WAM as their own, leading to the
+// FieldPosCache + WAM.aliases "concurrent map writes" panics.
+//
+// The type assertion itself is fast (~ns). Drop the cache; if the
+// micro-bench matters again, replace with a sync/atomic.Pointer
+// or sync.Map keyed by thread, not a single global slot.
 func getWA(t *hbrt.Thread) *hbrdd.WorkAreaManager {
 	if t.WA == nil {
 		return nil
 	}
-	if t.WA == waCache.iface {
-		return waCache.wam
-	}
-	wa, ok := t.WA.(*hbrdd.WorkAreaManager)
-	if !ok {
-		return nil
-	}
-	waCache.iface = t.WA
-	waCache.wam = wa
+	wa, _ := t.WA.(*hbrdd.WorkAreaManager)
 	return wa
 }