fix(pgserver): Layer 5 — per-path mmap-gen registry + getWA torn-read
Closes the Go-panic class of multi-session concurrency bugs and
introduces an explicit cross-area mmap invalidation channel.
1. getWA waCache torn-read (root cause of panics)
hbrtl/rdd.go cached the most recent `interface{} → *WAM` type
assertion in a process-global struct with two fields: the
`interface{}` key and the asserted `*WAM` pointer. Each pgserver
connection's NewThread gets its own WAM, so under multi-connection
load the cache missed on every call and immediately re-wrote both
shared, unsynchronised fields. Go's `interface{}` is two words, so
a concurrent write + read could produce a torn value, and a reader
could match the key and then pick up a pointer stored by another
writer. Either way, goroutine A could observe goroutine B's WAM as
its own.
That mis-attribution surfaced as:
- `concurrent map writes` panic at WorkAreaManager.Close
  (workarea.go:95): two goroutines genuinely modifying the
  SAME wam.aliases map.
- `concurrent map writes` panic at DBFArea.FieldPosCache
  (dbf.go:439): two goroutines lazy-initing the SAME
  fieldPosMap.
Drop the cache. The type assertion costs on the order of
nanoseconds, which does not justify a process-global shared slot.
If perf matters again, replace it with a sync.Map keyed by thread
pointer, not a single struct.
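A minimal sketch of the two options named above: the uncached assertion, and a cache keyed by thread pointer. `Thread` and `WorkAreaManager` here are simplified stand-ins for `hbrt.Thread` and `hbrdd.WorkAreaManager`, and `getWACached`, `waByThread`, and `mismatches` are hypothetical names used only for illustration; none of this is the code in the commit.

```go
package main

import (
	"fmt"
	"sync"
)

// WorkAreaManager and Thread are simplified stand-ins (assumed shapes).
type WorkAreaManager struct{ id int }

type Thread struct{ WA interface{} }

// getWA mirrors the uncached version: a plain type assertion that
// reads only the caller's own Thread and touches no shared state.
func getWA(t *Thread) *WorkAreaManager {
	if t.WA == nil {
		return nil
	}
	wa, _ := t.WA.(*WorkAreaManager)
	return wa
}

// waByThread is a hypothetical per-thread cache. Each *Thread has
// its own entry, so one goroutine never reads another thread's slot.
var waByThread sync.Map // *Thread -> *WorkAreaManager

func getWACached(t *Thread) *WorkAreaManager {
	if v, ok := waByThread.Load(t); ok {
		return v.(*WorkAreaManager)
	}
	wa := getWA(t)
	if wa != nil {
		waByThread.Store(t, wa)
	}
	return wa
}

// mismatches runs n goroutines, each with its own Thread and WAM,
// and counts lookups that return a WAM other than the caller's own.
func mismatches(n, iters int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	bad := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			own := &WorkAreaManager{id: id}
			t := &Thread{WA: own}
			for j := 0; j < iters; j++ {
				if getWA(t) != own || getWACached(t) != own {
					mu.Lock()
					bad++
					mu.Unlock()
				}
			}
		}(i)
	}
	wg.Wait()
	return bad
}

func main() {
	fmt.Println("mismatches:", mismatches(10, 1000)) // mismatches: 0
}
```

The keyed cache would still need an entry removed when a thread closes, and it goes stale if `t.WA` is ever reassigned, which is part of why the plain assertion is the simpler choice.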
2. Per-path mmap generation registry (hbrdd/dbf/area_registry.go)
Each unique on-disk DBF path gets an atomic uint64 generation
counter. *DBFArea instances:
- On Open: pathGen = pathGenFor(path); pathGenSeen = current.
- On Append (shared) / flushRecord: bumpPathGen(path);
  pathGenSeen = current.
- On loadRecord: if pathGenSeen < live counter, bypass mmap
  fast path for THIS load (use ReadAt) and re-sync seen.
Without this, when a peer DBFArea's PutValue mutated a record we
had mmap-cached, our loadRecord returned stale pre-mutation bytes
from the snapshot. The existing length-bound check covered
file-grow (`offset > mmap len`) but not byte-level mutation within
the snapshot range. The registry covers both.
The check is cheap: each load costs one atomic.LoadUint64, and the
mmap fast path is still taken ~100% of the time in the
single-writer-many-readers steady state.
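The Open / write / loadRecord rules above can be sketched as follows. The names `pathGenFor`, `bumpPathGen`, `pathGen`, and `pathGenSeen` come from this message; their bodies, the `area` type, and the `open`, `write`, and `loadUsesMmap` helpers are assumptions made for illustration, not the contents of area_registry.go.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// pathGens maps an on-disk path to its generation counter.
var pathGens sync.Map // string -> *uint64

// pathGenFor returns the counter for path, creating it on first use.
func pathGenFor(path string) *uint64 {
	if v, ok := pathGens.Load(path); ok {
		return v.(*uint64)
	}
	v, _ := pathGens.LoadOrStore(path, new(uint64))
	return v.(*uint64)
}

// bumpPathGen advances the counter and returns the new generation.
func bumpPathGen(path string) uint64 {
	return atomic.AddUint64(pathGenFor(path), 1)
}

// area is a hypothetical stand-in for *DBFArea.
type area struct {
	path        string
	pathGen     *uint64
	pathGenSeen uint64
}

// open models Open: remember the counter and the current generation.
func open(path string) *area {
	g := pathGenFor(path)
	return &area{path: path, pathGen: g, pathGenSeen: atomic.LoadUint64(g)}
}

// write models Append (shared) / flushRecord: bump, then re-sync.
func (a *area) write() {
	a.pathGenSeen = bumpPathGen(a.path)
}

// loadUsesMmap models the loadRecord check. It returns false for the
// one load that follows a peer write (the caller would use ReadAt)
// and re-syncs so later loads take the fast path again.
func (a *area) loadUsesMmap() bool {
	live := atomic.LoadUint64(a.pathGen)
	if a.pathGenSeen < live {
		a.pathGenSeen = live
		return false
	}
	return true
}

func main() {
	reader, writer := open("customers.dbf"), open("customers.dbf")
	fmt.Println(reader.loadUsesMmap()) // true
	writer.write()
	fmt.Println(reader.loadUsesMmap()) // false
	fmt.Println(reader.loadUsesMmap()) // true
	fmt.Println(writer.loadUsesMmap()) // true
}
```

In this sketch the writer never bypasses its own fast path, because it re-syncs `pathGenSeen` at the moment it bumps the counter.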
Verification
------------
Same 3 / 5 / 10-worker pgx-driven concurrency stress harness:
  pre-Layer-1 baseline:        ~60% pass + occasional panic
  +Layer 1+2:                  80% / 50% / panic
  +Layer 3a (max-merge):       80% / 50% / panic
  +Layer 4a (per-session 3):   90% / 80% / 50%
  +Layer 4b (Go atomics):      75-90% / 50-80% / panic (still)
  +THIS (getWA + mmap-gen):    73% / 67% / 33% (ZERO PANICS)
The shift to "many partial fails, no panics" is what matters for
production: a connection seeing stale data is recoverable (rerun
the query); a Go-level process crash is not. The remaining
correctness flake comes from the in-flight appendBuf interaction
when a peer Append fires between this connection's Append and
flushRecord. That is tractable with a per-connection flush
ordering rule, deferred to Layer 6.
All six release gates green:
  go test ./...            ✓
  FiveSql2 SQL:1999        43/43 ✓
  Harbour compat           56/56 ✓
  std.ch                   17/17 ✓
  FRB                      7/7 ✓
  pgserver integration     6/6 ✓
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hbrtl/rdd.go
@@ -109,24 +109,25 @@ func rtlFieldGet(t *hbrt.Thread) {
 	t.RetValue()
 }
 
-// getWA returns the WorkAreaManager with cached type assertion.
-var waCache = struct {
-	iface interface{}
-	wam   *hbrdd.WorkAreaManager
-}{}
-
+// getWA resolves the WorkAreaManager attached to this thread.
+//
+// The previous version cached the last-seen interface→*WAM pair in
+// a process-global struct to skip the type assertion. That cache
+// was the worst-of-both-worlds under multi-pgserver-connection
+// load: each connection's thread has its own WAM, so the cache
+// missed on every call and immediately re-wrote two shared
+// `interface{}` fields. Go's interface is a two-word value, so a
+// concurrent write+read produced torn pointers — different
+// goroutines saw the WRONG WAM as their own, leading to the
+// FieldPosCache + WAM.aliases "concurrent map writes" panics.
+//
+// The type assertion itself is fast (~ns). Drop the cache; if the
+// micro-bench matters again, replace with a sync/atomic.Pointer
+// or sync.Map keyed by thread, not a single global slot.
 func getWA(t *hbrt.Thread) *hbrdd.WorkAreaManager {
 	if t.WA == nil {
 		return nil
 	}
-	if t.WA == waCache.iface {
-		return waCache.wam
-	}
-	wa, ok := t.WA.(*hbrdd.WorkAreaManager)
-	if !ok {
-		return nil
-	}
-	waCache.iface = t.WA
-	waCache.wam = wa
+	wa, _ := t.WA.(*hbrdd.WorkAreaManager)
 	return wa
 }