CharlesKWON 99f3ca5687 perf(FiveSql2): EXISTS semi-join lift — H3 correlated EXISTS ~2000x faster
Correlated EXISTS with high-cardinality keys was stuck at O(outer × inner)
because memoization couldn't amortize across unique correlation values.
H3 in the subquery stress bench:

    SELECT e.name FROM emp e
    WHERE EXISTS (SELECT 1 FROM ord WHERE ord.emp_id = e.id AND ord.qty > 15)

500 outer rows (each with a distinct e.id) × one 5000-row ord scan per
outer row = up to 2.5M inner row visits, about 10s, with no path to
improvement from caching the subquery result.

Fix: detect the semi-join shape on the subquery and rewrite it at
runtime into a non-correlated DISTINCT scan whose result is cached
as a hash set. Each outer row then becomes an O(1) hash probe.

--- What we lift ---

  SELECT ... FROM inner_table
  WHERE inner.col = outer.col [AND other_non_correlated_preds]

Shape constraints (all must hold):
  - single table, no JOIN
  - no GROUP BY, no HAVING, no UNION
  - WHERE is an AND tree containing an equi-term where one side is
    a column with an alias prefix from the subquery's own FROM
    and the other is a column from an outer alias
  - the remaining AND terms (non-correlated residue) have no
    outer references of their own — rules out patterns like
    `WHERE e2.dept = e.dept AND e2.salary > e.salary` where the
    second term can't live without the outer context

--- How the lift works ---

1. Walk the WHERE as a flat AND-term list
2. Find and remove the first correlated equi-term, remember the
   inner column name and outer column reference
3. Verify residue is non-correlated via a recursive AST walker
   (SemiJoinHasOuterRef) — bail to fallback if not
4. Clone hQuery with:
     columns  = {DISTINCT inner.col}
     where    = residue (or NIL)
     distinct = .T.
     limit / top / order_by / group_by / having cleared
5. Run the cloned subquery once via a nested TSqlExecutor — no
   PushOuter because it's now non-correlated
6. Build a hash set keyed on SqlValToStr(each distinct inner value)
7. Per EXISTS probe: resolve the outer column reference and look it
   up in the hash set
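
Steps 4 to 7 amount to one pass over the inner table followed by constant-time probes. The following Python sketch illustrates that idea under stated assumptions: rows are plain dicts, `str()` stands in for SqlValToStr, and skipping NULL keys is an assumption of this example rather than documented FiveSql2 behaviour. The function names are hypothetical.

```python
def build_semi_join_set(inner_rows, inner_col, residue=None):
    """Steps 4-6: one non-correlated pass collecting the distinct key values."""
    keys = set()
    for row in inner_rows:
        if residue is None or residue(row):
            value = row[inner_col]
            if value is not None:       # assumption: NULL keys never match
                keys.add(str(value))    # str() stands in for SqlValToStr
    return keys

def exists_probe(keys, outer_row, outer_col):
    """Step 7: resolve the outer column and probe the hash set."""
    value = outer_row[outer_col]
    return value is not None and str(value) in keys

# Tiny stand-ins for the emp and ord tables from the H3 query.
emp_rows = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cy"},
]
ord_rows = [
    {"emp_id": 1, "qty": 20},
    {"emp_id": 1, "qty": 30},
    {"emp_id": 2, "qty": 5},
    {"emp_id": 3, "qty": 16},
]

# Residue "ord.qty > 15" is applied during the single inner scan.
keys = build_semi_join_set(ord_rows, "emp_id", lambda r: r["qty"] > 15)

# SELECT e.name FROM emp e WHERE EXISTS (...)
names = [e["name"] for e in emp_rows if exists_probe(keys, e, "id")]
```

The inner table is scanned once regardless of how many outer rows there are, so the total work is roughly inner rows plus outer rows instead of their product.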

Results are cached in ::aSemiJoinSlots, indexed by xSubNode identity,
so the analysis and the lifted scan run exactly once per subquery
expression. Subqueries that don't match the shape store the sentinel
"NO" so subsequent probes skip re-analysis and fall through to the
existing SubqueryCached + LIMIT 1 path.

NOT EXISTS works through the same path: the lNegate flag just flips
the final hash-lookup result.
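
A minimal sketch of the slot cache and the negate flag, again in Python rather than Harbour. `SemiJoinCache`, `eval_exists`, and the `analyze` / `fallback` callbacks are hypothetical; only the "NO" sentinel, the identity-keyed slots, and the negate flag come from the description above. Applying the flag to the fallback result as well is an assumption of this sketch.

```python
NO_LIFT = "NO"  # sentinel: the subquery did not match the liftable shape

class SemiJoinCache:
    def __init__(self):
        self.slots = {}          # stand-in for ::aSemiJoinSlots
        self.analysis_runs = 0   # only here to show the run-once behaviour

    def slot_for(self, sub_node, analyze):
        key = id(sub_node)       # identity of the subquery node, not its contents
        if key not in self.slots:
            self.analysis_runs += 1
            result = analyze(sub_node)   # returns a key set, or None
            self.slots[key] = NO_LIFT if result is None else result
        return self.slots[key]

def eval_exists(cache, sub_node, analyze, outer_value, fallback, negate=False):
    slot = cache.slot_for(sub_node, analyze)
    if slot == NO_LIFT:
        found = fallback(sub_node, outer_value)   # existing per-row path
    else:
        found = outer_value in slot               # O(1) hash probe
    return (not found) if negate else found

# Two hypothetical subquery nodes: one liftable, one not.
liftable = {"name": "H3"}
not_liftable = {"name": "H8"}

def analyze(node):
    return {1, 3} if node is liftable else None

def fallback(node, outer_value):
    return outer_value == 2

cache = SemiJoinCache()
exists_results = [eval_exists(cache, liftable, analyze, v, fallback)
                  for v in (1, 2, 3)]
not_exists_results = [eval_exists(cache, liftable, analyze, v, fallback, negate=True)
                      for v in (1, 2, 3)]
fallback_results = [eval_exists(cache, not_liftable, analyze, v, fallback)
                    for v in (1, 2)]
```

After eight probes across two subquery nodes, `cache.analysis_runs` is 2: one analysis per node, including the node that was rejected.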

--- Bench (emp=500, prod=100, ord=5k) ---

  Pattern                         Before      After      Speedup
  ────────────────────────────────────────────────────────────
  H3  EXISTS correlated           10.0s       4.5ms      ~2200x
  H8  NOT EXISTS self-join         900ms      890ms      same (not lifted, see below)
  H11 Scalar + EXISTS + derived    3.2s       1.0s       3.2x

H8 correctly falls through to the non-lifted path because the
remainder outer-reference check (SemiJoinHasOuterRef) rejects the
`e2.salary > e.salary` term. The 5-row answer is still correct.

Validation:
  - FiveSql2 43/43
  - Harbour compat 51/51
  - go test ./... ALL PASS
  - H3 returns 125 rows (matches pre-change correct result)
  - H8 returns 5 rows (matches pre-change correct result)

Known pre-existing bug, unrelated: H7 (scalar correlated subquery
with inner INNER JOIN) returns zero for rows 2..N — workarea state
leaks between consecutive subquery invocations. Not touched here,
filed for follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 23:06:35 +09:00

FiveSql2 — SQL Engine for Harbour DBF/NTX/CDX

Pratt parser + SQL:1992-2023 full standard support
Supports both NTX (Clipper) and CDX (FoxPro/ADS) indexes

Architecture

five_SQL("SELECT ...")
   │
   ├── TSqlLexer        Tokenizer
   ├── TSqlParser2      Pratt parser (data-driven operators)
   ├── TSqlExecutor     Query executor (Volcano model)
   │     ├── TSqlAlias  Central alias manager (no collisions)
   │     ├── TSqlIndex  NTX/CDX index optimization (auto-detect)
   │     ├── TSqlAgg    GROUP BY / aggregation
   │     ├── TSqlSort   ORDER BY / DISTINCT
   │     ├── TSqlDDL    CREATE/DROP/ALTER TABLE/INDEX
   │     └── TSqlTxn    BEGIN/COMMIT/ROLLBACK
   ├── TSqlExpr         AST nodes + expression evaluation
   └── TSqlFunc         60+ scalar functions

Build & Test

export PATH="/path/to/harbour-core/bin/linux/gcc:$PATH"
export HB_INSTALL_PREFIX="/path/to/harbour-core"

make          # Build all tests
make test     # Run all 157 tests
make bench    # Parser benchmark
make clean    # Clean

SQL Standard Coverage

  Standard   Features                                               Tests
  ────────────────────────────────────────────────────────────────────────
  SQL:1992   SELECT, JOIN, GROUP BY, HAVING, Subquery, CASE, CAST   43
  SQL:1999   CTE, Recursive CTE, Window Functions, MERGE            10
  SQL:2003   SIMILAR TO, GROUPING SETS, LATERAL, Window frames      64
  SQL:2008   FETCH/OFFSET, FOR UPDATE, Extended MERGE               (incl.)
  SQL:2016   JSON functions, LISTAGG                                (incl.)
  SQL:2023   ANY_VALUE, GREATEST/LEAST, BOOL_AND/OR                 (incl.)
  Challenge  LeetCode-level complex queries                         15
  Extreme    Production analytics stress tests                      15

Adding New Operators

Edit TSqlParser2.prg, method InitInfixTables():

::hInfixTT[ TK_MYOP ] := { "<=>", 40, 41, ND_BIN }

One line. No structural changes needed.
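
To illustrate why a table-driven Pratt parser needs only one new entry per operator, here is a small self-contained Python sketch. It is not TSqlParser2: the `INFIX` table, `tokenize`, and `parse` are hypothetical, and reading the `40, 41` pair in the Harbour line as left and right binding powers is an assumption based on how Pratt parsers are commonly written.

```python
import re

# token -> (left binding power, right binding power); values are illustrative
INFIX = {
    "OR":  (10, 11),
    "AND": (20, 21),
    "=":   (40, 41),
    "+":   (50, 51),
    "*":   (60, 61),
}

def tokenize(text):
    """Split into identifiers/keywords, integers, and operator tokens."""
    return re.findall(r"<=>|[A-Za-z_]\w*|\d+|[=+*()]", text)

def parse(tokens, min_bp=0):
    """Pratt loop: operators are looked up in INFIX, never hard-coded."""
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 0)
        tokens.pop(0)                 # consume ")"
    else:
        left = tok
    while tokens and tokens[0] in INFIX:
        lbp, rbp = INFIX[tokens[0]]
        if lbp < min_bp:
            break
        op = tokens.pop(0)
        right = parse(tokens, rbp)    # rbp > lbp gives left associativity
        left = (op, left, right)
    return left

# Registering a new operator is a single table entry.
INFIX["<=>"] = (40, 41)

tree = parse(tokenize("a <=> b AND c = d"))
sum_tree = parse(tokenize("1 + 2 + 3"))
```

With the entry registered, `tree` groups `a <=> b` and `c = d` under `AND`, and `sum_tree` nests to the left; the parsing loop itself did not change.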

Copyright (c) 2025-2026 Charles KWON (Charles KWON OhJun)
Email: charleskwonohjun@gmail.com
All rights reserved.