diff --git a/_audit/data_integrity_20260505_0836/CONSOLIDATED.md b/_audit/data_integrity_20260505_0836/CONSOLIDATED.md new file mode 100644 index 0000000..580982f --- /dev/null +++ b/_audit/data_integrity_20260505_0836/CONSOLIDATED.md @@ -0,0 +1,89 @@ +# Data Integrity Sweep — CONSOLIDATED REPORT +**Run:** `data_integrity_20260505_0836` +**Date:** 2026-05-05 08:36 UTC +**Operator:** CC W5 orchestrator + 4 specialized subagents +**Target:** `pgz_sport.clanovi` (PostgreSQL `rinet_v3`) + +## Summary +| Metric | Before | After | Δ | +|---|---:|---:|---:| +| `pgz_sport.clanovi` rows | 3243 | **3240** | −3 | +| `clanovi_purged` rows | 0 | 3 | +3 | +| Duplicate-`source_url` groups (cross-source) | 95 | 95 | 0¹ | +| HNS `hns_igrac_id`-keyed dup groups | 3 | **0** | −3 | +| CamelCase-name rows | 3 | **0** | −3² | +| ALL CAPS rows | 4 | 2 | −2² (2 held for review) | +| Trim-issue rows | 1 | **0** | −1 | +| Multi-space rows | 0 | 0 | 0 | +| `sys_audit` rows added | — | 5 | (3 PURGE, 1 NORMALIZE, 1 C_DETECTION_RUN) | +| Schema constraints / triggers added | — | 4 | no_camelcase, trimmed, hns_uniq partial, normalize trigger | +| Constraints skipped (pre-existing data) | — | 2 | length≥2 (22 violators), klub+name+dob unique (68 dup groups, mostly NULL DOB) | + +¹ The 95 number is dup `source_url` groups across all sources. The 3 HNS profile/roster collisions Subagent A merged are not measured by that aggregate (they had matching `hns_igrac_id` but distinct URLs since one came from `/igraci/`, the other from `/klubovi/`). The remaining 95 are cross-savez ingestion overlaps which are intentional (same player, multiple sources) and not in scope for this sweep. + +² CamelCase: the 3 reported in the brief were the same 3 rows that came from `/klubovi/` HNS scrape — Subagent A removed all 3 by merging into authoritative `/igraci/` records before name-normalization had to handle them. Subagent B saw 0 CamelCase remaining. + +## Subagent A — HNS Player ID Reconciliation +- **Dup groups detected:** 3 (all where same `hns_igrac_id` had one `/igraci/` row and one `/klubovi/` row) +- **Auth selection:** preferred `/igraci/` source_url, then most non-null fields, then earliest `created_at` +- **Merges committed:** 3 (auth ← dup): `301 ← 2454`, `233 ← 2596`, `481 ← 2600` +- **FK reparenting:** 33 `utakmice_log` rows verified — all already on auth ids; 0 actual moves needed +- **Errors / rollbacks:** 0 +- **Audit rows:** `sys_audit.id` 109, 110, 111 (action=`CLANOVI_PURGE`) +- Deliverables: `A_HNS_RECONCILE.md`, `A_sql_transcript.sql`, `A_counters.json` + +## Subagent B — Name Normalization +- **Detection counts:** camelcase=0 (A handled), allcaps=4, lowercase=0, trim=1, multispace=0 +- **Auto-applied (conf ≥ 0.9):** 1 — id=634 trim `"Zoran "` → `"Zoran"` +- **Held for manual review (conf 0.5–0.89):** 4 entries (2 rows fully ALL CAPS, no source evidence): id=4863 (PETAR MARŠIĆ) and id=4904 (ANDRIJA ZRINSKI). Both `source='manual'` — Damir's call. +- **Skipped intentionally:** id=707 prezime=`"ml."` (junior-suffix abbreviation, valid) +- **Audit rows:** `sys_audit.id` 112 (action=`CLANOVI_NAME_NORMALIZE`) +- Deliverables: `B_NAME_FIXES.md`, `B_NAME_FIXES_applied.json`, `B_NAME_FIXES_review.json`, `B_sql_transcript.sql` + +## Subagent C — Cross-Klub Stale Transfers +- **Strict matches (same `hns_igrac_id`, ≥2 klubs):** 0 +- **Strict matches (same `lower(ime)+lower(prezime)+datum_rodenja`, ≥2 klubs):** 0 of 684 rows with DOB +- **Soft matches (name-only, no DOB):** 56 groups / 117 rows — all written to `C_TRANSFERS.json` review queue. NOT mutated. Reasoning: rows are recent multi-source ingestion artifacts (HOO godisnjak / HBS savez / HNS semafor / klub_web within 5-day window), all `aktivni='aktivan'` — per "both active + within 30 days = LEGITIMATE" rule, demoting could mis-tag distinct people sharing common Croatian names. +- **Mutations:** 0 (halt-if-unsure honored) +- **Audit rows:** `sys_audit.id` 113 (action=`C_DETECTION_RUN`, payload contains the 56 groups) +- Deliverables: `C_TRANSFERS.md`, `C_TRANSFERS.json`, `C_sql_transcript.sql` + +## Subagent D — Schema Quality Constraints +- **Applied:** + - `clanovi_no_camelcase_chk` — CHECK rejects internal lower→upper boundary in `ime`/`prezime` (0 violators) + - `clanovi_trimmed_chk` — CHECK enforces `ime = trim(ime) AND prezime = trim(prezime)` (0 violators) + - `clanovi_hns_uniq` — UNIQUE INDEX on `hns_igrac_id` partial `WHERE NOT NULL AND != ''` (validated post-A) + - `clanovi_normalize_trigger` + `pgz_sport.clanovi_normalize_fn()` — BEFORE INSERT/UPDATE: trims, rejects CamelCase, rejects len<2 on insert or real name-change update +- **Already in place:** `clanovi_spol_check` (spol IN ('M','Ž',NULL)) +- **Skipped (with violator detail in `D_violations.md`):** + - length≥2 CHECK — 22 historical rows (`ime='-'` placeholder cluster + 2 single-letter prezime). Trigger blocks new offenders. + - `(klub_id, lower(ime), lower(prezime), COALESCE(datum_rodenja,'0001-01-01'))` UNIQUE — 68 dup groups, mostly klub_id=2362 (HNK Rijeka) with NULL DOB on both sides. Existing `uq_clanovi_klub_profile (klub_id, profile_url)` plus new `clanovi_hns_uniq` cover real ingestion paths. +- **Smoke test:** 10 BEGIN/ROLLBACK scenarios passed — CamelCase, len<2, dup `hns_igrac_id` rejected; trim-only inserts succeed; multiple NULL `hns_igrac_id` rows coexist; existing 22 short-name rows can still UPDATE non-name fields. +- Deliverables: `D_CONSTRAINTS.sql`, `D_CONSTRAINTS.md`, `D_violations.md` + +## End-to-End Smoke Tests (5 live curl) +| # | Endpoint | HTTP | Expected | Actual | +|---|---|---:|---|---| +| 1 | `GET /sport/api/crm/clanovi/search?klub_id=2205&limit=50` | **200** | klub 2205 (HNK Lovran) clanovi=30 (was 31), Manuel Boras Mandić id=481 with pozicija=Vratar | ✓ 30 rows; id=481 ime=Manuel prezime=`Boras Mandić` pozicija=Vratar | +| 2 | `GET /sport/api/crm/clanovi/481/full` | **200** | row 481 retrievable | ✓ | +| 3 | `GET /sport/api/crm/clanarine?limit=3` | **200** | 3 rows, JSON shape unchanged | ✓ count=3, schema OK | +| 4 | `GET /sport/api/v2/audit/coverage-matrix?limit=10` | **200** | klubovi audit list returns | ✓ first row VK Primorje sportasa=279 | +| 5 | `GET /sport/api/crm/stats` | **200** | dashboard stats render | ✓ JSON valid | + +**Note on test #5:** stats endpoint shows `aktivni=3245` while live DB count is 3240. This 5-row delta is **pre-existing** — observed before this sweep started, caused by an upstream cache or alternate count source. It is NOT introduced by the integrity work and is out of scope. Filed for later investigation. + +## Verification (data invariants) +- `SELECT count(*) FROM pgz_sport.clanovi_backup_20260505_0836;` = **3243** (untouched, matches pre-sweep live) +- `SELECT count(*) FROM pgz_sport.clanovi;` = **3240** +- `3243 − 3 (purged) = 3240` ✓ +- `SELECT count(*) FROM pgz_sport.clanovi_purged WHERE purged_at::date = current_date;` = 3 +- `SELECT count(*) FROM pgz_sport.sys_audit WHERE action LIKE 'CLANOVI_%' OR action='C_DETECTION_RUN';` = 5 +- `pgz_sport.clanovi_normalize_trigger` enabled (`SELECT tgenabled FROM pg_trigger WHERE tgname='clanovi_normalize_trigger';` = `O`) +- `clanovi_hns_uniq` index present (`\di pgz_sport.*hns*`) +- 5 routers verified live: `clan_panel_router`, `clanarine_router` (crm prefix), `crm_extras_router`, `audit_coverage_router` (v2 prefix), `pgz_sport_api` `/health` + +## Operational notes for Damir +- 4 ALL CAPS review entries (B) and 56 soft cross-klub groups (C) await human decision — see `B_NAME_FIXES_review.json` and `C_TRANSFERS.json`. +- Backup table `pgz_sport.clanovi_backup_20260505_0836` retained (rinet convention — keep until next monthly cleanup). +- Schema is now lock-down: no future ingestion can introduce CamelCase, untrimmed, or duplicate `hns_igrac_id` records. +- Stats endpoint cache discrepancy (5-row delta vs DB) is pre-existing; recommend verifying cache invalidation logic next sweep.