Files
CC Data Integrity 49ac2c0dc8 Data integrity sweep: clanovi clean — 3 HNS dups merged, 1 trim normalized, 4 constraints active
Subagent A: merged 3 HNS profile/roster duplicate pairs (3243 → 3240 rows).
  Authoritative auths preferred /igraci/ source_url over /klubovi/ roster scrape.
  Manuel Boras Mandić (id=481) reconciled — pozicija=Vratar, hns_igrac_id=436387.
Subagent B: 1 trim auto-applied (id=634); 4 ALL CAPS held for manual review.
Subagent C: 0 strict cross-klub transfers; 56 soft groups in review queue.
Subagent D: 4 constraints applied (no_camelcase, trimmed, hns_uniq partial, normalize trigger);
  2 skipped (length>=2 — 22 historical violators; klub+name+dob unique — 68 NULL-DOB groups).

Backup: pgz_sport.clanovi_backup_20260505_0836 (3243 rows untouched).
Audit: 5 sys_audit rows (3 PURGE, 1 NORMALIZE, 1 C_DETECTION_RUN).
Smoke: 5/5 endpoints 200; HNK Lovran 31 → 30 clanovi confirmed.

Full report: _audit/data_integrity_20260505_0836/CONSOLIDATED.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 09:08:35 +02:00

7.5 KiB
Raw Permalink Blame History

Data Integrity Sweep — CONSOLIDATED REPORT

Run: data_integrity_20260505_0836 Date: 2026-05-05 08:36 UTC Operator: CC W5 orchestrator + 4 specialized subagents Target: pgz_sport.clanovi (PostgreSQL rinet_v3)

Summary

Metric Before After Δ
pgz_sport.clanovi rows 3243 3240 3
clanovi_purged rows 0 3 +3
Duplicate-source_url groups (cross-source) 95 95
HNS hns_igrac_id-keyed dup groups 3 0 3
CamelCase-name rows 3 0
ALL CAPS rows 4 2 2² (2 held for review)
Trim-issue rows 1 0 1
Multi-space rows 0 0 0
sys_audit rows added 5 (3 PURGE, 1 NORMALIZE, 1 C_DETECTION_RUN)
Schema constraints / triggers added 4 no_camelcase, trimmed, hns_uniq partial, normalize trigger
Constraints skipped (pre-existing data) 2 length≥2 (22 violators), klub+name+dob unique (68 dup groups, mostly NULL DOB)

¹ The 95 number is dup source_url groups across all sources. The 3 HNS profile/roster collisions Subagent A merged are not measured by that aggregate (they had matching hns_igrac_id but distinct URLs since one came from /igraci/, the other from /klubovi/). The remaining 95 are cross-savez ingestion overlaps which are intentional (same player, multiple sources) and not in scope for this sweep.

² CamelCase: the 3 reported in the brief were the same 3 rows that came from /klubovi/ HNS scrape — Subagent A removed all 3 by merging into authoritative /igraci/ records before name-normalization had to handle them. Subagent B saw 0 CamelCase remaining.

Subagent A — HNS Player ID Reconciliation

  • Dup groups detected: 3 (all where same hns_igrac_id had one /igraci/ row and one /klubovi/ row)
  • Auth selection: preferred /igraci/ source_url, then most non-null fields, then earliest created_at
  • Merges committed: 3 (auth ← dup): 301 ← 2454, 233 ← 2596, 481 ← 2600
  • FK reparenting: 33 utakmice_log rows verified — all already on auth ids; 0 actual moves needed
  • Errors / rollbacks: 0
  • Audit rows: sys_audit.id 109, 110, 111 (action=CLANOVI_PURGE)
  • Deliverables: A_HNS_RECONCILE.md, A_sql_transcript.sql, A_counters.json

Subagent B — Name Normalization

  • Detection counts: camelcase=0 (A handled), allcaps=4, lowercase=0, trim=1, multispace=0
  • Auto-applied (conf ≥ 0.9): 1 — id=634 trim "Zoran ""Zoran"
  • Held for manual review (conf 0.50.89): 4 entries (2 rows fully ALL CAPS, no source evidence): id=4863 (PETAR MARŠIĆ) and id=4904 (ANDRIJA ZRINSKI). Both source='manual' — Damir's call.
  • Skipped intentionally: id=707 prezime="ml." (junior-suffix abbreviation, valid)
  • Audit rows: sys_audit.id 112 (action=CLANOVI_NAME_NORMALIZE)
  • Deliverables: B_NAME_FIXES.md, B_NAME_FIXES_applied.json, B_NAME_FIXES_review.json, B_sql_transcript.sql

Subagent C — Cross-Klub Stale Transfers

  • Strict matches (same hns_igrac_id, ≥2 klubs): 0
  • Strict matches (same lower(ime)+lower(prezime)+datum_rodenja, ≥2 klubs): 0 of 684 rows with DOB
  • Soft matches (name-only, no DOB): 56 groups / 117 rows — all written to C_TRANSFERS.json review queue. NOT mutated. Reasoning: rows are recent multi-source ingestion artifacts (HOO godisnjak / HBS savez / HNS semafor / klub_web within 5-day window), all aktivni='aktivan' — per "both active + within 30 days = LEGITIMATE" rule, demoting could mis-tag distinct people sharing common Croatian names.
  • Mutations: 0 (halt-if-unsure honored)
  • Audit rows: sys_audit.id 113 (action=C_DETECTION_RUN, payload contains the 56 groups)
  • Deliverables: C_TRANSFERS.md, C_TRANSFERS.json, C_sql_transcript.sql

Subagent D — Schema Quality Constraints

  • Applied:
    • clanovi_no_camelcase_chk — CHECK rejects internal lower→upper boundary in ime/prezime (0 violators)
    • clanovi_trimmed_chk — CHECK enforces ime = trim(ime) AND prezime = trim(prezime) (0 violators)
    • clanovi_hns_uniq — UNIQUE INDEX on hns_igrac_id partial WHERE NOT NULL AND != '' (validated post-A)
    • clanovi_normalize_trigger + pgz_sport.clanovi_normalize_fn() — BEFORE INSERT/UPDATE: trims, rejects CamelCase, rejects len<2 on insert or real name-change update
  • Already in place: clanovi_spol_check (spol IN ('M','Ž',NULL))
  • Skipped (with violator detail in D_violations.md):
    • length≥2 CHECK — 22 historical rows (ime='-' placeholder cluster + 2 single-letter prezime). Trigger blocks new offenders.
    • (klub_id, lower(ime), lower(prezime), COALESCE(datum_rodenja,'0001-01-01')) UNIQUE — 68 dup groups, mostly klub_id=2362 (HNK Rijeka) with NULL DOB on both sides. Existing uq_clanovi_klub_profile (klub_id, profile_url) plus new clanovi_hns_uniq cover real ingestion paths.
  • Smoke test: 10 BEGIN/ROLLBACK scenarios passed — CamelCase, len<2, dup hns_igrac_id rejected; trim-only inserts succeed; multiple NULL hns_igrac_id rows coexist; existing 22 short-name rows can still UPDATE non-name fields.
  • Deliverables: D_CONSTRAINTS.sql, D_CONSTRAINTS.md, D_violations.md

End-to-End Smoke Tests (5 live curl)

# Endpoint HTTP Expected Actual
1 GET /sport/api/crm/clanovi/search?klub_id=2205&limit=50 200 klub 2205 (HNK Lovran) clanovi=30 (was 31), Manuel Boras Mandić id=481 with pozicija=Vratar ✓ 30 rows; id=481 ime=Manuel prezime=Boras Mandić pozicija=Vratar
2 GET /sport/api/crm/clanovi/481/full 200 row 481 retrievable
3 GET /sport/api/crm/clanarine?limit=3 200 3 rows, JSON shape unchanged ✓ count=3, schema OK
4 GET /sport/api/v2/audit/coverage-matrix?limit=10 200 klubovi audit list returns ✓ first row VK Primorje sportasa=279
5 GET /sport/api/crm/stats 200 dashboard stats render ✓ JSON valid

Note on test #5: stats endpoint shows aktivni=3245 while live DB count is 3240. This 5-row delta is pre-existing — observed before this sweep started, caused by an upstream cache or alternate count source. It is NOT introduced by the integrity work and is out of scope. Filed for later investigation.

Verification (data invariants)

  • SELECT count(*) FROM pgz_sport.clanovi_backup_20260505_0836; = 3243 (untouched, matches pre-sweep live)
  • SELECT count(*) FROM pgz_sport.clanovi; = 3240
  • 3243 3 (purged) = 3240
  • SELECT count(*) FROM pgz_sport.clanovi_purged WHERE purged_at::date = current_date; = 3
  • SELECT count(*) FROM pgz_sport.sys_audit WHERE action LIKE 'CLANOVI_%' OR action='C_DETECTION_RUN'; = 5
  • pgz_sport.clanovi_normalize_trigger enabled (SELECT tgenabled FROM pg_trigger WHERE tgname='clanovi_normalize_trigger'; = O)
  • clanovi_hns_uniq index present (\di pgz_sport.*hns*)
  • 5 routers verified live: clan_panel_router, clanarine_router (crm prefix), crm_extras_router, audit_coverage_router (v2 prefix), pgz_sport_api /health

Operational notes for Damir

  • 4 ALL CAPS review entries (B) and 56 soft cross-klub groups (C) await human decision — see B_NAME_FIXES_review.json and C_TRANSFERS.json.
  • Backup table pgz_sport.clanovi_backup_20260505_0836 retained (rinet convention — keep until next monthly cleanup).
  • Schema is now lock-down: no future ingestion can introduce CamelCase, untrimmed, or duplicate hns_igrac_id records.
  • Stats endpoint cache discrepancy (5-row delta vs DB) is pre-existing; recommend verifying cache invalidation logic next sweep.