Files
pgz-sport/_audit/data_integrity_20260505_0836/D_violations.md
T

5.4 KiB

Subagent D — Skipped constraints, violator samples

Two candidate constraints were SKIPPED at apply-time because pre-existing rows would have been rejected. They are documented here so Damir can decide whether to clean the data and re-attempt the constraint, or accept the current state.

The trigger clanovi_normalize_trigger already enforces both rules for new inserts and for name-changing updates, so future data ingest cannot reintroduce these patterns. Only retroactive enforcement on existing rows is deferred.


C3 — CHECK (length(ime)>=2 AND length(prezime)>=2) — SKIPPED

Violator count: 22 rows.

Two clusters:

  1. Single-letter prezimeid=1160 and id=1165, both klub_id=848:

    • ('Boris Mičetić','B') — note the embedded space in ime; the surname appears truncated to a single initial.
    • ('Boris Mičetić','J') — same pattern.
    • Decision suggestion: probably real-name parse errors. Resolve manually in clanovi.
  2. Placeholder ime='-' (single dash) — 20 rows, klub_id mostly NULL plus one with klub_id=3896:

id klub_id ime prezime
1852 NULL - Grabovac
1853 NULL - Pilepić
1854 NULL - Maslak
1855 NULL - Jugo
1856 NULL - Miličević
1857 NULL - Marjanović
1858 NULL - Poljak
1859 NULL - Kurelić
2021 3896 - Mohorić
2125 NULL - Mittrovich (braća)
2130 NULL - Loich
2131 NULL - Paulinich
2132 NULL - Zidarich
2133 NULL - Bertok
2134 NULL - Marincich
2135 NULL - Tiblias
2138 NULL - Veselica
2139 NULL - Naumović
2140 NULL - Osojnak
2141 NULL - Medle

These look like historical / surname-only roster entries (note napomena on id=1852 mentions "POVIJESNI: KK Kvarner najtrofejnija generacija …" so the cluster is intentional historical data with unknown given name).

Decision suggestion: replace ime='-' with ime='?' is also rejected; either backfill the given names from a source, mark them inactive/historical in another column, or accept the data and never enable C3.


C6 — UNIQUE (klub_id, lower(ime), lower(prezime), COALESCE(datum_rodenja,'0001-01-01')) — SKIPPED

Conflict groups: 68 (each group has 2+ rows that would collide).

Most are concentrated on klub_id=2362 (HNK Rijeka roster) where the same player appears twice — once with datum_rodenja IS NULL and once also with NULL DOB but a different scrape source / older id. Sample:

klub_id l_ime l_prez dob dups ids
2362 amer gojak NULL 2 {3402, 4214}
2362 leon šerifi NULL 2 {3334, 4238}
2362 ante oreč NULL 2 {1581, 4230}
2362 branko pavić NULL 2 {2715, 4231}
2362 lovro kitin NULL 2 {3481, 4220}
2362 ante majstorović NULL 2 {3456, 4224}
2362 dejan petrovič NULL 2 {3399, 4232}
2362 duje čop NULL 2 {1579, 4211}
2362 fran škalamera NULL 2 {3480, 4239}
2362 gabriel rukavina NULL 2 {3404, 4234}
2362 bruno bogojević NULL 2 {3437, 4208}
2362 aleksa todorović NULL 2 {3455, 4202}
2362 cherno saho NULL 2 {3403, 4235}
2362 luka menalo NULL 2 {3454, 4226}
2362 martin zlomislić NULL 2 {3440, 4203}
2362 mladen devetak NULL 2 {3400, 4212}
2362 niko janković NULL 2 {3607, 4218}
2362 noel bodetić NULL 2 {3705, 4207}
2362 silvio ilinković NULL 2 {3412, 4217}
2362 šimun butić NULL 2 {3401, 4209}
2362 stjepan radeljić NULL 2 {3448, 4233}
2362 toni fruk 2001-03-09 2 {3438, 4135}
2362 vito kovač NULL 2 {3298, 4201}
2362 jovan manev NULL 2 {3439, 4225}
2585 ivo butrica NULL 2 {2282, 4163}
2585 luko ledinić NULL 2 {2283, 4164}
2586 siniša saftić NULL 2 {2298, 4165}
2587 damir poslek NULL 2 {2310, 4167}
2589 matej viduka NULL 2 {2340, 4174}
2589 čedo vukelić NULL 2 {2339, 4175}

(38 more groups not shown — query reproduction below.)

Cause: the dedup-fold key collapses on COALESCE(NULL, '0001-01-01'), so two records of the same name+klub with missing DOB look identical even when they are distinct profiles (different profile_url, source_id, hns_igrac_id). Today's working composite key is the existing uq_clanovi_klub_profile (klub_id, profile_url) which is already enforced.

Decision suggestion: do NOT enable C6 as-is. Either (a) restrict the uniqueness to WHERE datum_rodenja IS NOT NULL, or (b) merge true dupes via a follow-up subagent that promotes one row and back-fills hns_igrac_id / profile_url. Until then, ingestion is still protected by uq_clanovi_klub_profile and (for HNS-keyed players) clanovi_hns_uniq.

Reproduce full list

SELECT klub_id, lower(ime) AS l_ime, lower(prezime) AS l_prez,
       COALESCE(datum_rodenja, '0001-01-01'::date) AS dob,
       count(*) AS dups,
       array_agg(id ORDER BY id) AS ids
FROM pgz_sport.clanovi
GROUP BY klub_id, lower(ime), lower(prezime), COALESCE(datum_rodenja, '0001-01-01'::date)
HAVING count(*) > 1
ORDER BY dups DESC, klub_id;