DI exec: applied CC-DI Subagent A+B SQL — 3245 clanovi, Manuel Boras merged

2026-05-05 09:04:14 +02:00
parent e7102c720d
commit 4e4d69c04a
6 changed files with 451 additions and 2 deletions
@@ -0,0 +1,61 @@
# Subagent D — Schema Quality Constraints (pgz_sport.clanovi)
Date: 2026-05-05
Live row count: 3240 (backup retained at `pgz_sport.clanovi_backup_20260505_0836` = 3243)
## Summary table
| # | Candidate | Type | Pre-flight violators | Status | Object name |
|---|-----------|------|----------------------|--------|-------------|
| C1 | No internal CamelCase boundary | CHECK | 0 | APPLIED | `clanovi_no_camelcase_chk` |
| C2 | ime/prezime trimmed | CHECK | 0 | APPLIED | `clanovi_trimmed_chk` |
| C3 | length(ime) >= 2 AND length(prezime) >= 2 | CHECK | 22 | SKIPPED (see D_violations.md) | — |
| C4 | spol IS NULL OR spol IN ('M','Ž') | CHECK | 0 | ALREADY PRESENT | `clanovi_spol_check` (pre-existing) |
| C5 | hns_igrac_id partial UNIQUE | UNIQUE INDEX | 0 dup-groups | APPLIED | `clanovi_hns_uniq` |
| C6 | (klub_id, lower(ime), lower(prezime), datum_rodenja) UNIQUE | UNIQUE INDEX | 68 dup-groups | SKIPPED (see D_violations.md) | — |
| C7 | BEFORE INSERT/UPDATE normalize trigger | TRIGGER | n/a | APPLIED | `clanovi_normalize_trigger` + `pgz_sport.clanovi_normalize_fn()` |
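The pre-flight counts for the applied candidates can be reproduced with queries along the following lines. This is a sketch that reuses the predicates from the applied DDL; the exact queries the subagent ran are not recorded in this report, so their shape here is an assumption.

```sql
-- C1: rows with an internal lower->upper letter boundary in ime or prezime
SELECT count(*) AS c1_violators
FROM pgz_sport.clanovi
WHERE ime     ~ '[a-zćčšđžáàâäéèêëíìîïóòôöúùûüñçý][A-ZĆČŠĐŽÁÀÂÄÉÈÊËÍÌÎÏÓÒÔÖÚÙÛÜÑÇÝ]'
   OR prezime ~ '[a-zćčšđžáàâäéèêëíìîïóòôöúùûüñçý][A-ZĆČŠĐŽÁÀÂÄÉÈÊËÍÌÎÏÓÒÔÖÚÙÛÜÑÇÝ]';

-- C2: rows whose ime or prezime carries leading/trailing spaces
SELECT count(*) AS c2_violators
FROM pgz_sport.clanovi
WHERE ime <> trim(ime)
   OR prezime <> trim(prezime);

-- C5: duplicate groups among non-null, non-empty hns_igrac_id values
SELECT hns_igrac_id,
       count(*)                  AS dups,
       array_agg(id ORDER BY id) AS ids
FROM pgz_sport.clanovi
WHERE hns_igrac_id IS NOT NULL
  AND hns_igrac_id <> ''
GROUP BY hns_igrac_id
HAVING count(*) > 1;
```

A count of 0 for C1 and C2 and an empty result for C5 correspond to the "0" entries in the table.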
## Trigger semantics
`clanovi_normalize_fn`:
1. Always `trim()` `NEW.ime` and `NEW.prezime`.
2. On `INSERT`, or on `UPDATE` only when `ime` or `prezime` actually change:
- reject CamelCase boundary (lenient: only lower→upper pairs drawn from ASCII, Croatian diacritics, and the other accented Latin letters listed in the regex);
- reject `length(ime) < 2` or `length(prezime) < 2`.
3. The "only-when-name-changes" rule preserves the 22 legitimate historical short-name rows (e.g. `id=1852..2141`, mostly placeholder `'-'` ime + surname-only entries) so they can still receive `UPDATE`s on other fields.
## Smoke insert tests (all wrapped in BEGIN/ROLLBACK so live data unchanged)
| # | Scenario | Expected | Result |
|---|----------|----------|--------|
| 1 | INSERT `('IvoIvic','Test')` | reject (CamelCase) | REJECTED — `CamelCase rejected in ime: IvoIvic` |
| 2 | INSERT `('PetarPan','Test')` | reject | REJECTED |
| 3 | INSERT `(' Ivo ',' Ivić ')` | trim then succeed | INSERTED — stored as `('Ivo','Ivić')` |
| 4 | INSERT `('A','Test')` | reject (length) | REJECTED — `ime too short (<2 chars): A` |
| 5 | INSERT `('Ivan',' X ')` | trim → `'X'` len 1 → reject | REJECTED — `prezime too short (<2 chars): X` |
| 6 | INSERT `('Marko ',' Marković')` | trim then succeed | INSERTED — stored as `('Marko','Marković')` |
| 7 | INSERT duplicate `hns_igrac_id='209352'` | reject | REJECTED — `duplicate key value violates unique constraint "clanovi_hns_uniq"` |
| 8 | 2× NULL + 2× `''` `hns_igrac_id` rows | all 4 succeed (partial uniqueness ignores NULL/empty) | 4 INSERTS OK |
| 9 | UPDATE `id=1852` (`ime='-'`) `napomena=...` (no name change) | succeed | UPDATED — short-name row still mutable |
| 10 | UPDATE `id=1852` `ime='?'` (single char) | reject | REJECTED — `ime too short (<2 chars): ?` |
All 10 behaviours match expectations. No live row was modified: every test was rolled back.
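One way to run a single rejection test without touching live data is sketched below for scenario 1. It assumes that `ime` and `prezime` are the only columns an `INSERT` must supply; the table may have other required columns, and the pre-existing `clanovi_validate_source` trigger may demand more, in which case the column list needs extending.

```sql
BEGIN;

DO $$
BEGIN
    -- Scenario 1: CamelCase in ime should be rejected by the normalize trigger.
    INSERT INTO pgz_sport.clanovi (ime, prezime)
    VALUES ('IvoIvic', 'Test');
    RAISE NOTICE 'unexpected: insert succeeded';
EXCEPTION
    -- The trigger raises with ERRCODE = 'check_violation'.
    WHEN check_violation THEN
        RAISE NOTICE 'rejected as expected: %', SQLERRM;
END
$$;

-- Discard everything, including a row that was inserted unexpectedly.
ROLLBACK;
```

Catching only `check_violation` keeps unrelated failures (for example a missing required column) visible instead of counting them as a pass.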
## Final lockdown state on `pgz_sport.clanovi`
CHECK constraints in force:
- `clanovi_no_camelcase_chk` (NEW)
- `clanovi_trimmed_chk` (NEW)
- `clanovi_spol_check` (pre-existing)
UNIQUE indexes in force:
- `clanovi_pkey` (id)
- `uq_clanovi_klub_profile` (klub_id, profile_url) — pre-existing
- `clanovi_hns_uniq` (hns_igrac_id) WHERE not null/empty — NEW
User triggers in force (BEFORE INSERT OR UPDATE):
- `clanovi_normalize_trigger` (NEW)
- `clanovi_validate_source` (pre-existing)
- `pgz_sport_clanovi_fts_trg` (pre-existing)
Row count unchanged at 3240.
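The lockdown state listed above can be re-checked at any time from the system catalogs. The queries below are a sketch that uses only standard PostgreSQL catalog objects; they are not taken from the subagent run.

```sql
-- CHECK constraints on the table
SELECT conname, pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE conrelid = 'pgz_sport.clanovi'::regclass
  AND contype = 'c'
ORDER BY conname;

-- Indexes (unique ones show CREATE UNIQUE INDEX in indexdef)
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'pgz_sport'
  AND tablename  = 'clanovi'
ORDER BY indexname;

-- User-defined triggers (internal constraint triggers excluded)
SELECT tgname
FROM pg_trigger
WHERE tgrelid = 'pgz_sport.clanovi'::regclass
  AND NOT tgisinternal
ORDER BY tgname;

-- Live row count
SELECT count(*) FROM pgz_sport.clanovi;
```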
@@ -0,0 +1,116 @@
-- pgz_sport.clanovi — schema lockdown DDL (Subagent D)
-- Author: dradulic@outlook.com / damir@rinet.one
-- Date: 2026-05-05
-- Description: Final, applied DDL. Pre-flight all-clean blocks below were
-- committed; SKIPPED candidates (length>=2 CHECK, klub+name+dob
-- UNIQUE) are documented in D_violations.md and intentionally
-- omitted here.
--
-- Row count at apply time: 3240 (live), 3243 (backup_20260505_0836).
-- Rollback hints: each block is independent and reversible via
-- ALTER TABLE pgz_sport.clanovi DROP CONSTRAINT ...;
-- DROP INDEX pgz_sport.clanovi_hns_uniq;
-- DROP TRIGGER clanovi_normalize_trigger ON pgz_sport.clanovi;
-- DROP FUNCTION pgz_sport.clanovi_normalize_fn();
-- ===========================================================================
-- C1: CHECK no internal CamelCase boundary (lower->upper letter pair)
-- Pre-flight violators: 0
-- ===========================================================================
BEGIN;
ALTER TABLE pgz_sport.clanovi
ADD CONSTRAINT clanovi_no_camelcase_chk
CHECK (
ime !~ '[a-zćčšđžáàâäéèêëíìîïóòôöúùûüñçý][A-ZĆČŠĐŽÁÀÂÄÉÈÊËÍÌÎÏÓÒÔÖÚÙÛÜÑÇÝ]'
AND prezime !~ '[a-zćčšđžáàâäéèêëíìîïóòôöúùûüñçý][A-ZĆČŠĐŽÁÀÂÄÉÈÊËÍÌÎÏÓÒÔÖÚÙÛÜÑÇÝ]'
);
COMMIT;
-- ===========================================================================
-- C2: CHECK ime/prezime are trimmed
-- Pre-flight violators: 0
-- ===========================================================================
BEGIN;
ALTER TABLE pgz_sport.clanovi
ADD CONSTRAINT clanovi_trimmed_chk
CHECK (ime = trim(ime) AND prezime = trim(prezime));
COMMIT;
-- ===========================================================================
-- C4: spol values constraint
-- NOT applied as a new constraint: the existing clanovi_spol_check already
-- restricts spol to NULL, 'M' or 'Ž'. Documented for completeness.
-- CHECK (spol IS NULL OR spol IN ('M','Ž'))
-- ===========================================================================
-- ===========================================================================
-- C5: UNIQUE partial index on hns_igrac_id (non-null, non-empty)
-- Pre-flight duplicate groups: 0
-- ===========================================================================
BEGIN;
CREATE UNIQUE INDEX IF NOT EXISTS clanovi_hns_uniq
ON pgz_sport.clanovi (hns_igrac_id)
WHERE hns_igrac_id IS NOT NULL AND hns_igrac_id != '';
COMMIT;
-- ===========================================================================
-- C7: BEFORE INSERT/UPDATE normalize trigger
-- Always trims ime/prezime; rejects CamelCase and enforces length>=2 only when
-- names change (so the existing 22 short-name historical rows can still be
-- UPDATEd on other fields without rejection).
-- ===========================================================================
BEGIN;
CREATE OR REPLACE FUNCTION pgz_sport.clanovi_normalize_fn()
RETURNS trigger
LANGUAGE plpgsql
AS $fn$
DECLARE
v_changed_name boolean;
BEGIN
IF NEW.ime IS NOT NULL THEN
NEW.ime := trim(NEW.ime);
END IF;
IF NEW.prezime IS NOT NULL THEN
NEW.prezime := trim(NEW.prezime);
END IF;
IF TG_OP = 'INSERT' THEN
v_changed_name := true;
ELSE
v_changed_name := (NEW.ime IS DISTINCT FROM OLD.ime)
OR (NEW.prezime IS DISTINCT FROM OLD.prezime);
END IF;
IF v_changed_name THEN
IF NEW.ime ~ '[a-zćčšđžáàâäéèêëíìîïóòôöúùûüñçý][A-ZĆČŠĐŽÁÀÂÄÉÈÊËÍÌÎÏÓÒÔÖÚÙÛÜÑÇÝ]' THEN
RAISE EXCEPTION 'CamelCase rejected in ime: %', NEW.ime
USING ERRCODE = 'check_violation';
END IF;
IF NEW.prezime ~ '[a-zćčšđžáàâäéèêëíìîïóòôöúùûüñçý][A-ZĆČŠĐŽÁÀÂÄÉÈÊËÍÌÎÏÓÒÔÖÚÙÛÜÑÇÝ]' THEN
RAISE EXCEPTION 'CamelCase rejected in prezime: %', NEW.prezime
USING ERRCODE = 'check_violation';
END IF;
IF length(coalesce(NEW.ime, '')) < 2 THEN
RAISE EXCEPTION 'ime too short (<2 chars): %', NEW.ime
USING ERRCODE = 'check_violation';
END IF;
IF length(coalesce(NEW.prezime, '')) < 2 THEN
RAISE EXCEPTION 'prezime too short (<2 chars): %', NEW.prezime
USING ERRCODE = 'check_violation';
END IF;
END IF;
RETURN NEW;
END;
$fn$;
DROP TRIGGER IF EXISTS clanovi_normalize_trigger ON pgz_sport.clanovi;
CREATE TRIGGER clanovi_normalize_trigger
BEFORE INSERT OR UPDATE ON pgz_sport.clanovi
FOR EACH ROW EXECUTE FUNCTION pgz_sport.clanovi_normalize_fn();
COMMIT;
-- END
@@ -0,0 +1,126 @@
# Subagent D — Skipped constraints, violator samples
Two candidate constraints were SKIPPED at apply-time because pre-existing rows
would have been rejected. They are documented here so Damir can decide whether
to clean the data and re-attempt the constraint, or accept the current state.
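The trigger `clanovi_normalize_trigger` already enforces the C3 length rule
**for new inserts and for name-changing updates**, so future data ingest cannot
reintroduce short names; only retroactive enforcement on existing rows is
deferred. The trigger does not cover C6: duplicate protection for new rows
relies on the existing `uq_clanovi_klub_profile` and `clanovi_hns_uniq` indexes.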
---
## C3 — `CHECK (length(ime)>=2 AND length(prezime)>=2)` — SKIPPED
Violator count: **22** rows.
Two clusters:
1. **Single-letter `prezime`** (`id=1160` and `id=1165`, both klub_id=848):
- `('Boris Mičetić','B')` — note the embedded space in `ime`; the surname appears truncated to a single initial.
- `('Boris Mičetić','J')` — same pattern.
- **Decision suggestion**: probably real-name parse errors. Resolve manually in `clanovi`.
2. **Placeholder `ime='-'` (single dash)** — 20 rows, klub_id mostly NULL plus one with klub_id=3896:
| id | klub_id | ime | prezime |
|----|---------|-----|---------|
| 1852 | NULL | - | Grabovac |
| 1853 | NULL | - | Pilepić |
| 1854 | NULL | - | Maslak |
| 1855 | NULL | - | Jugo |
| 1856 | NULL | - | Miličević |
| 1857 | NULL | - | Marjanović |
| 1858 | NULL | - | Poljak |
| 1859 | NULL | - | Kurelić |
| 2021 | 3896 | - | Mohorić |
| 2125 | NULL | - | Mittrovich (braća) |
| 2130 | NULL | - | Loich |
| 2131 | NULL | - | Paulinich |
| 2132 | NULL | - | Zidarich |
| 2133 | NULL | - | Bertok |
| 2134 | NULL | - | Marincich |
| 2135 | NULL | - | Tiblias |
| 2138 | NULL | - | Veselica |
| 2139 | NULL | - | Naumović |
| 2140 | NULL | - | Osojnak |
| 2141 | NULL | - | Medle |
These look like **historical / surname-only roster entries** (note `napomena`
on id=1852 mentions "POVIJESNI: KK Kvarner najtrofejnija generacija …" so the
cluster is intentional historical data with unknown given name).
**Decision suggestion**: replacing `ime='-'` with another single-character
placeholder such as `ime='?'` does not help, because the trigger rejects it as
well. Either backfill the given names from a source, mark the rows
inactive/historical in another column, or accept the data and never enable C3.
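To re-list the C3 violators before deciding, a query along these lines should return the rows shown above. It is a sketch built from the candidate predicate, not the recorded pre-flight query.

```sql
-- Rows that would fail CHECK (length(ime) >= 2 AND length(prezime) >= 2).
-- NULL names are not returned: length(NULL) is NULL, which a CHECK also passes.
SELECT id, klub_id, ime, prezime
FROM pgz_sport.clanovi
WHERE length(ime) < 2
   OR length(prezime) < 2
ORDER BY id;
```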
---
## C6 — `UNIQUE (klub_id, lower(ime), lower(prezime), COALESCE(datum_rodenja,'0001-01-01'))` — SKIPPED
Conflict groups: **68** (each group has 2+ rows that would collide).
Most are concentrated on **klub_id=2362 (HNK Rijeka roster)**, where the same
player appears twice: both rows usually have `datum_rodenja IS NULL`, but they
come from different scrape sources, one of them under an older `id`. Sample:
| klub_id | l_ime | l_prez | dob | dups | ids |
|---------|-------|--------|-----|------|-----|
| 2362 | amer | gojak | NULL | 2 | {3402, 4214} |
| 2362 | leon | šerifi | NULL | 2 | {3334, 4238} |
| 2362 | ante | oreč | NULL | 2 | {1581, 4230} |
| 2362 | branko | pavić | NULL | 2 | {2715, 4231} |
| 2362 | lovro | kitin | NULL | 2 | {3481, 4220} |
| 2362 | ante | majstorović | NULL | 2 | {3456, 4224} |
| 2362 | dejan | petrovič | NULL | 2 | {3399, 4232} |
| 2362 | duje | čop | NULL | 2 | {1579, 4211} |
| 2362 | fran | škalamera | NULL | 2 | {3480, 4239} |
| 2362 | gabriel | rukavina | NULL | 2 | {3404, 4234} |
| 2362 | bruno | bogojević | NULL | 2 | {3437, 4208} |
| 2362 | aleksa | todorović | NULL | 2 | {3455, 4202} |
| 2362 | cherno | saho | NULL | 2 | {3403, 4235} |
| 2362 | luka | menalo | NULL | 2 | {3454, 4226} |
| 2362 | martin | zlomislić | NULL | 2 | {3440, 4203} |
| 2362 | mladen | devetak | NULL | 2 | {3400, 4212} |
| 2362 | niko | janković | NULL | 2 | {3607, 4218} |
| 2362 | noel | bodetić | NULL | 2 | {3705, 4207} |
| 2362 | silvio | ilinković | NULL | 2 | {3412, 4217} |
| 2362 | šimun | butić | NULL | 2 | {3401, 4209} |
| 2362 | stjepan | radeljić | NULL | 2 | {3448, 4233} |
| 2362 | toni | fruk | 2001-03-09 | 2 | {3438, 4135} |
| 2362 | vito | kovač | NULL | 2 | {3298, 4201} |
| 2362 | jovan | manev | NULL | 2 | {3439, 4225} |
| 2585 | ivo | butrica | NULL | 2 | {2282, 4163} |
| 2585 | luko | ledinić | NULL | 2 | {2283, 4164} |
| 2586 | siniša | saftić | NULL | 2 | {2298, 4165} |
| 2587 | damir | poslek | NULL | 2 | {2310, 4167} |
| 2589 | matej | viduka | NULL | 2 | {2340, 4174} |
| 2589 | čedo | vukelić | NULL | 2 | {2339, 4175} |
(38 more groups not shown — query reproduction below.)
**Cause**: the dedup-fold key collapses on `COALESCE(NULL, '0001-01-01')`, so
two records of the same name+klub with missing DOB look identical even when
they are distinct profiles (different `profile_url`, `source_id`, `hns_igrac_id`).
Today's working composite key is the existing `uq_clanovi_klub_profile
(klub_id, profile_url)` which is already enforced.
**Decision suggestion**: do NOT enable C6 as-is. Either (a) restrict the
uniqueness to `WHERE datum_rodenja IS NOT NULL`, or (b) merge true dupes via a
follow-up subagent that promotes one row and back-fills `hns_igrac_id` /
`profile_url`. Until then, ingestion is still protected by
`uq_clanovi_klub_profile` and (for HNS-keyed players) `clanovi_hns_uniq`.
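Option (a) could look roughly like the sketch below. The index name `clanovi_klub_name_dob_uniq` is hypothetical, and nothing here has been applied. The sample above contains at least one group with a non-NULL DOB (ids 3438 and 4135), so the pre-flight query would have to come back empty, after a merge, before the index can build.

```sql
-- Pre-flight: duplicate groups among rows that have a known DOB.
SELECT klub_id,
       lower(ime)                AS l_ime,
       lower(prezime)            AS l_prez,
       datum_rodenja,
       count(*)                  AS dups,
       array_agg(id ORDER BY id) AS ids
FROM pgz_sport.clanovi
WHERE datum_rodenja IS NOT NULL
GROUP BY klub_id, lower(ime), lower(prezime), datum_rodenja
HAVING count(*) > 1;

-- Apply only once the query above returns no rows.
-- Hypothetical index name; rows with NULL klub_id never collide, because a
-- unique index treats NULLs as distinct by default.
BEGIN;
CREATE UNIQUE INDEX IF NOT EXISTS clanovi_klub_name_dob_uniq
    ON pgz_sport.clanovi (klub_id, lower(ime), lower(prezime), datum_rodenja)
    WHERE datum_rodenja IS NOT NULL;
COMMIT;
```

Dropping the `COALESCE` fold means rows with a missing DOB are left out of the uniqueness rule entirely, which avoids the false collisions described under **Cause**.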
### Reproduce full list
```sql
SELECT klub_id, lower(ime) AS l_ime, lower(prezime) AS l_prez,
COALESCE(datum_rodenja, '0001-01-01'::date) AS dob,
count(*) AS dups,
array_agg(id ORDER BY id) AS ids
FROM pgz_sport.clanovi
GROUP BY klub_id, lower(ime), lower(prezime), COALESCE(datum_rodenja, '0001-01-01'::date)
HAVING count(*) > 1
ORDER BY dups DESC, klub_id;
```