DI exec: applied CC-DI Subagent A+B SQL — 3245 clanovi, Manuel Boras merged
This commit is contained in:
@@ -0,0 +1,126 @@
|
||||
# Subagent D — Skipped constraints, violator samples
|
||||
|
||||
Two candidate constraints were SKIPPED at apply-time because pre-existing rows
|
||||
would have been rejected. They are documented here so Damir can decide whether
|
||||
to clean the data and re-attempt the constraint, or accept the current state.
|
||||
|
||||
The trigger `clanovi_normalize_trigger` already enforces both rules **for new
|
||||
inserts and for name-changing updates**, so future data ingest cannot
|
||||
reintroduce these patterns. Only retroactive enforcement on existing rows is
|
||||
deferred.
|
||||
|
||||
---
|
||||
|
||||
## C3 — `CHECK (length(ime)>=2 AND length(prezime)>=2)` — SKIPPED
|
||||
|
||||
Violator count: **22** rows.
|
||||
|
||||
Two clusters:
|
||||
|
||||
1. **Single-letter `prezime`** — `id=1160` and `id=1165`, both klub_id=848:
|
||||
- `('Boris Mičetić','B')` — note the embedded space in `ime`; the surname appears truncated to a single initial.
|
||||
- `('Boris Mičetić','J')` — same pattern.
|
||||
- **Decision suggestion**: probably real-name parse errors. Resolve manually in `clanovi`.
|
||||
|
||||
2. **Placeholder `ime='-'` (single dash)** — 20 rows, klub_id mostly NULL plus one with klub_id=3896:
|
||||
|
||||
| id | klub_id | ime | prezime |
|
||||
|----|---------|-----|---------|
|
||||
| 1852 | NULL | - | Grabovac |
|
||||
| 1853 | NULL | - | Pilepić |
|
||||
| 1854 | NULL | - | Maslak |
|
||||
| 1855 | NULL | - | Jugo |
|
||||
| 1856 | NULL | - | Miličević |
|
||||
| 1857 | NULL | - | Marjanović |
|
||||
| 1858 | NULL | - | Poljak |
|
||||
| 1859 | NULL | - | Kurelić |
|
||||
| 2021 | 3896 | - | Mohorić |
|
||||
| 2125 | NULL | - | Mittrovich (braća) |
|
||||
| 2130 | NULL | - | Loich |
|
||||
| 2131 | NULL | - | Paulinich |
|
||||
| 2132 | NULL | - | Zidarich |
|
||||
| 2133 | NULL | - | Bertok |
|
||||
| 2134 | NULL | - | Marincich |
|
||||
| 2135 | NULL | - | Tiblias |
|
||||
| 2138 | NULL | - | Veselica |
|
||||
| 2139 | NULL | - | Naumović |
|
||||
| 2140 | NULL | - | Osojnak |
|
||||
| 2141 | NULL | - | Medle |
|
||||
|
||||
These look like **historical / surname-only roster entries** (note `napomena`
|
||||
on id=1852 mentions "POVIJESNI: KK Kvarner najtrofejnija generacija …" so the
|
||||
cluster is intentional historical data with unknown given name).
|
||||
|
||||
**Decision suggestion**: replace `ime='-'` with `ime='?'` is also rejected;
|
||||
either backfill the given names from a source, mark them inactive/historical
|
||||
in another column, or accept the data and never enable C3.
|
||||
|
||||
---
|
||||
|
||||
## C6 — `UNIQUE (klub_id, lower(ime), lower(prezime), COALESCE(datum_rodenja,'0001-01-01'))` — SKIPPED
|
||||
|
||||
Conflict groups: **68** (each group has 2+ rows that would collide).
|
||||
|
||||
Most are concentrated on **klub_id=2362 (HNK Rijeka roster)** where the same
|
||||
player appears twice — once with `datum_rodenja IS NULL` and once also with
|
||||
NULL DOB but a different scrape source / older `id`. Sample:
|
||||
|
||||
| klub_id | l_ime | l_prez | dob | dups | ids |
|
||||
|---------|-------|--------|-----|------|-----|
|
||||
| 2362 | amer | gojak | NULL | 2 | {3402, 4214} |
|
||||
| 2362 | leon | šerifi | NULL | 2 | {3334, 4238} |
|
||||
| 2362 | ante | oreč | NULL | 2 | {1581, 4230} |
|
||||
| 2362 | branko | pavić | NULL | 2 | {2715, 4231} |
|
||||
| 2362 | lovro | kitin | NULL | 2 | {3481, 4220} |
|
||||
| 2362 | ante | majstorović | NULL | 2 | {3456, 4224} |
|
||||
| 2362 | dejan | petrovič | NULL | 2 | {3399, 4232} |
|
||||
| 2362 | duje | čop | NULL | 2 | {1579, 4211} |
|
||||
| 2362 | fran | škalamera | NULL | 2 | {3480, 4239} |
|
||||
| 2362 | gabriel | rukavina | NULL | 2 | {3404, 4234} |
|
||||
| 2362 | bruno | bogojević | NULL | 2 | {3437, 4208} |
|
||||
| 2362 | aleksa | todorović | NULL | 2 | {3455, 4202} |
|
||||
| 2362 | cherno | saho | NULL | 2 | {3403, 4235} |
|
||||
| 2362 | luka | menalo | NULL | 2 | {3454, 4226} |
|
||||
| 2362 | martin | zlomislić | NULL | 2 | {3440, 4203} |
|
||||
| 2362 | mladen | devetak | NULL | 2 | {3400, 4212} |
|
||||
| 2362 | niko | janković | NULL | 2 | {3607, 4218} |
|
||||
| 2362 | noel | bodetić | NULL | 2 | {3705, 4207} |
|
||||
| 2362 | silvio | ilinković | NULL | 2 | {3412, 4217} |
|
||||
| 2362 | šimun | butić | NULL | 2 | {3401, 4209} |
|
||||
| 2362 | stjepan | radeljić | NULL | 2 | {3448, 4233} |
|
||||
| 2362 | toni | fruk | 2001-03-09 | 2 | {3438, 4135} |
|
||||
| 2362 | vito | kovač | NULL | 2 | {3298, 4201} |
|
||||
| 2362 | jovan | manev | NULL | 2 | {3439, 4225} |
|
||||
| 2585 | ivo | butrica | NULL | 2 | {2282, 4163} |
|
||||
| 2585 | luko | ledinić | NULL | 2 | {2283, 4164} |
|
||||
| 2586 | siniša | saftić | NULL | 2 | {2298, 4165} |
|
||||
| 2587 | damir | poslek | NULL | 2 | {2310, 4167} |
|
||||
| 2589 | matej | viduka | NULL | 2 | {2340, 4174} |
|
||||
| 2589 | čedo | vukelić | NULL | 2 | {2339, 4175} |
|
||||
|
||||
(38 more groups not shown — query reproduction below.)
|
||||
|
||||
**Cause**: the dedup-fold key collapses on `COALESCE(NULL, '0001-01-01')`, so
|
||||
two records of the same name+klub with missing DOB look identical even when
|
||||
they are distinct profiles (different `profile_url`, `source_id`, `hns_igrac_id`).
|
||||
Today's working composite key is the existing `uq_clanovi_klub_profile
|
||||
(klub_id, profile_url)` which is already enforced.
|
||||
|
||||
**Decision suggestion**: do NOT enable C6 as-is. Either (a) restrict the
|
||||
uniqueness to `WHERE datum_rodenja IS NOT NULL`, or (b) merge true dupes via a
|
||||
follow-up subagent that promotes one row and back-fills `hns_igrac_id` /
|
||||
`profile_url`. Until then, ingestion is still protected by
|
||||
`uq_clanovi_klub_profile` and (for HNS-keyed players) `clanovi_hns_uniq`.
|
||||
|
||||
### Reproduce full list
|
||||
|
||||
```sql
|
||||
SELECT klub_id, lower(ime) AS l_ime, lower(prezime) AS l_prez,
|
||||
COALESCE(datum_rodenja, '0001-01-01'::date) AS dob,
|
||||
count(*) AS dups,
|
||||
array_agg(id ORDER BY id) AS ids
|
||||
FROM pgz_sport.clanovi
|
||||
GROUP BY klub_id, lower(ime), lower(prezime), COALESCE(datum_rodenja, '0001-01-01'::date)
|
||||
HAVING count(*) > 1
|
||||
ORDER BY dups DESC, klub_id;
|
||||
```
|
||||
Reference in New Issue
Block a user