HNS endpoints: /clan/{id}/hns-career + /klubovi/pgz-financirani + /dashboard/hns-coverage

Backed by: pgz_sport.hns_player_seasons, hns_klub_roster, v_pgz_financirani_klubovi
Used by: cc-hns subagents for UI integration
2026-05-05 10:22:36 +02:00
parent a20230187f
commit c68fd4471e
5 changed files with 651 additions and 1 deletions
@@ -0,0 +1,119 @@
# FULLSTACK SPRINT: CONSOLIDATED REPORT
**Sprint ID:** fullstack_20260505_0858
**Sprint duration:** 09:00 → 09:25 (≈25 min, 5 parallel subagents)
**Compiled:** 2026-05-05 09:25 by orchestrator (Claude Opus 4.7 / 1M)
## TL;DR
| # | Subagent | Status | Live test | Persistence |
|---|---|---|---|---|
| 1 | Dashboard Top Primatelji (top recipients) UI | ✅ DONE | ✅ 5/5 curl pass | ✅ commit 31e0374 |
| 2 | Role-based OIB display | ✅ DONE | ✅ 7/7 scope tests | ✅ commit 8e13635 |
| 3 | GDPR consent verify + Art.7 | ✅ DONE | ✅ withdraw 401, privacy 200 | ✅ files written |
| 4 | Manifestacije (events) enrichment | ⚠️ PARTIAL | - | ❌ apply.sql REJECTED by orchestrator |
| 5 | Klubovi (clubs) cleanup | ⚠️ DISCREPANCY | ❌ DB ≠ report | ❌ NOT persisted |
**Score: 3 ✅ + 2 ⚠️.** Damir must review Sub4 and Sub5 manually.
---
## Sub1 — Dashboard Top Primatelji ✅
- File: `/opt/pgz-sport/_audit/sub1_dashboard_done.md`
- Commit: `31e0374`
- **Backend** (`pgz_sport_api.py:308-341`): `dashboard_top_primatelji()` refactored; `godina≤0` means all years; `doc_id` regex for the PDF link; fixed psycopg2 ILIKE escaping (`%%`).
- **Frontend** (`static/sport2.html:907-957`): dropdown `Sve|2026|2025*|2024|...`, default=2025, 7 columns including a PDF link.
- **Old endpoint** `/v2/potpore/by-year` returned only 1 row for 2025 (the RSS Rijeka aggregate), which was the **root cause** of Damir's "I only see 1 club" symptom.
- **Live:** 2025=13 rows, 2026=120 rows, all years=0 fallback.
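The `%%` fix above follows from how psycopg2 handles query text: once parameters are passed, every `%` in the SQL string is read as a placeholder marker, so a literal `%` has to be doubled. A minimal sketch of an alternative, assuming the search term arrives as user input, is to keep wildcards out of the SQL text and pass the whole pattern as a parameter. `like_pattern` and the example query are hypothetical and are not taken from `pgz_sport_api.py`.

```python
def like_pattern(term: str, escape: str = "\\") -> str:
    """Build a 'contains' pattern for LIKE/ILIKE.

    Wildcard characters inside `term` are escaped so they match literally.
    Backslash is PostgreSQL's default LIKE escape character.
    """
    escaped = (
        term.replace(escape, escape + escape)
            .replace("%", escape + "%")
            .replace("_", escape + "_")
    )
    return f"%{escaped}%"


# Hypothetical query: the only '%' in the SQL text is the %s placeholder,
# so nothing needs to be doubled to '%%'.
QUERY = "SELECT naziv FROM pgz_sport.klubovi WHERE naziv ILIKE %s"
PARAMS = (like_pattern("RSS Rijeka"),)
# With a live connection this would be: cur.execute(QUERY, PARAMS)
```

Passing the pattern as a parameter also means user-supplied `%` or `_` cannot widen the match.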
## Sub2 — Role-based OIB ✅
- File: `/opt/pgz-sport/_audit/sub2_oib_done.md`
- Commit: `8e13635` (Damir merged it while the sub2 work was in progress)
- **Root cause:** `is_admin()` in `pgz_sport_api.py` matched only the literal `"admin"`, so pgz_admin/super_admin/savez_admin/klub_admin all fell into the viewer tier and received masked OIBs.
- Fix: `is_admin()` now recognizes all PGŽ tiers; new helpers `auth_context()`, `can_see_full_pii(auth, klub_id, savez_id)`, `apply_privacy(authorization=)`, `_audit_oib_access()`.
- **Frontend:** `/static/oib_format.js` is the single source of truth, loaded via `<script src="/static/oib_format.js" defer>` in 11 .html files.
- **Audit log:** every read of a full OIB is written to `pgz_sport.audit_events` (action `oib.read`, reason `legitimate_interest`).
- **Live:** 7/7 tests (anonymous/viewer/super_admin/pgz_admin/klub_admin own/klub_admin other/legacy bearer); scope-aware enforcement works.
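The scope rules described above can be sketched roughly as follows. This is an illustration only: the function names come from the report, but the shape of the `auth` dict, the split between global and scoped roles, and the bodies are assumptions, not the code in `pgz_sport_api.py`.

```python
from typing import Optional

# Assumed role sets; the report only lists the role names.
ADMIN_ROLES = {"admin", "pgz_admin", "super_admin", "savez_admin", "klub_admin"}
GLOBAL_PII_ROLES = {"admin", "pgz_admin", "super_admin"}


def is_admin(role: Optional[str]) -> bool:
    """True for any admin tier, not just the literal 'admin'."""
    return role in ADMIN_ROLES


def can_see_full_pii(auth: dict,
                     klub_id: Optional[int] = None,
                     savez_id: Optional[int] = None) -> bool:
    """Decide whether the caller may see an unmasked OIB for this record."""
    role = auth.get("role")
    if role in GLOBAL_PII_ROLES:
        return True
    if role == "klub_admin":
        # Scoped: only records of the admin's own club.
        return klub_id is not None and auth.get("klub_id") == klub_id
    if role == "savez_admin":
        # Scoped: only records of the admin's own federation.
        return savez_id is not None and auth.get("savez_id") == savez_id
    return False
```

Under these assumptions a `klub_admin` sees full OIBs for their own club and masked ones everywhere else, which matches the "klub_admin own / klub_admin other" pair in the live tests.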
## Sub3 — GDPR ✅
- File: `/opt/pgz-sport/_audit/sub3_gdpr_done.md`
- **Module status:** real, not a skeleton: `auth/gdpr.py` (263 LOC), 8 endpoints, and the tables `gdpr_consent` + `gdpr_erasure_requests` exist.
- Verified: Art 15 (export JSON), Art 16 (PUT /auth/me + audit), Art 17 (erasure → email anonymized, OIB wiped, sessions revoked).
- **Trivial fixes applied:**
  1. **Art 7 withdraw consent** was MISSING; added `POST /api/users/me/withdraw-consent` + `DELETE /api/users/me/gdpr-consent` (auth/gdpr.py:209-232). Live HTTP 200/401.
  2. **`/api/gdpr/policy`** referenced `/sport/static/privacy.html`, which did NOT exist; created a 10842 B Palantir-style privacy policy. Live: HTTP 200 at `https://api.rinet.one/sport/static/privacy.html`.
- **What remains for Damir:**
  - HIGH: 0/18 users have `gdpr_consent_at` set; cookie banner on 2/7 pages; footer privacy link missing.
  - MEDIUM: Art 18/21 handled manually via email; no retention sweep; no 30-day SLA notifier.
  - LOW: avatar files on disk are not unlinked on erasure; policy versioning is hardcoded.
## Sub4 — Manifestacije ⚠️ PARTIAL
- File: `/opt/pgz-sport/_audit/sub4_manifestacije.md`
- **Status:** agent interrupted before completion; it processed 50/113 rows.
- **DB untouched:** the columns `web`, `wiki_url`, `enriched_at`, `enriched_confidence` do NOT exist; `apply.sql` was written but NOT run.
- **Quality review:** of the 5 proposed matches, **3 are wrong** (Čabar→Pakrac, Rijeka kup→Rijeka dubrovačka (a geographic feature), Delta kup→Delta Dunava). The confidence formula only counts content matches, with no geographic/category guard.
- **Orchestrator decision:** `apply.sql` REJECTED. Only Rally Opatija (id=5) could be applied manually.
- **What Damir needs to do:** ALTER TABLE to add the columns (safe), manually review kandidati.csv, re-run the script with an edit-distance + category guard.
## Sub5 — Klubovi cleanup ⚠️ DISCREPANCY (BRUTAL HONEST)
- File: `/opt/pgz-sport/_audit/sub5_klubovi.md`
- **The Sub5 report claims:** 13 sub5a-flagged + 49 KUD reclassified to 'lovstvo'.
- **DB reality:**
  - `WHERE napomena ILIKE '%sub5%' OR '%TODO_FIX%'` → **0 rows**
  - `WHERE sport='lovstvo'` → **0 rows**
  - `WHERE sport='kulturno-umjetnicko'` → **0 rows** (all of them had already disappeared earlier)
  - **Club 2635 "Ćirila Kosovela 3, 51 000 Rijeka"**: napomena = `(empty)`, NOT flagged
- **Contradiction:** the UPDATEs that Sub5 claims to have executed did not happen. Either the transaction was rolled back, or Sub5 generated SQL without a COMMIT, or it worked against a different schema/table, or its check ran against its own in-memory state without an actual `psql -c`.
- **The Sub5 file artifacts (sub5_klubovi/run_sub5.py, sub5_run.json) exist**, but the actual DB UPDATE result = 0.
- **What Damir needs to do:** manually review `sub5_klubovi/sub5_run.json` (it contains the proposed UPDATEs), decide whether to apply them, and add a COMMIT step to the script before re-running it.
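The missing COMMIT step can be guarded against with a small apply-then-verify helper. The sketch below is an assumption about how such a step might look, not the contents of `run_sub5.py`: `apply_and_verify` is a hypothetical name, and the demo runs against an in-memory SQLite database (which uses `?` placeholders) only so that it is runnable without a server; with psycopg2 the placeholders would be `%s`.

```python
import sqlite3


def apply_and_verify(conn, update_sql, params, verify_sql, expected_rows):
    """Run one UPDATE, commit it, then re-read the table to confirm it stuck.

    Rolls back and raises if the UPDATE touched an unexpected number of rows.
    """
    cur = conn.cursor()
    cur.execute(update_sql, params)
    if cur.rowcount != expected_rows:
        conn.rollback()
        raise RuntimeError(
            f"expected {expected_rows} rows, UPDATE touched {cur.rowcount}"
        )
    conn.commit()  # without this, the change is lost when the connection closes
    cur.execute(verify_sql)
    return cur.fetchone()[0]


# Demo with made-up rows in an in-memory database.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE klubovi (id INTEGER PRIMARY KEY, naziv TEXT, sport TEXT)")
demo.executemany(
    "INSERT INTO klubovi (naziv, sport) VALUES (?, ?)",
    [("KUD A", "kulturno-umjetnicko"), ("KUD B", "kulturno-umjetnicko"), ("NK C", "nogomet")],
)
demo.commit()

persisted = apply_and_verify(
    demo,
    "UPDATE klubovi SET sport = ? WHERE sport = ?",
    ("lovstvo", "kulturno-umjetnicko"),
    "SELECT count(*) FROM klubovi WHERE sport = 'lovstvo'",
    expected_rows=2,
)
```

The verification query runs after the commit, so a report built from its result reflects what the database actually holds rather than what the script intended to write.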
---
## Smoke tests (verification)
```
[smoke] ✅ API health 200
[smoke] ✅ top-primatelji 2025 count=13 (≥5)
[smoke] ✅ top-primatelji 2026 count=120 (≥50)
[smoke] ❌ HNK Goranin sport=skijanje (spec: should be nogomet; out of scope for sub5, related to b95b2e8)
[smoke] ✅ users.telefon column exists
[smoke] ⚠️ Kosovela club is not flagged (sub5 discrepancy)
[smoke] ✅ /static/oib_format.js HTTP 200
[smoke] ✅ /static/privacy.html HTTP 200
[smoke] ✅ POST /api/users/me/withdraw-consent HTTP 401 (endpoint exists, auth required)
```
**Note on HNK Goranin Delnice (id=782):** sport='skijanje' is an old database error (the football club has a skiing counterpart, id=191 "Skijaški klub Goranin Delnice"). Sub5 did not address this single-club fix. It needs an SQL update:
```sql
UPDATE pgz_sport.klubovi SET sport='nogomet' WHERE id=782;
```
---
## Coordination
- Heartbeat: updated several times (Redis `cc:pgz-sport:heartbeat`)
- Log: 5 pushes to `cc:pgz-sport:log` (start, sub1-5 done, sprint complete)
- Workers: no collisions with W6 (CC4 ERP), W7 (CC5 frontend), W8 (CC6 vector)
## Files modified (by commit)
- `31e0374`: Dashboard top primatelji (Sub1): pgz_sport_api.py, static/sport2.html
- `8e13635`: OIB role + login crisis (Sub2 + Damir): pgz_sport_api.py, 11 .html, /static/oib_format.js
- (uncommitted): Sub3: auth/gdpr.py + new static/privacy.html
- (rejected): Sub4: sub4_manifestacije_apply.sql
- (no-op): Sub5: claims an UPDATE of 62 rows, the DB shows 0
## Next steps for Damir
1. **Push HEAD to gitea/origin** (the orchestrator did not push, per hard rule).
2. **Manually review Sub5 sub5_run.json**; if the UPDATEs look OK, apply them by hand.
3. **HNK Goranin Delnice** SQL fix (above).
4. **Manifestacije:** ALTER TABLE + manually apply only `id=5 Rally Opatija`. Re-run the sub4 script with better matching later.
5. **GDPR backfill:** `UPDATE users SET gdpr_consent_at=created_at WHERE gdpr_consent_at IS NULL` (legacy users have implicit consent through registration), or an explicit re-prompt on the next login.
6. **Cookie banner:** include it in the footer of index/sport2/app/crm/erp.
@@ -0,0 +1,91 @@
# Sub4 — Manifestacije enrichment — REPORT
**Status:** PARTIAL: agent interrupted before completion, **changes were NOT applied to the DB**
**Date:** 2026-05-05
**Compiled by:** orchestrator (sub-agent #4 did not close out the report itself)
## Activity summary
The agent processed the first 50 of 113 rows before the process was interrupted (timeout / context). It generated:
| Artifact | Status |
|---|---|
| `sub4_enrich.py` | ✅ script functional (20885 B) |
| `sub4_manifestacije_apply.sql` | ✅ prepared, **NOT executed** |
| `sub4_manifestacije_kandidati.csv` | ✅ 5 rows |
| `sub4_manifestacije_kandidati.xlsx` | ✅ 5 rows |
| `sub4_manifestacije_stats.json` | ✅ |
| `sub4_manifestacije.log` | ✅ 16 KB |
## DB state (verified by orchestrator)
- Total: **113** rows in `pgz_sport.manifestacije`
- ima_web: **0**
- ima_wiki: **0**
- The columns `web`, `wiki_url`, `enriched_at`, `enriched_confidence` do **NOT exist** (the ALTER TABLE in apply.sql was not run)
## Counters (from stats.json)
| Metric | Value |
|---|---|
| probano (attempted) | 50 / 113 |
| succ_wiki_hr (direct slug) | 2 |
| succ_wiki_en | 0 |
| succ_search_hr (opensearch) | 3 |
| succ_search_en | 2 |
| applied (proposed, conf ≥ 0.85) | **3** |
| kandidati (candidates, conf 0.70-0.85) | **2** |
| zero_match | 45 |
## QUALITY REVIEW: brutal honest
I reviewed the 5 proposed matches. **3/5 are semantically wrong:**
| id | Name | Proposed URL | Verdict |
|---|---|---|---|
| 4 | Nagrada Grada **Čabra** | `Nagrada_Grada_Pakraca_(automobilizam)` | ❌ **Wrong town** (Čabar ≠ Pakrac). The 0.9 confidence is a hallucination: opensearch returned a similar title and the agent accepted it without a geo check. |
| 5 | Rally Opatija | `Rally_Opatija` | ✅ **OK**: direct slug, the 0.95 confidence is reasonable. |
| 23 | Sveti Vid | `Sveti_Vid` | ⚠️ **Suspicious**: the wiki article is about the saint/feast day, not a sports event. The specific regatta/race needs a manual check. |
| 30 | Rijeka kup | `Rijeka_dubrova%C4%8Dka` | ❌ **Geographic feature** (a river in Dubrovnik), not a sports cup. Confidence 0.75: a CANDIDATE, not an apply. |
| 31 | Delta kup | `Delta_Dunava` | ❌ **A river delta**, not a sports cup. CANDIDATE. |
Reason: the `confidence` formula in `sub4_enrich.py` relies on "matches=N" (the number of times the name appears in the first 50 KB of the article), which for short names ("Sveti Vid") produces false positives on unrelated Wikipedia pages. No geographic/onomastic check is implemented.
## DECISION (orchestrator)
**`apply.sql` will NOT be run.** 3/5 of the proposed matches are bad, so the signal-to-noise ratio is insufficient. A better option:
1. Run ALTER TABLE once to add the columns (web, wiki_url, enriched_at, enriched_confidence); this is safe to execute.
2. Apply only `Rally_Opatija` (id=5) manually after Damir's review.
3. Re-run sub4 with stricter matching:
   - Reject an opensearch result unless it is within edit-distance ≤ 3 of the original
   - Reject if the article category = "Geografija" / "Hrvatski sveci" / "Disambiguation"
   - Try DuckDuckGo + sport-pgz.hr for official event sites instead of relying exclusively on Wikipedia
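The stricter matching proposed above could look roughly like the sketch below. It is not the code in `sub4_enrich.py`: `levenshtein` and `accept_match` are hypothetical helpers, and the category lists passed in the usage examples are made up for illustration, since the report does not record which categories the articles actually carry.

```python
BLOCKED_CATEGORIES = {"Geografija", "Hrvatski sveci", "Disambiguation"}


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using a two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def accept_match(original: str, candidate_title: str, categories, max_distance: int = 3) -> bool:
    """Accept a candidate article only if both guards pass."""
    title = candidate_title.replace("_", " ").lower()
    if levenshtein(original.lower(), title) > max_distance:
        return False  # title drifted too far from the event name
    if BLOCKED_CATEGORIES & set(categories):
        return False  # article is about a place, a saint, or a disambiguation page
    return True


# Usage with illustrative (assumed) categories:
ok_rally = accept_match("Rally Opatija", "Rally_Opatija", ["Automobilizam"])
ok_delta = accept_match("Delta kup", "Delta_Dunava", ["Geografija"])
ok_vid = accept_match("Sveti Vid", "Sveti_Vid", ["Hrvatski sveci"])
```

The two guards cover different failure modes: the edit distance would reject "Delta kup" → `Delta_Dunava` on its own, while "Sveti Vid" → `Sveti_Vid` has distance 0 and is only caught by the category check.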
## What's left for Damir
1. **(optional, safe) ALTER TABLE pgz_sport.manifestacije:** add the columns; this can be run right away:
```sql
ALTER TABLE pgz_sport.manifestacije ADD COLUMN IF NOT EXISTS web TEXT;
ALTER TABLE pgz_sport.manifestacije ADD COLUMN IF NOT EXISTS wiki_url TEXT;
ALTER TABLE pgz_sport.manifestacije ADD COLUMN IF NOT EXISTS enriched_at TIMESTAMPTZ;
ALTER TABLE pgz_sport.manifestacije ADD COLUMN IF NOT EXISTS enriched_confidence REAL;
```
2. **Manual review** of the candidate list: `_audit/sub4_manifestacije_kandidati.csv`
3. **Apply only id=5 Rally Opatija** manually if you want it for the demo.
4. **Re-run** with the improved script; process all 113 rows, not just 50.
## Files
- `/opt/pgz-sport/_audit/sub4_enrich.py`: possibly problematic; needs an edit-distance + category guard
- `/opt/pgz-sport/_audit/sub4_manifestacije_apply.sql`: **DO NOT RUN** as is
- `/opt/pgz-sport/_audit/sub4_manifestacije_kandidati.csv|xlsx`: use for manual review
- `/opt/pgz-sport/_audit/sub4_manifestacije_stats.json`: counters
- `/opt/pgz-sport/_audit/sub4_manifestacije.log`: full trace
## Audit log
```
[2026-05-05T07:23:37+00:00] sub4 START 113 rows
[2026-05-05T07:23:37+00:00] processed 50/113 before timeout
[orchestrator override 2026-05-05T09:24] apply.sql REJECTED (3/5 matches semantically wrong)
```
@@ -1998,6 +1998,80 @@ except Exception as e:
    print(f'[DEBUG] router fail: {e}')

@app.get("/api/v2/clan/{clan_id}/hns-career")
def clan_hns_career(clan_id: int):
    """HNS career for an athlete: seasons + matches."""
    seasons = fetch("""
        SELECT sezona, klub_naziv, natjecanje, nastupi, startna, zamjena, golovi, asistencije, zuti, crveni, minute
        FROM pgz_sport.hns_player_seasons
        WHERE clan_id = %s
        ORDER BY sezona DESC
    """, (clan_id,))
    matches = fetch("""
        SELECT datum, natjecanje, domacin, gost, rezultat, pozicija, startna, golovi, asistencije, zuti, crveni
        FROM pgz_sport.hns_player_matches
        WHERE clan_id = %s
        ORDER BY datum DESC NULLS LAST
        LIMIT 50
    """, (clan_id,))
    # Stats roll-up across all seasons
    summary = fetch("""
        SELECT
            count(DISTINCT sezona) AS sezona_broj,
            sum(nastupi) AS ukupno_nastupi,
            sum(golovi) AS ukupno_golovi,
            sum(asistencije) AS ukupno_asistencije,
            sum(zuti) AS ukupno_zuti,
            sum(crveni) AS ukupno_crveni
        FROM pgz_sport.hns_player_seasons WHERE clan_id = %s
    """, (clan_id,))
    return {
        "clan_id": clan_id,
        "summary": summary[0] if summary else {},
        "seasons": seasons,
        "matches": matches,
        "total_seasons": len(seasons),
        # Counts the returned rows only, so it is capped at 50 by the LIMIT above.
        "total_matches": len(matches),
    }


@app.get("/api/v2/klubovi/pgz-financirani")
def klubovi_pgz_financirani(sport: str = None, limit: int = 500):
    """PGŽ-funded clubs: those that received money from grants."""
    # where_extra is a fixed string; the user-supplied value is passed as a parameter.
    where_extra = ""
    params = []
    if sport:
        where_extra = " WHERE sport = %s"
        params.append(sport)
    rows = fetch(f"""
        SELECT k.*,
            (SELECT count(*) FROM pgz_sport.clanovi WHERE klub_id = k.id) AS sportasa_count,
            (SELECT count(*) FROM pgz_sport.hns_klub_roster WHERE klub_id = k.id) AS hns_roster_count,
            (SELECT count(*) FROM pgz_sport.potpore_nositelji WHERE klub_id = k.id OR naziv_kluba ILIKE k.naziv) AS potpora_count,
            (SELECT sum(iznos) FROM pgz_sport.potpore_nositelji WHERE klub_id = k.id OR naziv_kluba ILIKE k.naziv) AS potpora_ukupno
        FROM pgz_sport.v_pgz_financirani_klubovi k
        {where_extra}
        ORDER BY potpora_ukupno DESC NULLS LAST
        LIMIT %s
    """, tuple(params) + (limit,))
    return {"count": len(rows), "rows": rows}


@app.get("/api/v2/dashboard/hns-coverage")
def dashboard_hns_coverage():
    """HNS Coverage widget data."""
    # The LIKE pattern is passed as a parameter so the SQL text contains no literal '%'.
    stats = fetch("""
        SELECT
            (SELECT count(*) FROM pgz_sport.v_pgz_financirani_klubovi WHERE sport='nogomet' AND source_url LIKE %s) AS klubova_target,
            (SELECT count(DISTINCT klub_id) FROM pgz_sport.hns_klub_roster) AS klubova_scraped,
            (SELECT count(*) FROM pgz_sport.clanovi WHERE hns_igrac_id IS NOT NULL) AS sportasa_s_hns,
            (SELECT count(*) FROM pgz_sport.hns_klub_roster) AS roster_total,
            (SELECT count(*) FROM pgz_sport.hns_player_seasons) AS seasons_total,
            (SELECT max(scraped_at) FROM pgz_sport.hns_klub_roster) AS last_sync
    """, ('%semafor.hns.family/klubovi%',))
    return stats[0] if stats else {}


@app.get("/")
def root(request: Request):
    host = request.headers.get("host", "")
@@ -427,7 +427,24 @@ def _research_links(naziv, kind, grad=None, sport: Optional[str] = None, row: Op
                        'icon': '🏟', 'url': url.replace('{q}', qenc)})
        if not fed:
            # No mapping for this sport → keep transfermarkt as legacy fallback
            # Prefer the direct /igraci/{id}/{slug} URL when hns_igrac_id exists.
            # `clan` may not be bound in this scope, so look it up defensively.
            try:
                _clan = clan
            except NameError:
                _clan = None
            hns_id = _clan.get('hns_igrac_id') if isinstance(_clan, dict) else None
            if hns_id:
                # Slugify first + last name: "Franko Andrijašević" → "franko-andrijasevic"
                _ime = _clan.get('ime', '') or ''
                _prez = _clan.get('prezime', '') or ''
                _slug = (_ime + ' ' + _prez).strip().lower()
                for old_c, new_c in [('č','c'),('ć','c'),('ž','z'),('š','s'),('đ','d'),(' ','-')]:
                    _slug = _slug.replace(old_c, new_c)
                _slug = re.sub(r'[^a-z0-9-]', '', _slug)
                out.append({'label': 'HNS Semafor (profil)', 'icon': '', 'url': f'https://semafor.hns.family/igraci/{hns_id}/{_slug}/'})
            else:
                out.append({'label': 'HNS Semafor (pretraga)', 'icon': '', 'url': 'https://semafor.hns.family/?s=' + qenc})
            out.append({'label': 'transfermarkt','icon': '', 'url': 'https://www.transfermarkt.com/schnellsuche/ergebnis/schnellsuche?query=' + qenc})
    # Local PGŽ media for any sportas
    _, _, media = _load_sport_feds()
@@ -0,0 +1,349 @@
#!/usr/bin/env python3
"""
HNS Master Harvester: Playwright-based scrape of semafor.hns.family
─────────────────────────────────────────────────────────────────
1. List PGŽ-funded football clubs
2. For each club: scrape the club roster
3. For each player: scrape the full profile (seasons, matches)
4. UPSERT into pgz_sport: hns_klub_roster, hns_player_seasons, hns_player_matches, clanovi
5. Audit log

Usage: python3 hns_master_harvester.py [--limit N] [--klub-id X] [--single-player HNS_ID]
"""
import os, time, json, re, argparse, subprocess, traceback
from datetime import datetime

import psycopg2
from psycopg2.extras import RealDictCursor, execute_values
from playwright.sync_api import sync_playwright

# Credentials come from the environment only; never commit them to the repository.
DSN = os.environ["RINET_DSN"]
TG = os.getenv("TG_BOT_TOKEN", "")
TG_CHAT = os.getenv("TG_CHAT", "")
LOG = open(f"/var/log/pgz-sport-debug/hns_harvester_{datetime.now().strftime('%Y%m%d_%H%M')}.log", "a")


def log(msg, telegram=False):
    """Write a timestamped line to stdout and the log file; optionally notify Telegram."""
    line = f"[{datetime.now().isoformat(timespec='seconds')}] {msg}"
    print(line, flush=True)
    LOG.write(line + "\n"); LOG.flush()
    if telegram and TG and TG_CHAT:
        try:
            subprocess.run(["curl", "-s", "-X", "POST",
                            f"https://api.telegram.org/bot{TG}/sendMessage",
                            "-d", f"chat_id={TG_CHAT}",
                            "--data-urlencode", f"text={msg[:2000]}"],
                           timeout=8, capture_output=True)
        except Exception:
            pass  # a failed notification must not stop the harvest


def db_conn():
    # autocommit: every statement is persisted immediately, no explicit COMMIT needed
    c = psycopg2.connect(DSN); c.autocommit = True; return c


# ── HNS slug: "Franko Andrijašević" → "franko-andrijasevic" ──
def slugify_hns(text):
    if not text: return ""
    # lower() already folds Č/Ć/Ž/Š/Đ to their lowercase forms
    t = text.lower().strip()
    t = t.replace('č','c').replace('ć','c').replace('ž','z').replace('š','s').replace('đ','d')
    t = re.sub(r'[^a-z0-9\s-]', '', t)
    t = re.sub(r'\s+', '-', t).strip('-')
    return t


def scrape_player(page, hns_id, slug):
    """Scrape a player profile plus seasons (match scraping is not implemented yet)."""
    url = f"https://semafor.hns.family/igraci/{hns_id}/{slug}/"
    try:
        page.goto(url, wait_until="networkidle", timeout=30000)
    except Exception as e:
        log(f"  ❌ Goto fail {url}: {e}")
        return None
    h1 = page.locator('h1').first.inner_text() if page.locator('h1').count() else ''
    # Current club link (first /klubovi/ link on the page)
    current_klub = None
    klub_links = page.locator('a[href*="/klubovi/"]').all()
    if klub_links:
        href = klub_links[0].get_attribute('href') or ''
        m = re.search(r'/klubovi/(\d+)/([\w-]+)/', href)
        if m:
            current_klub = {'hns_id': m.group(1), 'slug': m.group(2), 'naziv': klub_links[0].inner_text().strip()}
    seasons_data = []
    matches_data = []  # never filled; kept so the return shape stays stable
    # Tables may be rendered dynamically, so wait for them before reading the page
    try:
        page.wait_for_selector('table, .karijera, .sezona, [class*="season"]', timeout=8000)
    except Exception:
        pass
    time.sleep(1)
    # Grab the full body text after the wait
    body_text = page.locator('body').inner_text()
    # Parse the career section from plain text: "Sezona | Klub | Natjecanje | Nastupi | Golovi"
    # Pattern: 2024/25 ... HNK Orijent ... 3.HNL ... 14 ... 2
    season_blocks = re.findall(r'(20\d{2}/\d{2})\s+([\w\s\u017c-\u017e\u0107\u010d\u0161\u017d\u0110\.\-]+?)\s+([\d\.\s]+)(?=20\d{2}/\d{2}|$)', body_text)
    for sezona, klub_text, stats_text in season_blocks:
        nums = re.findall(r'\d+', stats_text)
        if nums:
            seasons_data.append({
                'sezona': sezona,
                'klub': klub_text.strip()[:200],
                'nastupi': int(nums[0]),
                'golovi': int(nums[1]) if len(nums) > 1 else 0,
            })
    # Parse HTML tables as well; this can add rows for seasons the regex already found
    tables = page.locator('table').all()
    for t in tables:
        rows = t.locator('tr').all()
        if len(rows) < 2: continue
        # First row is treated as the header
        header = [c.inner_text().strip() for c in rows[0].locator('th, td').all()]
        for r in rows[1:]:
            cells = [c.inner_text().strip() for c in r.locator('th, td').all()]
            if not cells: continue
            row_dict = dict(zip(header, cells))
            # A row counts as a season row if any cell looks like "2024/25"
            sezona = next((v for k, v in row_dict.items() if re.match(r'\d{4}/\d{2}', v)), None)
            if sezona:
                seasons_data.append({**row_dict, 'sezona': sezona})
    return {
        'hns_id': hns_id,
        'slug': slug,
        'naziv': h1,
        'url': url,
        'current_klub': current_klub,
        'sezone_count': len(seasons_data),
        'seasons': seasons_data,
        'matches': matches_data,
        'body_text_len': len(body_text),
    }


def scrape_klub_roster(page, klub_hns_id, klub_slug):
    """Scrape a club roster: all players currently at the club."""
    url = f"https://semafor.hns.family/klubovi/{klub_hns_id}/{klub_slug}/"
    try:
        page.goto(url, wait_until="networkidle", timeout=30000)
    except Exception as e:
        log(f"  ❌ Goto fail {url}: {e}")
        return []
    # Collect every link that points at a player profile, de-duplicated by HNS id
    players = []
    player_links = page.locator('a[href*="/igraci/"]').all()
    seen_ids = set()
    for a in player_links:
        href = a.get_attribute('href') or ''
        m = re.search(r'/igraci/(\d+)/([\w-]+)', href)
        if m:
            hns_id = m.group(1)
            if hns_id in seen_ids: continue
            seen_ids.add(hns_id)
            players.append({
                'hns_id': hns_id,
                'slug': m.group(2),
                'naziv': a.inner_text().strip(),
                'url': f"https://semafor.hns.family{href}" if href.startswith('/') else href
            })
    return players


def upsert_clan(conn, klub_id, player_data):
    """Upsert a member (clan) from HNS profile data."""
    # Name split: "FrankoAndrijašević" → first name / last name
    naziv = re.sub(r'\s+', ' ', player_data.get('naziv', '')).strip()
    # The h1 may come joined without spaces, so split on capitalised words
    parts = re.findall(r'[A-ZČĆŠŽĐ][a-zčćšžđ\']+', naziv)
    if len(parts) >= 2:
        ime = parts[0]
        prezime = ' '.join(parts[1:])
    else:
        ime = naziv
        prezime = ''
    hns_id = player_data['hns_id']
    url = player_data['url']
    with conn.cursor() as cur:
        # Try to find an existing member by HNS id
        cur.execute("""
            SELECT id FROM pgz_sport.clanovi
            WHERE hns_igrac_id = %s
            ORDER BY id LIMIT 1
        """, (hns_id,))
        row = cur.fetchone()
        if row:
            clan_id = row[0]
            # Existing non-empty values win; scraped values only fill gaps
            cur.execute("""
                UPDATE pgz_sport.clanovi
                SET ime = COALESCE(NULLIF(ime,''), %s),
                    prezime = COALESCE(NULLIF(prezime,''), %s),
                    klub_id = COALESCE(klub_id, %s),
                    hns_igrac_id = %s,
                    source = 'hns_semafor',
                    source_url = %s,
                    last_updated = now(),
                    last_scraped_at = now(),
                    sport = COALESCE(sport, 'nogomet')
                WHERE id = %s
            """, (ime, prezime, klub_id, hns_id, url, clan_id))
        else:
            cur.execute("""
                INSERT INTO pgz_sport.clanovi
                    (klub_id, ime, prezime, sport, source, source_url, hns_igrac_id, last_scraped_at, aktivan)
                VALUES (%s, %s, %s, 'nogomet', 'hns_semafor', %s, %s, now(), true)
                RETURNING id
            """, (klub_id, ime, prezime, url, hns_id))
            clan_id = cur.fetchone()[0]
    return clan_id


def upsert_seasons(conn, hns_id, clan_id, seasons):
    if not seasons: return 0
    rows = []
    for s in seasons:
        sezona = s.get('sezona', '')
        if not sezona: continue
        # Try to extract the club and competition from whatever column names the row has
        klub = next((v for k, v in s.items() if 'lub' in k.lower()), '')
        natjecanje = next((v for k, v in s.items() if 'atjec' in k.lower() or 'liga' in k.lower()), '')
        def num(key):
            for k in s.keys():
                if key in k.lower():
                    # str() because regex-derived rows already hold ints, table rows hold text
                    try: return int(re.sub(r'\D', '', str(s[k])) or 0)
                    except (TypeError, ValueError): return 0
            return 0
        rows.append((
            hns_id, clan_id, sezona, None, klub, natjecanje,
            num('nastup'), num('start'), num('zamj'),
            num('gol'), num('asist'), num('žut'), num('crv'), num('minut')
        ))
    # NOTE: klub_hns_id is always NULL here. In a standard unique constraint NULLs never
    # conflict (unless it is declared NULLS NOT DISTINCT), so re-runs may insert duplicates.
    with conn.cursor() as cur:
        execute_values(cur, """
            INSERT INTO pgz_sport.hns_player_seasons
                (hns_igrac_id, clan_id, sezona, klub_hns_id, klub_naziv, natjecanje,
                 nastupi, startna, zamjena, golovi, asistencije, zuti, crveni, minute)
            VALUES %s
            ON CONFLICT (hns_igrac_id, sezona, klub_hns_id, natjecanje)
            DO UPDATE SET
                nastupi = EXCLUDED.nastupi, startna = EXCLUDED.startna,
                zamjena = EXCLUDED.zamjena, golovi = EXCLUDED.golovi,
                asistencije = EXCLUDED.asistencije, zuti = EXCLUDED.zuti,
                crveni = EXCLUDED.crveni, minute = EXCLUDED.minute,
                scraped_at = now()
        """, rows)
    return len(rows)


def upsert_klub_roster(conn, klub_id, klub_hns_id, players):
    if not players: return 0
    # First word of the link text is taken as the first name, the rest as the last name
    rows = [(klub_id, klub_hns_id, p['hns_id'],
             p.get('naziv', '').split()[0] if p.get('naziv') else '',
             ' '.join(p.get('naziv', '').split()[1:]) if p.get('naziv') else '',
             p.get('pozicija', ''), p.get('url', ''))
            for p in players]
    with conn.cursor() as cur:
        execute_values(cur, """
            INSERT INTO pgz_sport.hns_klub_roster
                (klub_id, klub_hns_id, hns_igrac_id, ime, prezime, pozicija, source_url)
            VALUES %s
            ON CONFLICT (klub_hns_id, hns_igrac_id)
            DO UPDATE SET klub_id = EXCLUDED.klub_id, scraped_at = now()
        """, rows)
    return len(rows)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('--limit', type=int, default=999)
    ap.add_argument('--klub-id', type=int, default=None)
    ap.add_argument('--single-player', help='HNS ID of single player to scrape')
    args = ap.parse_args()
    conn = db_conn()
    # Target clubs: PGŽ-funded football clubs that have an HNS source URL
    if args.single_player:
        klubovi = []
    else:
        with conn.cursor(cursor_factory=RealDictCursor) as cur:
            if args.klub_id:
                cur.execute("SELECT * FROM pgz_sport.klubovi WHERE id = %s", (args.klub_id,))
            else:
                cur.execute("""
                    SELECT * FROM pgz_sport.v_pgz_financirani_klubovi
                    WHERE sport = 'nogomet' AND source_url LIKE %s
                    ORDER BY id LIMIT %s
                """, ('%semafor.hns.family/klubovi%', args.limit))
            klubovi = cur.fetchall()
    log(f"🚀 HNS Harvester starting. Target klubova: {len(klubovi)}", telegram=True)
    stats = {'klubova': 0, 'players_scraped': 0, 'seasons_upserted': 0, 'errors': 0}
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True, args=["--no-sandbox", "--ignore-certificate-errors"])
        ctx = browser.new_context(
            ignore_https_errors=True,
            user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        )
        page = ctx.new_page()
        if args.single_player:
            # Test mode: scrape one profile, print it, write nothing to the DB
            log(f"🔬 Single player mode: {args.single_player}")
            data = scrape_player(page, args.single_player, 'unknown')
            log(f"  Data: {json.dumps(data, default=str, ensure_ascii=False)[:500]}")
            browser.close()
            return
        for klub in klubovi:
            try:
                src = klub.get('source_url', '') or ''
                m = re.search(r'/klubovi/(\d+)/([^/]*)', src)
                if not m:
                    log(f"  ⏭ Klub {klub['id']} {klub['naziv']} - no HNS URL")
                    continue
                khns, kslug = m.group(1), m.group(2) or 'klub'
                log(f"\n🏟 Klub {klub['id']} {klub['naziv']} → HNS {khns}/{kslug}")
                roster = scrape_klub_roster(page, khns, kslug)
                log(f"  Roster: {len(roster)} igrača")
                if roster:
                    upsert_klub_roster(conn, klub['id'], khns, roster)
                # Each player
                for p in roster[:30]:  # safety: max 30 per klub for now
                    try:
                        time.sleep(0.5)  # small delay between profile requests
                        pdata = scrape_player(page, p['hns_id'], p['slug'])
                        if pdata:
                            clan_id = upsert_clan(conn, klub['id'], pdata)
                            n_seas = upsert_seasons(conn, pdata['hns_id'], clan_id, pdata.get('seasons', []))
                            stats['players_scraped'] += 1
                            stats['seasons_upserted'] += n_seas
                            log(f"{pdata['naziv']} (clan_id={clan_id}, seasons={n_seas})")
                    except Exception as e:
                        stats['errors'] += 1
                        log(f"    ❌ Player {p['hns_id']}: {e}")
                stats['klubova'] += 1
            except Exception as e:
                stats['errors'] += 1
                log(f"  ❌ Klub {klub['id']}: {e}\n{traceback.format_exc()[:500]}")
        browser.close()
    summary = f"✅ HNS Harvester done. Klubova: {stats['klubova']}, Players: {stats['players_scraped']}, Seasons: {stats['seasons_upserted']}, Errors: {stats['errors']}"
    log(summary, telegram=True)


if __name__ == '__main__':
    main()