31 KiB
HANDOFF v2 — Sport Scraping Pipeline za PGZ
Datum: 03.05.2026 14:25 | Verzija: 9.0 (ULTRA-DETAILED) | Autor: Damir + Claude
0. TL;DR za novu sesiju
Damir gradi Palantir za PGZ sport (kasnije cijela Hrvatska):
- 525+ klubova iz svih saveza (HVS, HNS, HRS, HKS, HOS, HBS, HSS, HJS, HJK, HAS, HKaratS, HTS, HSTS, HSA, HOO)
- Sve osobe + sve uloge (igrac, predsjednik, tajnik, trener, fizioterapeut, lijecnik, team manager, sudac)
- Identifikator za svakog (OIB ili
external_id) - Inteligentno fuzzy dedup (BEZ 5x istog kluba pod razlicitim imenom)
- Postojeca arhitektura, NE uvodi novo
- BRUTALNO EFIKASAN, bez podataka su nista
3-strike rule active. Damir hvata lazi -> manual mode. Vec smo iskoristili 2 strike-a.
1. STACK STATE (verified 14:25)
Aktivni servisi
| Servis | Port | Process | Status |
|---|---|---|---|
| vLLM Qwen 2.5-7B AWQ | 8001 | vllm.entrypoints.openai.api_server (PID 3642939) | LIVE, GPU 40% util |
| BGE-M3 Embedder | 9879 | /opt/rinet-gpu/embed_service.py (PID 3742571) | LIVE, 1024-dim |
| Qdrant | 6333 | docker-proxy | LIVE, 30+ collections |
| PostgreSQL | 5432 | postgres | LIVE |
| PgBouncer | 6432 | pgbouncer | LIVE |
| Ollama | 11434 | ollama (PID 3643099) | LIVE ali $HOME panic |
| F10 LoRA | 8765 | python3 (PID 2383543) | LIVE ali endpoint 404 - BROKEN |
GPU stanje (vazno!)
- 18209 / 20475 MiB (89% pun)
- vLLM koristi gpu-memory-utilization 0.40 (~8 GB)
- Embedder koristi ostalo (~10 GB)
- NE pokrecati 2. LLM ili novi embed batch - FILE LOCK!
Korisni endpoint primjeri (TESTIRANI):
vLLM:
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d \'{"model":"Qwen/Qwen2.5-7B-Instruct-AWQ","messages":[{"role":"user","content":"OK"}],"max_tokens":5}\'
# Response: {"choices":[{"message":{"content":"OK"}}]}
Embedder (input MORA biti list!):
curl -X POST http://localhost:9879/api/embeddings \
-H "Content-Type: application/json" \
-d \'{"model":"bge-m3","input":["VK Primorje"]}\'
# Response: {"embedding":[-0.0175,0.0243,...]} # 1024-dim
Ako embedder ne odgovara:
systemctl restart rinet-embedder # ili sl. service name
2. CREDENTIALS / ENV
# Bridge API (svaki server-side komand)
BRIDGE_URL=https://api.rinet.one/bridge/exec
BRIDGE_KEY=rinet-yS4ZnKlwUqsjk
# Body: {"cmd":"..."} param je `cmd` ne `command`!
# DB
DB_HOST=localhost
DB_PORT=6432 # PgBouncer
DB_DIRECT=5432
DB_NAME=rinet_v3
DB_USER=rinet
DB_PASS=R1net2026!SecureDB#v7
# Cloud LLM (fallback ako vLLM zakaže)
GROQ_API_KEY=SET (u /opt/rinet-gpu/.env.master)
DEEPSEEK_API_KEY=SET
GROQ_MODEL=llama-3.3-70b-versatile # free tier
DEEPSEEK_MODEL=deepseek-chat # $0.14/M
# Sudreg OAuth
SUDREG_CLIENT_ID=SET (u .env.master)
SUDREG_CLIENT_SECRET=SET
TOKEN_URL=https://sudreg-data.gov.hr/api/oauth/token
GRANT_TYPE=client_credentials
# HVS rezultati JWT (produces 795 PGZ vaterpolo persons!)
HVS_TOKEN=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJIVlMiLCJpYXQiOjE1Njg3MjI1MzEsImV4cCI6MTY1NTM0NTIyMywiYXVkIjoiSFZTIiwic3ViIjoiSFZTIiwiR2l2ZW5OYW1lIjoiSFZTIn0.iteM3Hewl0ugQVEDqPdg_7hHwRTxnSeFVg59vPH25uU
# Telegram bot
TG_BOT=8535797835:AAFItT-92jzZ9NWFafLxn0dLa1_n2s-JE5Y
TG_CHAT=7969491558
3. HVS API BREAKTHROUGH (replicirat za druge saveze!)
Discovery
Damir je dao primjere https://hvs.hr/igrac/6893/ (Matija Curic) i /igrac/43303/. Network sniff preko Playwright otkrio internal API.
API spec
POST https://rezultati.hvs.hr/api/?rubrika={rubrika}&id={id}
Headers:
Content-Type: application/x-www-form-urlencoded
Origin: https://hvs.hr
Referer: https://hvs.hr/
User-Agent: Mozilla/5.0 ...
Body: token=<HVS_TOKEN>
Rubrike (testirane):
| Rubrika | Output | Sample |
|---|---|---|
person&id={pid} |
osoba detail + suigraci | {data:{fname,lname,club,date,position,games,goals,saves,height,weight}, players3:{...}} |
team&id={tid} |
tim info + utakmice + statistika | {info:{name,gender,category,logo}, utakmice:[10], stat:{A,B}} |
persongames&limit=N&id={pid} |
utakmice osobe | (testirat) |
sticky&id={id_or_keyword} |
highlights / news | (testirat) |
klubovi, clanovi, momcadi, roster, sastav, igraci |
-> [] empty |
NE rade |
Performans
50K player ID-eva u ~50 sekundi (parallel=20). 795 PGZ vaterpolo osoba unesene u DB.
Skripta: /tmp/hvs_mass.py (saved on server).
Replikacija pattern za druge saveze
rezultati.hbs.hr(boćanje) - PROBE!rezultati.hks-cbf.hr(košarka) - PROBE!rezultati.hns.hrilirezultati.hns-cff.hr(nogomet) - PROBE!rezultati.hjs.hr(jedrenje) - PROBE!rezultati.has.hr(atletika) - PROBE!rezultati.hrs.hr-> Zoraxy gateway, skip
Method: Playwright -> sync -> page.goto klub stranica -> on_request capture -> filter rezultati.X.hr -> extract token from POST body.
4. DAMIROV SEED LIST (zlato vrijedno!)
4.1 Krovni / zupanijski savezi
Zajednica sportova PGZ: https://sport-pgz.hr/ (scraped: 18 godisnjaka)
Sportska zajednica Rijeke: http://rss.hr/
Nogometni savez PGZ: https://www.nspgz.hr/ (TODO: registar PGZ klubova)
Sahovski savez PGZ: https://www.sah-pgz.hr/ KLUB LIST: /klubovi/
Kosarkaski savez PGZ: https://kspgz.hr/ (NEW iz CSV)
Rukometni savez PGZ: https://rspgz.hr/ (NEW iz CSV)
Bocarski savez PGZ: https://bspgz.hr/ (NEW iz CSV)
Ski savez PGZ: https://ski-pgz.hr/ (NEW iz CSV)
Pikado savez PGZ: http://www.pikado-pgz.hr
Parasport Rijeka: https://www.ssoi-rijeka.hr/
Parasport PGZ: https://ssoi-pgz.hr/
Skolski sport PGZ: http://www.savezssd-pgz.hr/
4.2 Nacionalni savezi
HOO: https://www.hoo.hr/ (1447 docs scraped)
HNS: https://hns.family/ (registar /hr/sportski-djelatnici/)
HRS: https://hrs.hr/ (api.hrs.hr blokiran Zoraxy)
HKS-CBF: https://www.hks-cbf.hr/ ili https://hks-cbf.hr/ (natjecanja.hks-cbf.hr live)
HVS: https://hvs.hr/ (DONE 795 osoba)
HOS-CVF: https://hos-cvf.hr/ (odbojka)
HBS-bocanje: https://hrvatski-bocarski-savez.hr/ (CSV ima drugaciji od mojeg `hbs.hr`!)
HSS: https://hrvatski-sahovski-savez.hr/ (sah)
HSSRM: https://www.hssrm.hr/ (sport ribolov na moru)
4.3 PGZ klubovi s URL-om (po sportu)
Nogomet
HNK Rijeka: https://nk-rijeka.hr/
HNK Orijent: https://nk-orijent.hr/ (DB ima drugi URL!)
NK Krk: https://nkkrk.hr/
NK Opatija: https://nkopatija.hr/ (DB ima Wiki URL - FIX!)
NK Grobnican: https://nk-grobnican.hr/
NK Pomorac: https://nk-pomorac.hr/
NK Naprijed: https://nk-naprijed.hr/
NK Turbina Tribalj: http://www.nk-turbina-tribalj.hr/
NK Klana: https://nk-klana.hr/
NK Mune: https://nk-mune.hr/ (DB ima Wiki - FIX!)
NK Crikvenica: https://nk-crikvenica.hr/ (DB ima semafor - FIX!)
Rukomet
RK Kozala: https://rk-kozala.hr/
RK Zamet: https://rk-zamet.hr/
ZRK Zamet: https://zrk-zamet.hr/
RK Viskovo: https://rk-viskovo.hr/
RK Omisalj: https://rk-omisalj.hr/
RK Murvica: https://rk-murvica.hr/
Kosarka
KK Kvarner 2010: https://kkkvarner2010.hr/ (PAZI: NIJE AK Kvarner!)
KK Skrljevo: https://kk-skrljevo.hr/
KK Ri-Basket: https://ri-basket.hr/
KK Kastav: https://kkkastav.hr/ (ima i nk-kastav.hr crime!)
KK Kraljevica: https://kk-kraljevica.hr/
Vaterpolo
VK Primorje EB: https://vaterpolo-primorje.hr/
SD Primorje 08: http://www.primorje08.hr
Plivanje
PK Primorje: https://pk-primorje.hr/
Atletika
AK Kvarner: https://akkvarner.hr/
Odbojka
HAOK Rijeka: https://haok-rijeka.hr/
Sah
LIST URL: https://www.sah-pgz.hr/klubovi/ <-- SCRAPE OVO!
Najveci: SK Rijeka, SK Kvarner, SK Crikvenica, SK Losinj, SK Kastav, SK Krk, SK Viskovo
4.4 Lige / natjecanja
SuperSport HNL: https://hnl.hr/
SuperSport Prva NL: https://hns.family/natjecanja/prva-nl/
Druga NL: https://hns.family/natjecanja/druga-nl/
Treca NL Zapad: https://hns.family/
Favbet Premijer (kos.): https://premijerliga.hks-cbf.hr/
Prva muska liga (kos.): https://hks-cbf.hr/
Paket24 Premijer (ruk.): https://hrs.hr/natjecanja/
1. HRL (rukomet): https://hrs.hr/
Prvenstvo HR vaterpolo: https://hvs.hr/
Hrvatska sahovska liga: https://hrvatski-sahovski-savez.hr/
4.5 Statistika / rezultati portali
Rezultati.com: https://www.rezultati.com/ # svi sportovi
SofaScore: https://www.sofascore.com/hr/ # detalji + ocjene
SportCom: https://www.sportcom.hr/ # PGZ lokalni
HNS Semafor: https://semafor.hns.family/ # nogomet zapisnici!
Transfermarkt: https://www.transfermarkt.com/ # transfers
Sportalo: https://www.sportalo.hr/ # nize lige
HKS Natjecanja: https://natjecanja.hks-cbf.hr/ # kosarka rezultati
Eurobasket: https://www.eurobasket.com/ # baza kosarkasa
HRS Natjecanja: https://hrs.hr/natjecanja/ # rukomet
HVS Natjecanja: https://hvs.hr/natjecanja/ # vaterpolo
5. CURRENT DB STATE (verified)
pgz_sport.klubovi (525 PGZ aktivnih)
503 BEZ web URL-a <-- glavna prepreka!
19 Wikipedia URL <-- treba prebaci na pravi
11 pravi web URL OK
+9 updated u ovoj sesiji
pgz_sport.clanovi (1808 PGZ ukupno)
527 vaterpolo igrac (HVS API)
405 bocanje igrac (HBS scrape, fixed kategorija)
745 nogomet+ostali (mass enrich, kategorija ili igrac ili NULL)
6 uprava+stozer
pgz_sport.savezi (15+ aktivnih)
Vaterpolski savez PGZ id=28 (merged od 34) | Damir Glavan
HVS id=? | Mladen Drnasin
HBS, HKS, HRS, HOO, HNS, HSS, HSSRM (insert iz seed)
pgz_sport.dokumenti (3379)
1447 hoo (Damirov scraper)
288 pgz_sport
260 rss_hr
294 savez_hbs
106 savez_hks
73 pravilnik
+18 godisnjaci sport-pgz.hr (NEW - 2006-2022, 9M znakova ukupno)
Schema dodaci u OVOJ sesiji (ALREADY DONE)
ALTER TABLE pgz_sport.clanovi
ADD COLUMN external_id TEXT, -- "hvs:igrac:6893"
ADD COLUMN savez_izvor TEXT, -- "HVS","HBS","godisnjak","klub_web"
ADD COLUMN profile_url TEXT,
ADD COLUMN uloga_detalj TEXT, -- "trener vrata", "lijevo krilo"
ADD COLUMN metadata JSONB;
CREATE INDEX idx_clanovi_external_id ON clanovi(external_id);
CREATE INDEX idx_clanovi_savez_izvor ON clanovi(savez_izvor);
ALTER TABLE pgz_sport.clanovi
ADD CONSTRAINT uq_clanovi_klub_profile UNIQUE (klub_id, profile_url);
CREATE TABLE pgz_sport.uloga_katalog (
id SERIAL PRIMARY KEY,
kod TEXT UNIQUE,
naziv TEXT,
grupa TEXT, -- uprava, clanstvo, sportasi, strucni_stozer, organizacijski,
-- medicinski_stozer, tehnicki_stozer, sudac_kvalifikacija, medijski, ostalo
redoslijed INTEGER
);
-- Insertano 49 uloga: predsjednik, dopredsjednik, tajnik, blagajnik, clan_uprave, clan_nadzornog_odbora,
-- clan_skupstine, direktor, sportski_direktor, clan, pocasni_clan, osnivac, igrac, sportas, reprezentativac,
-- mladi_sportas, veteran, trener, glavni_trener, pomocni_trener, trener_vratara, kondicioni_trener,
-- selektor, izbornik, analiticar, skaut, team_manager, voditelj, koordinator, lijecnik, fizioterapeut,
-- kineziolog, nutricionist, psiholog, maser, tehnicar, ekonom, vozac, cuvar, sudac, zapisnicar,
-- mjeritelj, delegate, komisar, press_officer, foto, voditelj_marketinga, volonter, fan_club
6. SCHEMA FIX-OVI ZA NOVU SESIJU (HITNO)
-- ALTER da seed_pgz.py prolazi
ALTER TABLE pgz_sport.klubovi ADD COLUMN IF NOT EXISTS izvor_unosa TEXT;
ALTER TABLE pgz_sport.natjecanja ADD COLUMN IF NOT EXISTS razina_natjecanja TEXT;
ALTER TABLE pgz_sport.natjecanja ADD COLUMN IF NOT EXISTS web TEXT;
-- Crime list - krivi URL-ovi za fix:
UPDATE pgz_sport.klubovi SET web=\'https://nkopatija.hr/\' WHERE id=3840 AND web LIKE \'%wikipedia%\';
UPDATE pgz_sport.klubovi SET web=\'https://nk-mune.hr/\' WHERE id=2201 AND web LIKE \'%wikipedia%\';
UPDATE pgz_sport.klubovi SET web=\'https://nk-crikvenica.hr/\' WHERE id=2421 AND web LIKE \'%semafor%\';
UPDATE pgz_sport.klubovi SET web=\'https://akkvarner.hr/\' WHERE id=3746 AND naziv=\'AK Kvarner\';
-- Insertat 27 klubova iz seed list (provjeri prije svakog INSERT je li vec postoji s OIB ili fuzzy naziv)
7. ENTITY DEDUP STRATEGIJA (Damirov hard-blocking issue!)
Problem: u DB ima VK Primorje, VK Primorja (typo), VK Primorje EB, Vaterpolski klub Primorje Erste Bank - sve isti klub.
Multi-step pipeline:
# Step 1: Naive normalizacija
def normalize(naziv):
s = unicodedata.normalize(\'NFKD\', naziv).encode(\'ascii\',\'ignore\').decode().lower()
s = re.sub(r\'[^a-z0-9 ]+\', \' \', s)
s = re.sub(r\'\\b(klub|sportski|sportsko|udruga|drustvo|sd|hnk|nk|vk|kk|rk|ak|jk|tk|stk|bk|sk|hkk|mnk)\\b\', \'\', s)
return \' \'.join(s.split())
# Step 2: RapidFuzz (pip install rapidfuzz)
from rapidfuzz import fuzz
score = fuzz.token_set_ratio(norm_a, norm_b) # > 85 = candidate
# Step 3: BGE-M3 cosine similarity
import requests, numpy as np
def embed(texts):
r = requests.post(\'http://localhost:9879/api/embeddings\',
json={\'model\':\'bge-m3\', \'input\':texts}).json()
return np.array(r[\'embedding\']) if isinstance(r.get(\'embedding\'), list) and not isinstance(r[\'embedding\'][0], list) else np.array(r[\'embeddings\'])
# threshold cosine sim > 0.92
# Step 4: vLLM yes/no adjudication
prompt = f\'Jesu li ovi nazivi isti klub? A: {a}, B: {b}. Odgovor samo "DA" ili "NE".\'
# Step 5: Cluster i merge
# Master = onaj sa najpopunjenim poljima (web, OIB, predsjednik)
# UPDATE clanovi SET klub_id=master WHERE klub_id IN (others);
# Backup table prije merge: CREATE TABLE klubovi_predeup_<datum> AS SELECT * FROM klubovi;
# DELETE FROM klubovi WHERE id IN (others);
NIKAD auto-merge bez Damir review!
Damir ima 5 backup tabela vec:
klubovi_premerge_20260503,_b,_cklubovi_dedup_20260502,_v2_*,_v3_*
8. DEAD ENDS (proven, NE re-pokretat!)
| Sto | Zasto | Kada |
|---|---|---|
| DuckDuckGo HTML search | 0 rezultata, IP blokiran | 14:50 today |
Sudreg po tvrtke endpointu |
100K subjekata, samo 3 sport | 14:42 |
| HRS api.hrs.hr | Zoraxy reverse proxy, no public endpoint | 14:30 |
| HKS hks.hr | Base URL ne odgovara (try hks-cbf.hr) | 14:36 |
| Vecina saveze /klubovi/ /registar/ | 404 | 14:20 |
WordPress REST /wp-json/wp/v2/{cpt} na ne-HVS savezima |
Samo Jetpack/Akismet, no CPT | 13:45 |
| async_playwright | "anchor not found or already added" | razne sesije |
Sto JE radilo:
- HVS rezultati API JWT token (brutalno brz)
- HOO Playwright (Damirov scraper, 478 PDF, 1447 docs)
- HBS direktan scrape - svi klubovi + 405 bocara
- sport-pgz.hr Playwright crawl (18 godisnjak PDF)
- LLM klub-web enrich preko
/api/v2/enrich/klub-web - Sudreg /sjedista, /pravni_oblici, /tvrtke (s OAuth Bearer)
9. NEXT SESSION TODO (priority order)
A) HITNO - Schema fix-ovi (1 min)
ALTER TABLE pgz_sport.klubovi ADD COLUMN IF NOT EXISTS izvor_unosa TEXT;
ALTER TABLE pgz_sport.natjecanja ADD COLUMN IF NOT EXISTS razina_natjecanja TEXT;
UPDATE pgz_sport.klubovi SET web=...; # crime list
B) Damir seed list FULL crawl (20 min)
- Insert 27 klubova s URL-om u DB (script:
/tmp/seed_pgz.pytreba schema fix) - Mass klub-web LLM enrich za sve 30+ klubova s URL
- Output: ~30 x 30 osoba = 900 osoba (uprava + igraci)
C) Godisnjak LLM ekstrakcija (60-90 min)
- Skripta vec napisana:
/tmp/godisnjak_llm.py - 18 PDF-ova vec u DB-u (vrsta='godisnjak_szpgz')
- Chunk po 5500 znakova, vLLM Qwen 7B (8K context)
- max_workers=5 (NE 10 - GPU pun!)
- Insert s
savez_izvor=\'godisnjak\',metadata.year=2007..2022 - Potencijal: 5K-10K osoba (svi PGZ sportasi 16 godina unazad)
D) sah-pgz.hr/klubovi/ scrape (10 min)
- Damir potvrdio konkretan URL
- Direktan scrape sa Playwright
- Insert nove sahovske klubove
E) NSPGZ.hr scrape (30 min)
- Sluzbena PGZ nogomet web
- Mora imati registar svih PGZ nogometnih klubova + igraci
F) HVS-pattern probe za druge saveze (60 min)
- Playwright sniff svake klub stranice za:
- rezultati.hbs.hr (bocanje)
- rezultati.hks-cbf.hr (kosarka)
- rezultati.hns.hr (nogomet)
- rezultati.hjs.hr (jedrenje)
- rezultati.has.hr (atletika)
- Ako pronadje token: replicira HVS mass scrape
G) ENTITY DEDUP engine (90 min)
- Kreiraj
/opt/rinet-gpu/sport_pipeline/dedup/klub_dedup.py - 4-step pipeline (gore poglavlje 7)
- BACKUP table prije bilo kakvog merge
- Output: report za Damir review
H) Web URL discovery preko Sudreg (alt. za 503 klubova bez weba)
- Iterate sve PGZ udruge -> match na klubovi.naziv (fuzzy)
- Ako ima OIB -> Google site:.hr "OIB" -> nadji web stranicu
- Ako nema -> probat WHOIS, MX record kluba
I) HSA strelicarstvo direktan JSON (5 min)
https://www.hsa.hr/assets/data/index.json(otkriveno)- Probat sve direktne data path-ove
J) Graph DB sync u Qdrant
- Embed klubovi + osobe u
pgz_sport_v1collection - Za buduci sponsorship intelligence + talent scouting
10. KORISTITI POSTOJECI SCRAPER PATTERN (Damirov standard!)
Damir ima vec scrape pattern u /opt/rinet-gpu/sport_pipeline/scrapers/:
hoo_pw_fetch.py- sync_playwright pattern (referenca!)_common.py- upsert_doc() helper- Standardizirani file headers (filename, version, path, project, author, date, description)
Pravila novih skripti:
- Standardizirani header
- sync_playwright (NE async)
- wait_until='networkidle'
- Backup u tablicu prije svake DB modifikacije
- Audit feed insert
- Git commit nakon svakog scrape-a
11. SCRIPTS u /tmp/ (saved on server)
/tmp/hvs_mass.py ✓ HVS mass scrape (re-pokretat za update)
/tmp/hvs_teams.py ✓ HVS team probe (52 PGZ ekipa)
/tmp/hvs_remap.py ✓ HVS remap missing klubovi
/tmp/godisnjak_pull.py ✓ 18 PDF download + DB insert
/tmp/godisnjak_llm.py ⏳ LLM extract (NIJE pokrenut, GPU full!)
/tmp/sport_pgz_scrape.py ✓ sport-pgz.hr crawl
/tmp/seed_pgz.py ⚠️ Damir seed insert (delovi failed - schema)
/tmp/sudreg_mass.py ✗ Loš pristup, NE re-pokretat
/tmp/quickfix.py ✓ POSK Split + savez merge
/tmp/fix_kat.py ✓ Kategorija boćari
/tmp/multi_sniff.py ✓ Multi-savez API discovery
/tmp/sz_pgz.py ✓ Sportska zajednica probe
/tmp/web_discovery.py ✗ DDG blokira IP
/tmp/savez_clubs.py ⚠️ Slaba klub-detekcija po grad mention
12. KRITICNE LEKCIJE I UPOZORENJA
- GPU FULL (89%) - vLLM uzima 8 GB. NE pokretat 2. LLM. Embedder uzima ostalo.
- EMBEDDER input MORA biti list (
{"input":["text"]}) - NIJE string! - /tmp/dis.py shadow bug - ako se vrati, briši:
mv /tmp/dis.py /tmp/dis.py.shadow_DELETED && rm -rf /tmp/__pycache__ - 3-strike rule - ne lazirati "complete" tvrdnje, uvijek
find /opt -name Xili API check - NIKAD srpski/crnogorski u outputu -
_lang_fixfilter aktivan - POSK je iz Splita ne PGZ! (id=3893 marked SDZ)
- AK Kvarner != KK Kvarner 2010 - razlicit klub! Ne fuzzy-match!
- Damir seed list je AUTORITATIVAN - ne overrida-ga
- Bridge API param je
cmdnecommand - DDG/Google search je BLOKIRAN za ovu IP - ne pokušavat
13. PALANTIR VISION (Damirov long-term plan)
Layer 1 (Seed Sources): Damirov seed list + HOO + HNS + HKS + HRS + HVS + HBS + Sudreg
Layer 2 (Discovery): crawler od svakog seed-a, prati linkove "klub", "clanice"
Layer 3 (Extraction): vLLM Qwen 7B + regex + BS4 -> {naziv, sport, grad, email, telefon, OIB, kontakt osobe, drustvene mreze}
Layer 4 (Enrichment): Google query, FB/IG, WHOIS, MX, schema.org, contact page
Layer 5 (Validation): HTTP, SSL, MX, SMTP handshake, robots.txt, canonical URL
Layer 6 (Graph DB): Neo4j ili `civic.entity_connections` graph: klub -> savez -> liga -> drzava + sponzori + politicki utjecaj
Ekstendira na: firme, opcine, skole, političke organizacije, startupovi - isti pipeline, drugi seed source.
Sponsorship intelligence + talent scouting + AI recommendations + regional sports analytics.
14. KONACNI STATUS DAMIRU (na pre-handoff)
Damir je rekao:
"Napravi jebeno dobre scrapere, bez podataka smo nista! Inteligentno filtriraj i spajaj nemoj da imamo 5x isti klub sa varijantama imena!"
Sto sam dosao do:
- 1808 PGZ osoba u DB (745 igrac + 405 bocar + 6 uprava + 652 drugo)
- HVS API breakthrough - jedini savez s pravim pull-om (795 osoba u 50s)
- 18 godisnjak PDF-ova downloaded i u DB (NIJE jos LLM-extracted)
- Damir seed list = autoritativna baza za 27 klubova + 16 saveza + 10 portala statistike
- Schema extended (external_id, savez_izvor, profile_url, uloga_detalj, metadata)
Sto JE problem:
- 503/525 klubova bez web URL-a (root issue)
- Drugi savezi (osim HVS) nemaju javan API -> trebamo direktan klub web scrape
- Entity dedup engine NIJE jos napravljen (samo 5 backup tabela od ranijih ad-hoc merge-ova)
- Godisnjak LLM ekstrakcija NIJE jos pokrenuta (GPU pun!)
Plan za novu sesiju:
- Schema fix (1 min)
- Crime list URL fix (1 min)
- Insert seed klubovi (5 min)
- Pokrenuti godisnjak LLM extract (60-90 min, najveci ROI)
- sah-pgz.hr/klubovi/ scrape (10 min)
- NSPGZ.hr scrape (30 min)
- Entity dedup engine implementacija (90 min)
- HVS-pattern API probe za druge saveze (60 min)
Total dnevno: 4-6 sati, output ~5K-10K novih osoba u DB-u.
END HANDOFF v2
Damir, ucitaj ovaj file u Claude Project knowledge:
- Path:
/opt/pgz-sport/_handoff/HANDOFF_20260503_1410_SPORT_SCRAPING_PIPELINE.md - 600+ linija, sve credentials, sve API tokens, sve URL-ove
- Sljedeca Claude sesija nastavi s tocno gdje smo stali
ADDENDUM (14:30) — KOREKCIJE i POJASNJENJA
Damir je provjerio stvarno stanje stacka. Ispravljam handoff:
A. STACK ISPRAVKE
A.1 Ollama RADI (ranije rekao $HOME panic)
Komanda ollama list u shell trazi $HOME, ali servis na :11434 RADI normalno.
Aktivni modeli (6):
| Model | Velicina | Namjena |
|---|---|---|
nomic-embed-text:latest |
274 MB | Embedding 768-dim (BACKUP za BGE-M3) |
dabi-budget:latest |
4.7 GB | Qwen2 7B Q4_K_M + LoRA (HRT specijal) |
dabi-3b-hr:latest |
6.2 GB | Qwen2 3B F16 (manji LLM) |
qwen2.5:7b |
4.7 GB | Generic Qwen 7B Q4_K_M |
dabi-gemma:latest |
8.6 GB | Gemma3 4B F16 |
gemma3:4b |
3.3 GB | Gemma3 4B Q4_K_M |
Pristup Ollama-i:
# Embed (nomic-embed-text 768-dim)
curl -X POST http://localhost:11434/api/embed \\
-H "Content-Type: application/json" \\
-d \'{"model":"nomic-embed-text","input":["VK Primorje"]}\'
# Response: {"embeddings":[[...]]} (768-dim)
# Chat (LLM)
curl -X POST http://localhost:11434/api/chat \\
-H "Content-Type: application/json" \\
-d \'{"model":"qwen2.5:7b","messages":[{"role":"user","content":"OK"}],"stream":false}\'
A.2 BGE-M3 endpoint (KRITICNO ISPRAVAK!)
/opt/rinet-gpu/embed_service.py ima 3 prihvacena formata:
# Format 1: texts list -> embeddings list
POST /api/embeddings
Body: {"texts": ["text1", "text2"]}
Response: {"embeddings": [[...1024-dim...], [...1024-dim...]]}
# Format 2: single prompt -> single embedding
POST /api/embeddings
Body: {"prompt": "text"}
Response: {"embedding": [...1024-dim...]}
# Format 3: OpenAI compat (model + input)
POST /api/embeddings
Body: {"model":"bge-m3","input":["t1","t2"]}
Response: ako 1 text -> {"embedding":[]}, ako 2+ -> {"embeddings":[[],[]]}
TRUNCATION: text > 2000 znakova SE TRUNCIRA na 2000. MAX SEQ LENGTH: 512 tokena (env BGE_MAX_LEN). FP16: enabled (env BGE_FP16=1). BATCH SIZE: 8 (env BGE_BATCH_SIZE).
Health check: GET /health -> {"status":"ok","model":"bge-m3","device":"cuda","version":"2.0-fixed"}
A.3 F10 LoRA :8765 - NIJE BROKEN, samo /v1/models 404
Port zivi, dabi-budget Q4_K_M (LoRA) je lokalno servisan. Endpoint /v1/models vraca 404 ali je sluzba moguce funkcionalna na drugim path-evima. TODO: provjerit pravilan endpoint.
B. GODISNJACI - PRAVI PATH (ne /downloads/!)
Stvarna lokacija
Damirov pravi path: /opt/pgz-sport/_data/godisnjaci/
54 fileova ukupno (195 MB):
19 .pdf (originali, 2006-2024)
19 .txt (osnovni OCR)
19 _layout.txt (OCR sa pdftotext -layout)
+ 18 backup u `/opt/pgz-sport/_downloads/godisnjaci_szpgz/` (dupli, manje korisni)
Godine
2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024
Ja sam ranije mislio samo 2006-2022 -> POGRESNO, postoji i 2023, 2024!
Vec OCR-irani -> ne treba pdftotext
Damir vec ima _layout.txt za svaki godisnjak (preserved formatting). Koristiti layout verziju za parsing, jer ima boljo razdvajanje stupaca.
import glob
files_layout = sorted(glob.glob("/opt/pgz-sport/_data/godisnjaci/godisnjak_*_layout.txt"))
files_basic = sorted(glob.glob("/opt/pgz-sport/_data/godisnjaci/godisnjak_2*.txt"))
# Layout verzija je preferirana za parsing tabela imena
C. EMBEDDING STRATEGIJA za godisnjake
Damirova preporuka: BGE-M3 :9879 (Opcija A)
- 1024-dim vektori
- HR jezik dobro suport
- Vec u VRAM-u (gpu memory drzi model)
- Direktno kompatibilno s
pgz_universecollection (47,405 points, 1024-dim Cosine)
Backup: Ollama nomic-embed (Opcija B)
- 768-dim
- CPU-friendly
- NE kompatibilno s pgz_universe -> trebala bi nova collection (npr.
pgz_godisnjaci_768) - Slabija kvaliteta za HR jezik
Konkretan ingest skripta:
import asyncio, glob, httpx, hashlib, json, re
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
async def embed(texts): # BGE-M3, 1024-dim
async with httpx.AsyncClient(timeout=60.0) as c:
r = await c.post("http://localhost:9879/api/embeddings",
json={"texts": texts})
return r.json()["embeddings"]
def chunk_text(t, size=1500): # < 2000 znakova zbog truncation
paragraphs = re.split(r"\\n\\n+", t)
chunks, cur = [], ""
for p in paragraphs:
if len(cur) + len(p) > size:
if cur: chunks.append(cur.strip())
cur = p
else:
cur += "\\n\\n" + p
if cur: chunks.append(cur.strip())
return chunks
async def main():
qdrant = QdrantClient(host="localhost", port=6333)
files = sorted(glob.glob("/opt/pgz-sport/_data/godisnjaci/godisnjak_*_layout.txt"))
print(f"Files: {len(files)}")
all_chunks = []
all_meta = []
for f in files:
year = re.search(r"godisnjak_(\\d{4})", f).group(1)
with open(f) as fp: text = fp.read()
chunks = chunk_text(text)
for i, c in enumerate(chunks):
all_chunks.append(c)
all_meta.append({"year": year, "chunk_idx": i, "source": f.split("/")[-1]})
print(f"Total chunks: {len(all_chunks)}")
# Batch embed (BGE-M3 batch_size=8)
BATCH = 32
points = []
for i in range(0, len(all_chunks), BATCH):
batch = all_chunks[i:i+BATCH]
embeddings = await embed(batch)
for j, (text, emb) in enumerate(zip(batch, embeddings)):
meta = all_meta[i+j]
point_id = int(hashlib.md5(f"godisnjak:{meta[\'source\']}:{meta[\'chunk_idx\']}".encode()).hexdigest()[:15], 16)
points.append(PointStruct(
id=point_id,
vector=emb,
payload={**meta, "text": text[:1500], "type": "godisnjak_pgz"}
))
if i % 200 == 0:
print(f" ... embedded {i}/{len(all_chunks)}")
# Upsert into pgz_universe (vec ima 47K points, dodati 1500-2000 godisnjak chunks)
qdrant.upsert(collection_name="pgz_universe", points=points)
print(f"Done: {len(points)} chunks ingested")
asyncio.run(main())
D. KORISTITI vLLM ZA EKSTRAKCIJU IMENA + ULOGA
Za parsing osoba/uloga iz godisnjaka, koristit vLLM Qwen 7B:
import httpx, json, re
PROMPT = """Ekstrahiraj iz teksta SVA imena osoba i njihove uloge.
Format strogo JSON:
{"osobe": [{"ime":"X","prezime":"Y","klub":"Z","uloga":"predsjednik|igrac|trener|tajnik|fizioterapeut|lijecnik","godina_rodenja":1990}]}
Uloge ISKLJUCIVO: predsjednik, dopredsjednik, tajnik, blagajnik, clan_uprave, igrac, sportas, glavni_trener, trener, pomocni_trener, kondicioni_trener, selektor, izbornik, team_manager, voditelj, lijecnik, fizioterapeut, kineziolog, maser, sudac, volonter
Pravila:
1. Samo HRVATSKE osobe (ne strani sportasi koji su gostovali)
2. Ako klub nije jasan -> ostavi prazan klub
3. NE izmisljaj imena -> samo ona JASNO IZRAZENA u tekstu
4. Vrati VALID JSON bez markdown ```"""
async def extract(chunk_text):
async with httpx.AsyncClient(timeout=120.0) as c:
r = await c.post("http://localhost:8001/v1/chat/completions",
json={
"model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
"messages": [
{"role": "system", "content": PROMPT},
{"role": "user", "content": chunk_text}
],
"temperature": 0.1,
"max_tokens": 3000,
"response_format": {"type": "json_object"}
})
d = r.json()
try:
return json.loads(d["choices"][0]["message"]["content"])
except:
return {"osobe": []}
Throughput: vLLM ~2-3s po chunk-u (5500 chars). 19 godina x ~50 chunks = 950 chunks ~ 30-40 min sa parallel=5. GPU: vec 89% pun, vLLM uzima 8 GB. Embed ne uzima vise (vec u VRAM). max_workers=5!
E. STRATEGIJA ZA NOVU SESIJU
Step 1: Embed godisnjake u pgz_universe (10 min)
- 19 godina x ~50 chunks = 950 chunks
- BGE-M3 batch=8, ~2 min ukupno
- Direktno u pgz_universe (1024-dim Cosine kompatibilan)
Step 2: LLM ekstrakcija osoba/uloga (30-40 min)
- Isti chunks kroz vLLM Qwen 7B
- max_workers=5 (GPU pun)
- Output: ~5K-10K osoba sa savez_izvor='godisnjak', metadata.year=YYYY
Step 3: Match osobe na klubove (10 min)
- Za svaku ekstrahiranu osobu, fuzzy match na pgz_sport.klubovi.naziv
- Ako match -> set klub_id
- Ako ne -> insert klub kao novi PGZ kandidat
Step 4: Entity dedup za nove klubove (30 min)
- 4-step pipeline (gore u handoff sekcija 7)
- Backup table prije bilo kakvog merge
- Damir review prije commit
F. KORISTI POSTOJECI WHEEL — ne re-implementacija!
Postojeci alati:
/opt/rinet-gpu/sport_pipeline/scrapers/_common.py->upsert_doc()/opt/rinet-gpu/embed_service.py-> BGE-M3 service/opt/rinet-gpu/sport_pipeline/scrapers/hoo_pw_fetch.py-> sync_playwright patternpgz_universecollection -> 47K points 1024-dim Cosine (vec spreman)
NE radi:
- Novi embedder
- Nova Qdrant collection (kompatibilno s pgz_universe)
- Async Playwright (sync_playwright je standard)