Embeddings & Citation Pipeline
Semantic search across election programmes (Wahlprogramme), citation reconstruction, and PDF highlighting.
Retrieval
app.embeddings.find_relevant_chunks(query, parteien=None, typ=None, bundesland=None, top_k=3, min_similarity=0.5)
Find the most relevant chunks for a query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bundesland | str | If set, only chunks from this Bundesland OR global chunks (bundesland IS NULL, e.g. Grundsatzprogramme) are considered. If None, no filter is applied. | None |
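The bundesland filter and the two cut-offs (top_k, min_similarity) can be illustrated with a small in-memory sketch. This is not the implementation of find_relevant_chunks: the chunk dict layout, the `embedding` key, and the helper names `cosine` and `filter_and_rank` are assumptions made for illustration, and a real index would presumably apply the filter in the database query.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_and_rank(query_vec, chunks, bundesland=None, top_k=3, min_similarity=0.5):
    """Hypothetical helper mirroring the documented filter semantics."""
    scored = []
    for chunk in chunks:
        # Keep chunks of the requested Bundesland OR global chunks (bundesland is None).
        if bundesland is not None and chunk["bundesland"] not in (bundesland, None):
            continue
        sim = cosine(query_vec, chunk["embedding"])
        if sim >= min_similarity:
            scored.append((sim, chunk))
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

# Illustrative data only; IDs and vectors are made up.
chunks = [
    {"id": "nrw-1", "bundesland": "NRW", "embedding": [1.0, 0.0]},
    {"id": "by-1", "bundesland": "BY", "embedding": [1.0, 0.0]},
    {"id": "global-1", "bundesland": None, "embedding": [0.8, 0.6]},
    {"id": "nrw-2", "bundesland": "NRW", "embedding": [0.0, 1.0]},
]
hits = filter_and_rank([1.0, 0.0], chunks, bundesland="NRW")
print([c["id"] for c in hits])
```

With bundesland="NRW", the Bavarian chunk is filtered out, the global chunk stays eligible, and the orthogonal NRW chunk falls below min_similarity.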
app.embeddings.get_relevant_quotes_for_antrag(antrag_text, fraktionen, bundesland, top_k_per_partei=2)
Get relevant quotes from Wahl- and Parteiprogramme for an Antrag.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bundesland | str | Mandatory. Determines which Wahlprogramme are searched and which governing parliamentary groups (Regierungsfraktionen) are included in addition to the parties that submitted the Antrag. | required |
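How the set of searched parties might be derived can be sketched as follows. Everything here is hypothetical: the lookup table `REGIERUNG`, the placeholder party names, the helper names, and the injected `retrieve` callable are assumptions for illustration, not part of app.embeddings.

```python
def parties_to_search(fraktionen, bundesland, regierung_by_land):
    """Antragsteller first, then the governing parties of the Bundesland, without duplicates."""
    ordered = []
    for partei in list(fraktionen) + list(regierung_by_land.get(bundesland, [])):
        if partei not in ordered:
            ordered.append(partei)
    return ordered

def collect_quotes(antrag_text, parteien, retrieve, top_k_per_partei=2):
    """Call the injected retrieve(text, partei) once per party and cap the hits."""
    return {
        partei: list(retrieve(antrag_text, partei))[:top_k_per_partei]
        for partei in parteien
    }

# Hypothetical lookup table with placeholder values.
REGIERUNG = {"XX": ["Partei A", "Partei B"]}

def fake_retrieve(text, partei):
    # Stand-in for a real per-party semantic search.
    return [f"{partei} chunk {n}" for n in range(1, 4)]

parteien = parties_to_search(["Partei C", "Partei A"], "XX", REGIERUNG)
quotes = collect_quotes("Antragstext", parteien, fake_retrieve)
print(parteien)
```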
Prompt Formatting
app.embeddings.format_quotes_for_prompt(quotes, searched_parties=None)
Format quotes for inclusion in LLM prompt.
Each chunk gets a stable ENUM-ID ([Q1], [Q2], …) and the prompt instructs the LLM to anchor every citation in one of those IDs and to copy the snippet verbatim from the cited chunk. This is the structural fix for Issue #60: pre-#60 the LLM was free to invent snippets under real source labels because nothing in the prompt bound a citation to a specific retrieved chunk.
Each quote is annotated with the fully-qualified source (programme name + page) so the LLM cannot fall back on training-set defaults when constructing its citations.
Issue #63 extends this: when searched_parties is passed, parties for
which no chunk was retrieved are explicitly marked in the prompt
as "keine Quellen im Index" (no sources in the index). The LLM is
instructed to set score: null for these parties instead of guessing
from its training knowledge.
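A minimal sketch of such a prompt block is shown below. The quote dict keys (`partei`, `programm`, `seite`, `text`), the exact line layout, and the instruction wording are assumptions; only the [Q#] IDs, the source annotation, and the "keine Quellen im Index" marker come from the description above.

```python
def format_quotes_sketch(quotes, searched_parties=None):
    """Hypothetical formatter: one [Q#] ID per chunk plus markers for parties without sources."""
    lines = []
    for i, q in enumerate(quotes, start=1):
        # Fully qualified source (programme name + page) next to the ID.
        lines.append(f"[Q{i}] {q['partei']} | {q['programm']}, S. {q['seite']}")
        lines.append(f'"{q["text"]}"')
    covered = {q["partei"] for q in quotes}
    for partei in searched_parties or []:
        if partei not in covered:
            # Searched, but nothing retrieved: tell the LLM not to guess.
            lines.append(f"{partei}: keine Quellen im Index (score: null)")
    lines.append("Cite only by [Q#] and copy each snippet verbatim from the cited chunk.")
    return "\n".join(lines)

# Placeholder data for illustration.
quotes = [
    {"partei": "Partei A", "programm": "Wahlprogramm A", "seite": 12,
     "text": "Wir wollen den Nahverkehr ausbauen."},
]
prompt = format_quotes_sketch(quotes, searched_parties=["Partei A", "Partei B"])
print(prompt)
```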
Citation Post-Processing (Issue #60)
app.embeddings.reconstruct_zitate(data, semantic_quotes)
Replace LLM-emitted quelle/url with canonical chunk values; drop unbacked.
Walks over data['wahlprogrammScores'][i][kind]['zitate'] (the raw
LLM-output dict, not the Pydantic model). For each Zitat:
- Locate the chunk whose text contains the snippet (or a 5-word anchor from it). Search across all retrieved chunks regardless of party, so cross-mixes between Q-IDs become invisible to the persisted output.
- If found: overwrite quelle and url with values derived from the matching chunk's programm_id + seite. The LLM is no longer trusted for these fields.
- If not found: drop the Zitat entirely.
Returns the same data dict (mutated in place) for chaining.
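The walk described above could look roughly like the following sketch. It is not the actual reconstruct_zitate: the Zitat key `text`, the `wahlprogramm` kind key, the quelle label format, and the URL query parameters are assumptions, and the matcher is a simplified substring-only stand-in.

```python
def find_chunk(snippet, chunks):
    """Simplified stand-in matcher: strict substring only."""
    for chunk in chunks:
        if snippet and snippet in chunk["text"]:
            return chunk
    return None

def reconstruct_zitate_sketch(data, chunks):
    """Overwrite quelle/url from the matching chunk; drop Zitate no chunk backs."""
    for entry in data.get("wahlprogrammScores", []):
        for block in entry.values():
            if not isinstance(block, dict) or "zitate" not in block:
                continue
            kept = []
            for zitat in block["zitate"]:
                chunk = find_chunk(zitat.get("text", ""), chunks)
                if chunk is None:
                    continue  # unbacked snippet: drop it
                # Canonical values come from the chunk, never from the LLM.
                zitat["quelle"] = f"{chunk['programm_id']}, S. {chunk['seite']}"
                zitat["url"] = (
                    "/api/wahlprogramm-cite"
                    f"?programm_id={chunk['programm_id']}&seite={chunk['seite']}"
                )  # hypothetical query parameters
                kept.append(zitat)
            block["zitate"] = kept
    return data  # same dict, mutated in place

# Placeholder data for illustration.
chunks = [{"programm_id": "prog-a", "seite": 12,
           "text": "Wir wollen den Nahverkehr ausbauen und die Taktung verdichten."}]
data = {"wahlprogrammScores": [{"partei": "Partei A", "wahlprogramm": {"score": 7, "zitate": [
    {"text": "den Nahverkehr ausbauen und die Taktung verdichten",
     "quelle": "invented label", "url": "http://example.invalid"},
    {"text": "a sentence that appears in no retrieved chunk", "quelle": "x", "url": "y"},
]}}]}
result = reconstruct_zitate_sketch(data, chunks)
print(result["wahlprogrammScores"][0]["wahlprogramm"]["zitate"])
```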
app.embeddings.find_chunk_for_text(text, chunks)
Locate the retrieved chunk that a Zitat snippet was copied from.
Two-stage match identical to Sub-D:
- Strict substring — full needle as substring of any chunk.
- 5-word anchor — any 5 consecutive words of the needle as substring of any chunk.
Snippets shorter than 20 characters are rejected (too weak to bind). Returns the matching chunk dict, or None.
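The two stages and the length guard can be sketched as below. The whitespace normalization and the names `find_chunk_sketch`, `MIN_SNIPPET_CHARS`, and `ANCHOR_WORDS` are assumptions; the actual function may normalize text differently.

```python
MIN_SNIPPET_CHARS = 20  # shorter snippets are too weak to bind
ANCHOR_WORDS = 5

def find_chunk_sketch(text, chunks):
    """Return the first chunk matching the snippet, or None."""
    needle = " ".join(text.split())  # collapse whitespace (assumption)
    if len(needle) < MIN_SNIPPET_CHARS:
        return None
    normalized = [(" ".join(c["text"].split()), c) for c in chunks]
    # Stage 1: the full needle is a substring of some chunk.
    for hay, chunk in normalized:
        if needle in hay:
            return chunk
    # Stage 2: any 5 consecutive words of the needle are a substring of some chunk.
    words = needle.split(" ")
    for start in range(len(words) - ANCHOR_WORDS + 1):
        anchor = " ".join(words[start:start + ANCHOR_WORDS])
        for hay, chunk in normalized:
            if anchor in hay:
                return chunk
    return None

# Placeholder chunks for illustration.
chunks = [
    {"id": 1, "text": "Wir fordern mehr Personal in Kitas und Schulen im ganzen Land."},
    {"id": 2, "text": "Der Ausbau erneuerbarer Energien hat für uns oberste Priorität."},
]
exact = find_chunk_sketch("mehr Personal in Kitas und Schulen", chunks)
anchored = find_chunk_sketch(
    "Die Partei sagt: Ausbau erneuerbarer Energien hat für uns höchste Priorität", chunks
)
too_short = find_chunk_sketch("mehr Personal", chunks)
print(exact["id"], anchored["id"], too_short)
```

The second call fails the strict stage because the snippet was paraphrased, but the anchor "Ausbau erneuerbarer Energien hat für" still binds it to the second chunk.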
PDF Highlighting (Issue #47)
app.embeddings.render_highlighted_page(programm_id, seite, query)
Render a single Wahlprogramm page with yellow highlights for a query.
Used by the /api/wahlprogramm-cite endpoint to serve a one-page
PDF where the cited snippet is visually highlighted via PyMuPDF
add_highlight_annot.
Returns a tuple (pdf_bytes, found_page, highlighted) where
pdf_bytes is the serialized PDF, found_page is the 1-indexed page
number, and highlighted is True if the text was found and annotated.
Returns (None, 0, False) if the programme/page can't be resolved.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| programm_id | str | Key into PROGRAMME registry; validated by the caller. | required |
| seite | int | 1-indexed page number within the programme PDF. | required |
| query | str | Snippet text to search and highlight on the page. Long queries are truncated to the first 200 characters before the search; PyMuPDF's | required |
Indexing
app.embeddings.index_programm(programm_id, pdf_dir)
Index a single programme PDF into the embeddings database.