With or without corpus data? Large language models’ capability to detect and label word senses in Estonian

Lydia Risberg; Eleri Aedmaa; Hanna Pook; Kristina Koppel; Maria Tuulik; Esta Prangel; Margit Langemets

doi:10.5128/ERYa22.10

Abstract

In this article we describe an experiment comparing the ability of large language models (LLMs) to identify word senses, assign them a register, and propose suitable labels, in one case with the help of contexts drawn from the Estonian National Corpus (2023) and in the other relying solely on the linguistic knowledge of the pretrained LLMs. The output of all LLMs was rated as more adequate when they relied on corpus data, with Claude Opus 4.1 giving the best results across the various aspects. For example, the LLMs identified word senses more successfully with the help of corpus material, whereas the senses proposed by the pretrained LLMs without it remained more speculative. The LLMs handled register assignment well in both cases. The overlap between the proposed labels and those of the EKI Combined Dictionary (ÜS 2025), however, varied more from one LLM to another.

***

"With or without corpus data? Large language models’ capability to detect and label word senses in Estonian"

Lexicographers occasionally face difficulties in deciding whether a particular word sense requires a usage label (e.g., colloquial). In descriptive lexicography, labels are based on empirical language data rather than on intuition. In recent decades, corpus data has become a reliable foundation for such decisions, replacing earlier reliance on individual judgment or small card indexes. However, the potential of large language models (LLMs) to assist in labeling tasks has only recently begun to be explored. This study investigates whether and how LLMs can support Estonian lexicographers in assigning dictionary labels to word senses. We compared the performance of three models – Claude Opus 4.1, Gemini 2.5 Pro, and GPT-4o – when asked to identify Estonian word senses and assign register and label information under two conditions: with selected contexts from the Estonian National Corpus (2023) that contained the target word, or in a zero-shot setting without such contexts. The LLMs’ outputs were evaluated both manually and through automated analysis. All LLMs performed more accurately when corpus data was provided, with Claude yielding the best overall results. LLMs generally handled register assignment (e.g., informal vs. neutral/formal) more effectively than sense identification. Although the adequacy of generated meanings varied, LLMs demonstrated a promising ability to detect usage variation in authentic Estonian corpus material. Claude also showed potential for automatically matching its identified senses with those in the EKI Combined Dictionary. While LLM outputs still require expert supervision, the results suggest that LLMs can assist lexicographers in reducing subjectivity and workload when determining usage labels. As of October 2025, Claude appears to be the most promising tool for Estonian.

Keywords

large language models; Estonian National Corpus; lexicography; register; usage labels; Estonian

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

ISSN 1736-2563 (print)
ISSN 2228-0677 (online)
DOI 10.5128/ERYa.1736-2563

Eesti Rakenduslingvistika Ühingu aastaraamat / Estonian Papers in Applied Linguistics