Eesti vanade sõnakujude tuvastamisest suurte keelemudelitega

Madis Jürviste, Tiina Paet, Sven-Erik Soosaar

Abstract


https://doi.org/10.5128/ERYa21.04

The article describes the capability of large language models in analysing the content of dictionaries containing old (17th- and 18th-century) Estonian. The authors conducted a total of three tests with three large language models (GPT-4o, Gemini 1.5 Pro and Claude 3 Opus). The sample for the first test consisted of old professional titles and social roles, the second of South Estonian words, and the third of old loanwords. The test results show the great potential of language models for linking old word forms with their modern forms: in the future, this could be applied in dictionaries for the diachronic description of headwords, together with references to earlier times and places of attestation.

***

"Identifying Old Estonian word forms using large language models"

As large language models (LLMs) have gained visibility and momentum in society since 2022, numerous researchers have studied how these new technologies can be applied to lexicographic research. This article deals with historical sources: how useful are LLMs in identifying old word forms in 17th- and 18th-century German-Estonian and Estonian-German dictionaries? More precisely, can these technologies reduce the time human researchers need to identify old word forms and connect them with the modern written forms of the same words (even if the original word itself has been replaced by a completely new one over the centuries)? To answer these questions, the authors conducted an empirical qualitative study with three major LLMs: GPT-4o, Gemini 1.5 Pro and Claude 3 Opus. The study analysed the LLMs’ capacities and success rates using API-request-based prompts in three main tests, each with a different sample: 30 old designations of professional titles and societal roles (from 6 sources ranging from Stahl 1637 to Hupel 1780); 54 dialectal words (in Gutslaff 1648); and 20 borrowed words (in 3 sources: Stahl 1637, Gutslaff 1648, and Göseken 1660). In these tests, Claude generally outperformed the other two models. However, the results vary with the characteristics of the sample words (words with a similar orthography are more easily recognised). The high success rate, ranging from 74% to 90%, encourages the authors to consider carrying out tests with a larger sample, possibly encompassing whole dictionaries. This would significantly help lexicographers to trace the diachronic development of words in the entries of large Estonian monolingual explanatory dictionaries.


Keywords


historical lexicography; history of Estonian written language; large language models; Estonian






Copyright (c) 2025 Madis Jürviste, Tiina Paet, Sven-Erik Soosaar

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

ISSN 1736-2563 (print)
ISSN 2228-0677 (online)
DOI 10.5128/ERYa.1736-2563