Automatic detection of good example sentences for Estonian learner's dictionaries

Kristina Koppel

Abstract


This article focuses on the creation of version 1.4 of the Estonian module of the tool Good Dictionary Example, or GDEX (Kilgarriff et al. 2008). GDEX is a tool that helps to automatically identify corpus sentences suitable as dictionary example sentences. So far, GDEX modules have been created for English, Slovene, Dutch, Portuguese, Spanish, Japanese, and Estonian. The article first explains the general working principles of the tool. It then focuses on the statistical analysis of the parameters that identify example sentences and on determining the values of those parameters. Based on an evaluation of the parameter values and a comparison of the different modules, a new version 1.4 of the Estonian module is proposed.

"Automatic detection of good dictionary examples in Estonian learner’s dictionaries"

This paper first explains how a tool called Good Dictionary Example (GDEX) (Kilgarriff et al. 2008) scores corpus sentences and helps the lexicographer automatically select the best examples for dictionaries. Second, it introduces the training datasets, which contain example sentences from the Estonian Collocations Dictionary (ECD). Third, it focuses on the different parameters of good dictionary examples.

Most of the paper is based on an analysis of the training datasets and an evaluation of the previous GDEX configurations. For evaluating the configurations, the graphical user interface GDEX Editor was used. Based on the results of statistical analysis and on the evaluation of different configurations, a new configuration 1.4 is introduced. There are 16 new parameters implemented in GDEX 1.4.

The main parameters of GDEX 1.4 are as follows: the desired sentence is a full sentence; sentence length is 4–20 tokens; the sentence contains a verb; it does not contain low-frequency words or words from the blacklist; the optimal length is 6–12 tokens. Sentences are penalized if they contain more than one adverb, pronoun, proper name, numeral, conjunction, or comma, more than two verbs, or certain pronouns.
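The parameters listed above can be illustrated with a minimal sketch. It is not the GDEX 1.4 configuration itself: the part-of-speech labels, the uniform penalty weight, the full-sentence heuristic (capitalized first token, terminal punctuation), and the function name `gdex_like_score` are all assumptions made for illustration, since the abstract gives neither the tagset nor the actual weights.

```python
from collections import Counter

# Hypothetical coarse POS labels; the real tagset of the Estonian module is not given here.
LIMITED_TO_ONE = ("ADV", "PRON", "PROPN", "NUM", "CONJ", "COMMA")
MAX_VERBS = 2
PENALTY = 0.1  # assumed uniform weight; the actual weights are not stated in the abstract


def gdex_like_score(tokens, blacklist=frozenset(), rare_words=frozenset(),
                    penalized_pronouns=frozenset()):
    """Score a tagged sentence: 0.0 if rejected outright, otherwise up to 1.0.

    tokens: list of (word, pos) pairs; punctuation marks count as tokens.
    """
    if not tokens:
        return 0.0
    words = [word.lower() for word, _ in tokens]
    pos_counts = Counter(pos for _, pos in tokens)

    # Hard filters: any failure rejects the sentence.
    first_word, last_word = tokens[0][0], tokens[-1][0]
    # Assumed heuristic for "full sentence".
    if not (first_word[:1].isupper() and last_word in {".", "!", "?"}):
        return 0.0
    if not 4 <= len(tokens) <= 20:
        return 0.0
    if pos_counts["VERB"] == 0:
        return 0.0
    if any(w in blacklist or w in rare_words for w in words):
        return 0.0

    # Soft penalties: each violated preference lowers the score.
    violations = 0
    if not 6 <= len(tokens) <= 12:
        violations += 1
    violations += sum(1 for pos in LIMITED_TO_ONE if pos_counts[pos] > 1)
    if pos_counts["VERB"] > MAX_VERBS:
        violations += 1
    if any(w in penalized_pronouns for w in words):
        violations += 1
    return round(1.0 - PENALTY * violations, 2)


good = [("Lapsed", "NOUN"), ("mängivad", "VERB"), ("õues", "ADV"),
        ("uue", "ADJ"), ("palliga", "NOUN"), (".", "PUNCT")]
print(gdex_like_score(good))
```

Separating hard filters from soft penalties mirrors the distinction in the parameter list between conditions a sentence must meet and properties that only lower its rank; how GDEX 1.4 actually combines them may differ.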

The output of GDEX 1.4 can be applied in the ECD project and used to create SkELL, a web interface for learners of Estonian.


Keywords


corpus lexicography, corpus linguistics, learner lexicography, language learning, collocations, example sentences, GDEX, Estonian


References


Baisa, Vít; Suchomel, Vít 2014. SkELL: Web Interface for English Language Learning. – Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 63–70.

Kaalep, Heiki-Jaan 1998. Tekstikorpuse abil loodud eesti keele morfoloogiaanalüsaator. [An Estonian morphological analyser and using a corpus on its development.] – Keel ja Kirjandus, 1, 22–29.

Kallas, Jelena; Koppel, Kristina; Tuulik, Maria 2015. Korpusleksikograafia uued võimalused eesti keele kollokatsioonisõnastiku näitel. [New possibilities in Corpus Lexicography based on the Examples of the Estonian Collocation Dictionary.] – Eesti Rakenduslingvistika Ühingu aastaraamat, 11, 75–94. http://dx.doi.org/10.5128/ERYa11.05

Kilgarriff, Adam; Husák, Milos; McAdam, Katy; Rundell, Michael; Rychlý, Pavel 2008. GDEX: Automatically finding good dictionary examples in a corpus. – E. Bernal, J. DeCesaris (Eds.), Proceedings of the 13th EURALEX International Congress. Barcelona: Institut Universitari de Linguistica Aplicada, Universitat Pompeu Fabra, 425–432.

Kilgarriff, Adam; Rychlý, Pavel; Smrž, Pavel; Tugwell, David 2004. The Sketch Engine. – G. Williams, S. Vessier (Eds.), Proceedings of the 11th EURALEX International Congress. Lorient, France: Université de Bretagne Sud, 105–115.

Koppel, Kristina; Kallas, Jelena 2016. Õppijasõbralik korpuslause: automaatse valiku võimalusi. [User-friendly corpus sentence: Parameters for automatic selection.] – Lähivõrdlusi. Lähivertailuja, 26, 222–250. http://dx.doi.org/10.5128/LV26.07

Kosem, Iztok; Gantar, Polona; Krek, Simon 2013. Automation of lexicographic work: An opportunity for both lexicographers and crowd-sourcing. – I. Kosem, J. Kallas, P. Gantar, S. Krek, M. Langemets, M. Tuulik (Eds.), Electronic Lexicography in the 21st Century: Thinking Outside the Paper. Proceedings of the eLex 2013, 17–19 October 2013. Ljubljana–Tallinn: Trojina, Institute for Applied Slovene Studies, Eesti Keele Instituut, 32–48.

Schmid, Helmut 1994. Probabilistic part-of-speech tagging using decision trees. – Proceedings of International Conference on New Methods in Language Processing. Manchester, UK, 44–49.




DOI: http://dx.doi.org/10.5128/ERYa13.04



Copyright (c) 2017 Kristina Koppel

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

ISSN 1736-2563 (print)
ISSN 2228-0677 (online)
DOI 10.5128/ERYa.1736-2563