From case form to word: what does frequency show?

Ene Vainik, Geda Paulsen, Ahti Lohk

Abstract


The article addresses the question of noun forms becoming independent words, from the perspective of lexicographic needs. On the assumption that abstract case forms are characterized by a general, stable proportion of occurrence in a corpus, we propose a statistical measure, the distribution index, for deciding whether a word form's frequency of use is sufficient to consider it emancipated from its paradigm and thus a candidate for an independent headword. The index takes into account the form's relative frequency in the corpus by comparing its actual frequency of use with the frequency expected on the basis of the norm, and it allows cases with very different absolute frequencies to be placed on the same scale. In the article we illustrate how the distribution index works by comparing the paradigms of ordinary nouns that have rich sets of forms with the paradigms of words that yield ambiforms. We set a provisional threshold value for the index; a form with a value above it can be regarded as an independent lexeme. The index and the threshold value are tested on data from different corpora (EtTenTen13, ÜK 2019).

***

From inflected form to a word: The role of frequency

This study is motivated by the need for a statistical benchmark that would help lexicographers judge whether a morphological form has reached the stage of grammaticalization at which it counts as an independent lexeme. The article focuses on Estonian nouns, in particular their forms in the 11 semantic cases. This choice rests on the observation that the noun has a special position among the word classes with fuzzy categorial borders (Vainik et al. 2020): the means of nominal morphology serve as a source for ongoing grammaticalization processes in Estonian. Case forms typically yield adverbs and adpositions, a phenomenon that falls under ambiforms (words or forms that can be interpreted as belonging to more than one word class).

The main research question of this study is whether there is a statistical sign indicating that a case form of a noun is emerging as a potentially independent lexeme. Based on the normal distribution of nominal case form frequencies, we established a statistic that measures how prominently a case form occurs in a corpus: the distribution index (D-index). The D-index indicates how closely a particular form’s actual frequency corresponds to its predicted degree of occurrence.

The D-index was tested on a sample of ambiforms (N = 46) and “ordinary” nouns (N = 26), the latter group comprising nouns that display an abundant range of semantic case forms. For this sample, all semantic case forms were extracted from three Estonian corpora: the Balanced Corpus of Estonian, the National Corpus of Estonian 2019, and the web corpus etTenTen13. Based on the analysis, we defined a threshold value (≥ 0.130): forms with D-indexes higher than this value can be regarded as independent lexemes.

We conclude that the threshold value functions as a benchmark to a certain degree: an ambiform with a D-index above the threshold value is a distinctly independent lexeme. Forms with D-indexes below the threshold value may or may not be candidates for a lexical entry in a dictionary, because the statistical parameters are not sufficient for a watertight decision. In those cases a lexicographer’s qualitative analysis is needed.


Keywords


lexicography, corpus linguistics, language technology, word form emancipation, parts of speech, Estonian


DOI: http://dx.doi.org/10.5128/ERYa17.16


Copyright (c) 2021 Ene Vainik, Geda Paulsen, Ahti Lohk

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

ISSN 1736-2563 (print)
ISSN 2228-0677 (online)
DOI 10.5128/ERYa.1736-2563