How to recognize an adjective? An analysis of corpus behaviour patterns

Maria Tuulik, Ene Vainik, Geda Paulsen, Ahti Lohk

Abstract


In this article, we examine the morphosyntactic features of the adjective and determine to what extent the features attributed to the prototypical adjective (e.g. agreement, formation of degrees of comparison) are specific to the adjective class. Based on these features, we create parameters that allow us to distinguish adjectives from other word classes on the basis of corpus data. The longer-term goal of identifying the corpus profile of a typical adjective is to build an application that would let lexicographers check a word's degree of adjectivization in unclear cases. We present the results of testing six parameters on 12 word groups, each containing 10 words. In selecting the words, we take into account borderline cases of the adjective class and lexicographic problem areas. We analyse the extent to which the different test groups deviate, on the tested parameters, from the prototypical representative of the adjective class, and we also examine variation within the adjective class. The deviation analysis makes it possible to identify the parameters with the best discriminating power. Measuring Euclidean distance clearly separates adjective-like words and groups from those that resemble the prototypical adjective less.

***

How to recognize adjectives? An analysis of corpus patterns

This study was inspired by a survey of Estonian lexicographers (Paulsen, Vainik and Tuulik 2019), in which the lexicographers expressed the need for a new digital tool that would facilitate word class identification in ambiguous cases. For adjectives, the lexicographers emphasized the difficulty of determining whether a verb participle has sufficient adjectival use to be included in dictionaries as an adjective.

In the article, we examine the morphosyntactic features characteristic of the adjective class and test different corpus-based parameters for differentiating adjectives from other word classes. We provide an overview of the test results for six parameters. We analysed 12 groups of 10 words each. The test groups and test words were chosen manually, with consideration given to the problematic cases outlined by the lexicographers. We compared different types of adjectives and near-adjectives (the test groups) as well as different word classes (the control groups).

To analyse the parameters’ capability to set adjectives apart, a deviation study was conducted. We determined a normative range for prototypical adjectives and set the minimum and maximum value for every parameter. In addition, we calculated the deviation of other test groups from the prototypical adjective range.
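The range-and-deviation idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not give the exact deviation formula, so the sketch assumes that deviation is the distance from a value to the nearest bound of the prototypical range (zero inside the range). The function names and the numbers are hypothetical and are not taken from the study.

```python
def normative_range(prototype_values):
    """Return (minimum, maximum) of one parameter over the prototypical adjectives."""
    return min(prototype_values), max(prototype_values)


def deviation(value, low, high):
    """Distance from value to the nearest bound of [low, high]; 0.0 inside the range.

    Assumption: the study's own deviation measure may be defined differently.
    """
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


# Hypothetical values of one parameter for four prototypical adjectives.
prototype_values = [0.42, 0.55, 0.61, 0.48]
low, high = normative_range(prototype_values)

# Hypothetical values of the same parameter for two other test words.
for word_value in (0.10, 0.50):
    print(word_value, deviation(word_value, low, high))
```

A word whose value falls inside the range gets a deviation of zero, so only words outside the prototypical range contribute to a group's deviation on that parameter.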

The groups of particular focus (regular verb participles vs. adjectives) were best differentiated by three parameters. The sentence-beginning testword+noun parameter (which determined if and how often a test word starts a sentence in the complement position) set participles apart with 90% accuracy. The parameter that measured the existence of comparative forms for test words was 100% accurate. The adverb parameter (which measured how often a test word is preceded by an adverb) distinguished adjectives from verb participles with 80% accuracy. Across all groups, the comparative form parameter was the most accurate in the deviation study at setting prototypical adjectives apart from other test groups.

A Euclidean distance analysis was able to differentiate adjective-like test words and test groups from those that do not behave similarly to prototypical adjectives. As all tested parameters produced meaningful results and were able to differentiate some word classes from adjectives, they can serve as input for a new digital tool that would show a word's deviation from prototypical word class representatives and help lexicographers with word-class-related decisions.
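As a rough illustration of the distance analysis, the sketch below treats each word as a vector of parameter values and measures its Euclidean distance to a prototype profile. Several things here are assumptions rather than details from the study: the prototype is taken to be the mean profile of the prototypical adjectives, the parameter values are assumed to lie on a comparable scale, and all names and numbers are hypothetical.

```python
import math


def centroid(profiles):
    """Mean value of each parameter across a list of equally long profiles."""
    count = len(profiles)
    return [sum(column) / count for column in zip(*profiles)]


def distance_to_prototype(profile, prototype):
    """Euclidean distance between a word's profile and the prototype profile."""
    return math.dist(profile, prototype)  # math.dist requires Python 3.8+


# Hypothetical six-parameter profiles of two prototypical adjectives.
prototypical_profiles = [
    [0.9, 0.8, 1.0, 0.7, 0.6, 0.9],
    [0.7, 0.6, 1.0, 0.9, 0.8, 0.7],
]
prototype = centroid(prototypical_profiles)

# Hypothetical profile of a verb participle under examination.
participle_profile = [0.2, 0.1, 0.0, 0.3, 0.4, 0.2]
print(distance_to_prototype(participle_profile, prototype))
```

Under these assumptions, a smaller distance means that the word behaves more like a prototypical adjective across all parameters at once, which is the kind of single score a lexicographic tool could display.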


Keywords


parts of speech, morphosyntax, lexicography, language technology, Estonian


DOI: http://dx.doi.org/10.5128/ERYa18.16


Copyright (c) 2022 Maria Tuulik, Ene Vainik, Geda Paulsen, Ahti Lohk

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

ISSN 1736-2563 (print)
ISSN 2228-0677 (online)
DOI 10.5128/ERYa.1736-2563