Eesti keele ühendverbide kompositsionaalsuse määramine [Detecting the compositionality of Estonian particle verbs]

Eleri Aedmaa

Abstract


Detecting multiword expressions is an important task in natural language processing. To address it, researchers have tried to classify such expressions automatically with a range of methods and to determine their compositionality. This article applies statistical methods for measuring the strength of association between words to classify Estonian particle verbs automatically according to how their meaning is formed. It examines which method gives the best results and whether those results are good enough for a classification of particle verbs to rely on that method. The main aim of the study is to find out whether the degree of compositionality of Estonian multiword expressions can be determined automatically with the tools of distributional semantics. To this end, word2vec, a software tool based on distributional semantics, is introduced and applied.

Detecting the compositionality of Estonian particle verbs

This article has two aims: to classify Estonian particle verbs automatically and to detect their degree of compositionality. To group the particle verbs, lexical association measures (AMs) are compared. To detect the degree of compositionality of Estonian particle verbs, a model based on distributional semantics is used. The experiment is carried out with the word2vec tool, using a continuous bag-of-words model, which predicts a word given its context.
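
As an illustration of what a lexical association measure computes, the following minimal sketch scores a verb and particle pair from corpus counts. The abstract does not name the AMs that were compared, so pointwise mutual information and the t-score are assumed here only as two common examples, and all counts are hypothetical.

```python
import math


def pmi(pair_count, verb_count, particle_count, total):
    """Pointwise mutual information (base 2) of a verb + particle pair.

    Compares the observed probability of the pair with the probability
    expected if the verb and the particle occurred independently.
    """
    p_pair = pair_count / total
    p_verb = verb_count / total
    p_particle = particle_count / total
    return math.log2(p_pair / (p_verb * p_particle))


def t_score(pair_count, verb_count, particle_count, total):
    """t-score: observed minus expected pair frequency, scaled by sqrt(observed)."""
    expected = verb_count * particle_count / total
    return (pair_count - expected) / math.sqrt(pair_count)


# Hypothetical counts: (pair, verb, particle) frequencies in a toy corpus
# of 1024 verb + particle candidates. Not data from the article.
TOTAL = 1024
candidates = {
    "candidate_a": (8, 16, 32),
    "candidate_b": (4, 64, 64),
}

# Rank the candidates from the strongest to the weakest association.
ranked_by_pmi = sorted(
    candidates,
    key=lambda name: pmi(*candidates[name], TOTAL),
    reverse=True,
)
```

A higher score means the two words co-occur more often than chance would predict. Different measures can order the same candidates differently, which is why comparing several of them is informative.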

The comparison of AMs revealed that none of the AMs used achieves precision values high enough to classify the particle verbs. Hence, it can be assumed that Estonian particle verbs cannot be divided cleanly into compositional and non-compositional classes, but rather populate a continuum between entirely compositional and entirely non-compositional expressions.

The experiment of assessing the degree of compositionality of the particle verbs with a distributional semantic model proved successful. It is demonstrated that the value of cosine similarity can predict the degree of compositionality of particle verbs. However, to evaluate the method introduced here, it is important to create a ranking of human judgements on semantic compositionality for a series of particle verbs and the base verbs to which they correspond.
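
The cosine-similarity idea can be sketched as follows, assuming that each particle verb and its base verb already have a vector (for example from a trained word2vec model). The three-dimensional vectors and the names below are hypothetical placeholders, not values from the article. The assumption is that a particle verb whose vector points in nearly the same direction as its base verb is more compositional.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors; real word2vec vectors have hundreds of dimensions.
base_verb = [1.0, 0.0, 1.0]
particle_verbs = {
    "particle_verb_a": [2.0, 0.0, 2.0],  # same direction as the base verb
    "particle_verb_b": [0.0, 3.0, 0.0],  # orthogonal to the base verb
}

# Order the particle verbs from most to least similar to the base verb.
ranked = sorted(
    ((name, cosine_similarity(base_verb, vec)) for name, vec in particle_verbs.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

A ranking like `ranked` could then be compared with a ranking of human compositionality judgements, once such a resource exists.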



Keywords


distributional semantics, natural language processing, multiword expressions, particle verbs, Estonian





DOI: http://dx.doi.org/10.5128/ERYa12.01



Copyright (c) 2016 Eleri Aedmaa

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

ISSN 1736-2563 (print)
ISSN 2228-0677 (online)
DOI 10.5128/ERYa.1736-2563