Automatic classification of Estonian web texts

Kristiina Vaik, Kadri Muischnek

Abstract


The Internet is an important language resource, and one of its uses in linguistics and language technology is compiling the texts found there into a language corpus. A corpus collected fully automatically, however, presents a new situation: plenty of data is available, but it is not known exactly what kind of language material it contains. For studying and processing natural language, texts need to be distinguished by text type, because the choice of suitable processing tools depends on it. In this article we describe a simplified version of text type identification: a binary classification experiment on the Estonian Web 2013 (etTenTen13) corpus, whose aim was to classify texts into those that follow the written standard and those that do not. As training data we used the Balanced Corpus (Tasakaalus korpus) to represent the written standard and the New Media Corpus (Uue meedia korpus) to represent texts that do not follow it; as test data we used a manually labelled subcorpus of Estonian Web 2013. To build the classification models we applied different supervised machine learning algorithms, with bags of words as features. We evaluated the quality of the models by 10-fold cross-validation, where the best result came from an algorithm based on artificial neural networks, which assigned a document to the correct class with 99% accuracy. We then tried the models on the manually labelled Estonian Web 2013 test corpus, where the neural network based algorithm again gave the best result, with an accuracy of 74%.
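The 10-fold cross-validation mentioned above can be illustrated with a minimal sketch. It is not the authors' code: the function names `k_fold_indices` and `cross_val_accuracy` are hypothetical, and the sketch uses only the Python standard library, so that the splitting logic is visible without relying on any particular toolkit.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Assumes n_items >= k so that no fold is empty.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, test_idx


def cross_val_accuracy(items, labels, train_fn, predict_fn, k=10):
    """Average accuracy over k folds.

    train_fn(items, labels) returns a model;
    predict_fn(model, item) returns a predicted label.
    """
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, items[i]) == labels[i]
                      for i in test_idx)
        fold_scores.append(correct / len(test_idx))
    return sum(fold_scores) / len(fold_scores)
```

Each document is used for testing exactly once and for training in the other nine folds, which is why a cross-validation score can still be much higher than the score on data from a different source, as the gap between 99% and 74% suggests.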

***

Classifying Estonian Web texts

Due to the size of the Internet and the multitude of traditional and new genres, there has been an increasing interest in automatic genre classification. Labelling texts is essential in natural language processing because it allows us to select more appropriate language models for the analysis. The aim of the article is to describe and present the results of automatically classifying Estonian Web 2013 texts. We evaluated the quality of different classification models on our training set and on a manually labelled test set.

Most of the research on automatic classification has focused on classifying multiple genres, while our objective was binary classification. We set out to classify Estonian Web 2013 texts based on whether they are canonical or not. For training we used the Balanced Corpus to represent canonical language and the New Media Corpus to represent non-canonical language. Because no binary-labelled subcorpus of Estonian Web 2013 texts was available, we compiled one ourselves by labelling texts manually. For classification we used different supervised machine learning algorithms, with a simple bag-of-words representation as features. Preliminary experiments show that neural networks outperformed the other machine learning algorithms, reaching an accuracy above 0.7 on the test set.
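To make the bag-of-words setup concrete, here is a minimal sketch of a binary text classifier. It is an illustration under stated assumptions, not the authors' pipeline: it swaps in a multinomial Naive Bayes classifier with add-one smoothing in place of the neural network, uses only the Python standard library, and trains on four invented toy sentences that are not taken from the Balanced Corpus or the New Media Corpus. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count word tokens, ignoring word order."""
    return Counter(re.findall(r"\w+", text.lower()))


def train_naive_bayes(texts, labels):
    """Collect per-class word counts and document counts."""
    word_counts = {}
    for text, label in zip(texts, labels):
        word_counts.setdefault(label, Counter()).update(bag_of_words(text))
    vocab = set().union(*word_counts.values())
    return {"word_counts": word_counts,
            "doc_counts": Counter(labels),
            "vocab": vocab}


def predict(model, text):
    """Return the class with the highest log-probability (add-one smoothing)."""
    bow = bag_of_words(text)
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, counts in model["word_counts"].items():
        total_words = sum(counts.values())
        score = math.log(model["doc_counts"][label] / total_docs)
        for word, n in bow.items():
            score += n * math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data, for illustration only.
train_texts = [
    "Valitsus kinnitas täna uue eelarve.",
    "Ülikool avaldas uuringu tulemused.",
    "lol ma ei tea mis sa räägid",
    "nii lahe ma tahan ka lol",
]
train_labels = ["canonical", "canonical", "non_canonical", "non_canonical"]
model = train_naive_bayes(train_texts, train_labels)
```

With this toy model, `predict(model, "valitsus kinnitas eelarve")` is assigned to the canonical class and `predict(model, "lol ma ei tea")` to the non-canonical one. A bag of words discards word order entirely, which is one reason to consider the additional features discussed in the summary.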

The overall results of this study indicate that new features should be added to increase the accuracy of the classifiers (e.g. POS counts, sentences per paragraph, words per sentence, uppercase and lowercase letters per sentence). Our best model, the neural network classifier, achieved an accuracy of 0.99 in 10-fold cross-validation on the training data but only a little over 0.74 on the test set. This suggests that future work requires a bigger and more appropriate training set. The manual labelling task showed us that the transition from canonical to non-canonical language is very smooth. The current models produce a score between 0 and 1 that determines whether an item belongs to a class. The classification models should therefore be set up so that their predictions can be tuned by selecting a decision threshold.
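The idea of tuning predictions by selecting a threshold can be sketched as follows. This is a hedged illustration, not the authors' method: the function names are hypothetical, the scores and gold labels are invented, and a score at or above the threshold is assumed to mean "canonical".

```python
def accuracy_at_threshold(scores, gold, threshold):
    """Accuracy when items scoring at or above the threshold are labelled True."""
    predicted = [score >= threshold for score in scores]
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def best_threshold(scores, gold, candidates=None):
    """Return the candidate threshold with the highest accuracy.

    Ties are resolved in favour of the first (lowest) candidate.
    """
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(candidates,
               key=lambda t: accuracy_at_threshold(scores, gold, t))


# Invented classifier scores and gold labels (True = canonical).
scores = [0.95, 0.80, 0.62, 0.58, 0.30, 0.10]
gold = [True, True, True, False, False, False]
threshold = best_threshold(scores, gold)
```

On this toy data the default cut-off of 0.5 misclassifies the item scored 0.58, while the selected threshold separates the two classes. In practice the threshold would be chosen on held-out labelled data rather than on the test set itself.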



Keywords


corpus linguistics, automatic classification, natural language processing, machine learning, genre, corpus, Estonian


References


Asheghi, Noushin Rezapour; Sharoff, Serge; Markert, Katja 2016. Crowdsourcing for web genre annotation. – Language Resources and Evaluation, 50 (3), 603–641. https://doi.org/10.1007/s10579-015-9331-6

Berninger, Vera; Kim, Yunhyong; Ross, Seamus 2008. Building a document genre corpus: A profile of the KRYS I corpus. – Proceedings of the BCS-IRSG Workshop on Corpus Profiling, London, UK, October 18.

Biber, Douglas 1988. Variation across speech and writing. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511621024

Biber, Douglas 1995. Dimensions of Register Variation: A Cross-linguistic Comparison. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511519871

Biber, Douglas; Conrad, Susan 2009. Register, Genre, and Style. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511814358

Bird, Steven; Klein, Ewan; Loper, Edward 2009. Learning to Classify Text. – Natural Language Processing with Python. http://www.nltk.org/book/ch06.html (10.9.2017).

Crowston, Kevin; Kwaśnik, Barbara; Rubleske, Joseph 2011. Problems in the use-centered development of a taxonomy of web genres. – Genres on the Web: Computational Models and Empirical Studies 42. New York: Springer, 69–84. https://dx.doi.org/10.1007/978-90-481-9178-9_4

Egbert, Jesse; Biber, Douglas 2013. Developing a user-based method of register classification. – Proceedings of the 8th Web as Corpus Workshop, WAC-8 2013, 16–23.

Egbert, Jesse; Biber, Douglas; Davies, Mark 2015. Developing a bottom-up, user-based method of web register classification. – Journal of the Association for Information Science and Technology, 66 (9), 1817–1831. https://doi.org/10.1002/asi.23308

Haiba, Sabiina 2016. Kuidas on netikeel muutunud aastatel 2001–2008? Bachelor's thesis. Tartu Ülikool, Institute of Computer Science. http://hdl.handle.net/10062/56225

Hennoste, Tiit 2000. Eesti keele allkeeled. – T. Hennoste (Ed.), Tartu Ülikooli eesti keele õppetooli toimetised 16. Tartu: Tartu Ülikooli Kirjastus, 9–57.

Hennoste, Tiit 2013. Kuule ma eemale nüüd. – Sirp (46), 40.

Jakubíček, Miloš; Kilgarriff, Adam; Kovář, Vojtěch; Rychlý, Pavel; Suchomel, Vít 2013. The TenTen Corpus Family. – Lancaster, 7th International Corpus Linguistics Conference CL 2013, 125–127.

Kallas, Jelena; Koppel, Kristina; Tuulik, Maria 2015. Korpusleksikograafia uued võimalused eesti keele kollokatsioonisõnastiku näitel [‘New possibilities in corpus lexicography based on the example of the Estonian Collocations Dictionary’]. – Eesti Rakenduslingvistika Ühingu aastaraamat, 11, 75−94. https://doi.org/10.5128/ERYa11.05

Kasik, Reet 2007. Sissejuhatus tekstiõpetusse. E. Uuspõld (Ed.). Tartu: Tartu Ülikooli Kirjastus.

Laippala, Veronika; Luotolahti, Juhani; Kyröläinen, Aki-Juhani; Salakoski, Tapio; Ginter, Filip 2017. Creating register sub-corpora for the Finnish Internet Parsebank. – Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa 2017, Gothenburg, Sweden, 152–161.

Santini, Marina 2007. Automatic Identification of Genre in Web Pages. Dissertation. University of Brighton, Computational Linguistics.

Sharoff, Serge; Wu, Zhili; Markert, Katja 2010. The Web Library of Babel: Evaluating genre collections. – Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, Valletta, Malta, May 17–23.

Stubbe, Andrea; Ringlstetter, Christoph 2007. Recognizing genres. – Abstract Proceedings of the Colloquium “Towards a reference corpus of web genres”, Birmingham, UK, July 27.

Särg, Dage 2015. Internetikeele süntaktiline analüüs kitsenduste grammatikaga. Master's thesis. Tartu Ülikool. http://hdl.handle.net/10062/47666

Web resources

Estonian Web 2013. http://www2.keeleveeb.ee/dict/corpus/ettenten/about.html (26.9.2017).

Koondkorpus. https://keeleressursid.ee/et/keeleressursid-cl-ut/korpused/83-article/clutee-lehed/192-segakorpus (26.9.2017).

Scikit-learn library. http://scikit-learn.org/stable/ (14.9.2017).

Tasakaalus korpus. https://keeleressursid.ee/et/keeleressursid-cl-ut/korpused/83-article/clutee-lehed/187-grammatikakorpus (26.9.2017).

Uue meedia korpus. https://keeleressursid.ee/et/keeleressursid-cl-ut/korpused/83-article/clutee-lehed/212-koondkorpus-uus-meedia (26.9.2017).

Workshop on Noisy User-Generated Text. http://noisy-text.github.io (27.9.2017).




DOI: http://dx.doi.org/10.5128/ERYa14.13



Copyright (c) 2018 Kristiina Vaik, Kadri Muischnek

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

ISSN 1736-2563 (print)
ISSN 2228-0677 (online)
DOI 10.5128/ERYa.1736-2563