Authorship verification of opinion pieces in Estonian

Timo Petmanson


Authorship verification is an important subproblem of authorship attribution and plagiarism detection. We present a novel approach for extracting stylistic features unique to individual authors, using the correlations between important textual features to learn an author's style. The method aims to answer the following question: given a set of documents known to be written by the same person and an unknown document, was the unknown document also written by that person? We present the first study of this problem conducted on opinion pieces written in Estonian. Our method achieves 74% precision, which is comparable with current state-of-the-art systems tested on other languages; recall, however, still needs improvement.



Keywords: natural language processing; text analysis; linguistic expertise; machine learning; pattern mining; feature correlations; Estonian








Copyright (c) 2014 Timo Petmanson

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

ISSN 1736-2563 (print)
ISSN 2228-0677 (online)
DOI 10.5128/ERYa.1736-2563