RuGECToR: rule-based neural network model for Russian language grammatical error correction

Abstract

Grammatical error correction (GEC) is one of the core natural language processing tasks. At present, the state-of-the-art open-source sequence tagger for English is the GECToR model. For Russian, no existing solution achieves comparably good results because labeled datasets are scarce, so we set out to contribute to this task. In this work, we describe the process of creating a synthetic dataset and training a model on it. We adapt the GECToR architecture to the Russian language and call the result RuGECToR. We chose this architecture because, unlike the sequence-to-sequence approach, it is easy to interpret and does not require a large amount of training data. Our goal was to train the model so that it does not fit a specific sample but instead generalizes the morphological properties of the language. The presented model achieves an F0.5 of 82.5 on synthetic data and an F0.5 of 22.2 on the RULEC dataset, which was not part of the training set.
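
For readers unfamiliar with the metric, F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The following is a minimal Python sketch of that formula. The function names and the counts in the usage lines are illustrative assumptions, not values from the paper, and the authors' evaluation tooling may differ in how it matches proposed edits against reference edits.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Return the F-beta score for the given precision and recall.

    With beta < 1 (for example 0.5), precision counts more than recall.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        # No correct edits at all: define the score as zero.
        return 0.0
    return (1.0 + b2) * precision * recall / denominator


def f_beta_from_counts(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Compute F-beta from edit counts.

    tp: proposed edits that match a reference edit
    fp: proposed edits that match no reference edit
    fn: reference edits the system failed to propose
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return f_beta(precision, recall, beta)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the call.
    score = f_beta_from_counts(tp=8, fp=2, fn=8)
    print(f"F0.5 = {100 * score:.1f}")
```

Scores such as those quoted in the abstract are conventionally reported on a 0 to 100 scale, which is why the usage line multiplies by 100.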

About the authors

I. Khabutdinov

Moscow Institute of Physics and Technology (National Research University); Antiplagiat Company

Corresponding author
Email: khabutdinov@ap-team.ru
Russia, Institutskiy per. 9, Dolgoprudny, Moscow Region, 141701; Varshavskoe highway 33, Moscow, 117105

A. Chashchin

Antiplagiat Company

Email: chashchin@ap-team.ru
Russia, Varshavskoe highway 33, Moscow, 117105

A. Grabovoy

Moscow Institute of Physics and Technology (National Research University); Antiplagiat Company

Email: grabovoy@ap-team.ru
Russia, Institutskiy per. 9, Dolgoprudny, Moscow Region, 141701; Varshavskoe highway 33, Moscow, 117105

A. Kildyakov

Antiplagiat Company

Email: kildyakov@ap-team.ru
Russia, Varshavskoe highway 33, Moscow, 117105

Y. Chekhovich

Antiplagiat Company; Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences

Email: chehovich@ap-team.ru
Russia, Varshavskoe highway 33, Moscow, 117105; Vavilova st. 44-2, Moscow, 119333

Copyright © Russian Academy of Sciences, 2024