Uzjournal

Abstract

The automatic identification of syntactic roles remains one of the most challenging tasks in Natural Language Processing (NLP) for low-resource, morphologically rich languages. This paper presents a hybrid algorithm and a software pipeline architecture designed to identify Objects automatically in Uzbek texts. The Object is a key syntactic component that indicates the entity toward which the predicate's action is directed, and detecting it correctly is critical for downstream tasks such as machine translation, information extraction, and question answering. The proposed solution is a three-stage pipeline: (1) customized tokenization tailored to Uzbek compound words and punctuation patterns, (2) transformer-based part-of-speech (POS) tagging that uses contextual embeddings to resolve morphological ambiguities, and (3) syntactic role extraction with a deterministic rule-based syntactic analyzer. To stabilize Object detection, the system includes a Predicate (verb) identification module as an auxiliary anchor: the Predicate is first identified using 6 formal rules, and Objects are then labeled using 7 dedicated rules that exploit case suffixes, postpositional constructions, and contextual conditions relative to the Predicate. Together, these 7 rules cover the major object-marking patterns in Uzbek, including the accusative case suffix (-ni), the dative, locative, and ablative suffixes (-ga, -da, -dan), postpositional constructions (bilan, haqida, uchun, etc.), substantivized forms, and pronominal objects.
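The three-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The transformer-based tagger is replaced by a hypothetical toy lexicon (`TOY_LEXICON`), the Predicate is assumed to be the last verb in the sentence, and only two illustrative rules are shown (case suffix and following postposition) rather than the paper's 6 Predicate rules and 7 Object rules. All function names are assumptions made for illustration.

```python
import re

# Suffixes and postpositions named in the abstract.
CASE_SUFFIXES = ("ni", "ga", "da", "dan")
POSTPOSITIONS = {"bilan", "haqida", "uchun"}

# Hypothetical stand-in for the transformer-based POS tagger.
TOY_LEXICON = {
    "men": "PRON", "u": "PRON",
    "kitobni": "NOUN", "do'sti": "NOUN",
    "o'qidim": "VERB", "gaplashdi": "VERB",
}


def tokenize(text):
    """Stage 1: split into words (keeping internal apostrophes) and punctuation."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*|[^\sA-Za-z]", text)


def tag(tokens):
    """Stage 2: assign a POS tag to each token using the toy lexicon."""
    tagged = []
    for tok in tokens:
        low = tok.lower()
        if low in POSTPOSITIONS:
            pos = "ADP"
        elif low in TOY_LEXICON:
            pos = TOY_LEXICON[low]
        elif not tok[0].isalpha():
            pos = "PUNCT"
        else:
            pos = "X"  # unknown word
        tagged.append((tok, pos))
    return tagged


def find_predicate(tagged):
    """Anchor step (simplifying assumption): index of the last verb, or None."""
    verbs = [i for i, (_, pos) in enumerate(tagged) if pos == "VERB"]
    return verbs[-1] if verbs else None


def label_objects(tagged):
    """Stage 3: label nouns/pronouns before the Predicate as Objects."""
    pred = find_predicate(tagged)
    if pred is None:
        return []
    objects = []
    for i in range(pred):
        word, pos = tagged[i]
        if pos not in ("NOUN", "PRON"):
            continue
        following = tagged[i + 1][0].lower() if i + 1 < pred else ""
        if word.lower().endswith(CASE_SUFFIXES):
            objects.append((word, "case"))  # naive string match, no morphology
        elif following in POSTPOSITIONS:
            objects.append((word, "postposition"))
    return objects


def analyze(text):
    """Run the three stages in order."""
    return label_objects(tag(tokenize(text)))


print(analyze("Men kitobni o'qidim."))
print(analyze("U do'sti bilan gaplashdi."))
```

The suffix test here is a plain string match, so it would mislabel a noun whose stem happens to end in one of the suffixes. A real analyzer would need morphological segmentation, which is one reason the pipeline places contextual POS tagging before the rule stage.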
