DEVELOPMENT OF A STEMMING ALGORITHM BASED ON A LINGUISTIC APPROACH FOR WORDS OF THE UZBEK LANGUAGE
Keywords:language, algorithms, lexicon, tokenization
The automatic processing of unstructured texts in natural languages is one of the relevant problems of computer analysis and text synthesis. Within this problem, the author singles out a task of text normalization, which usually suggests such processes as tokenization, stemming, and lemmatization. The existing stemming algorithms for the most part are oriented towards the synthetic languages with inflectional morphemes. The Uzbek language represents an example of agglutinative language, characterized by polysemanticity of affixal and auxiliary morphemes. Although the Uzbek language largely differs from, for example, English language, it is successfully processed by stemming algorithms. There are virtually no examples of effective implementation of stemming algorithms for the Uzbek language; therefore, this questions are the subject of scientific interest and defines the goal of this work. In the course of this research, the author solved the task of bringing the given texts in the Uzbek language to normal form, which on the preliminary stage were tokenized and cleared of stop words. To author developed the method of normalization of texts in the Uzbek language based on the stemming algorithm. The development of stemming algorithm employed hybrid approach with application of algorithmic method, the lexicon of linguistic rules and database of the normal word forms of the Uzbek language. The precision of the proposed algorithm depends on the precision of the tokenization algorithm. At the same time, the article did not explore the question of finding the roots of paired words separated by spaces, as this task is solved at the stage of tokenization. The algorithm can be integrated into various automated systems for machine translation, information extraction, data retrieval, etc.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.