Choosing the Optimal Segmentation Level for POS Tagging of the Quranic Arabic

Fadl Mutaher Ba-Alwi *

Faculty of Computer and Information Technology, Sana'a University, P.O.Box 1247, Yemen

Mohammed Albared

Faculty of Computer and Information Technology, Sana'a University, P.O.Box 1247, Yemen

Tareq Al-Moslmi

Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia

*Author to whom correspondence should be addressed.


Abstract

As a morphologically rich language, Arabic poses special challenges for Part-of-Speech (POS) tagging. Words in Arabic texts often contain several segments, each with its own POS category. The choice of segmentation level, or input unit, whether word-based or morpheme-based, is a major issue in designing any Arabic natural language processing system. In word-based approaches, words are used as the atomic units of the language, and composite POS tags are assigned to them. Large amounts of training data are therefore required to ensure statistical significance, and such approaches suffer from data sparseness and unknown words. In morpheme-based approaches, the morpheme components of words are used as the atomic units. This, however, results in a high ambiguity rate and a small context for resolving that ambiguity, because the span of the n-gram might be limited to a single word. This paper compares and contrasts the morpheme-based and word-based statistical POS tagging strategies. It evaluates the tagging performance of three statistical models, namely the Arabic HMM POS tagger with the prefix guessing models, the Arabic HMM POS tagger with the linear interpolation guessing models, and the TnT tagger, given training data from both the morpheme-based and word-based tokenization levels. It also studies the influence of each choice on the performance of these models in terms of tagging accuracy and time complexity. Results show that the morpheme-based POS tagging strategy is better suited to training statistical POS tagging models, as it provides better overall tagging accuracy and much faster training and tagging times.
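
The contrast between the two segmentation levels can be illustrated with a minimal sketch. The toy corpus, the transliterations (wa "and", bi "with", kitab "book", qalam "pen"), the tag names, and the helper functions below are all assumptions made for illustration; they are not the paper's data, tag set, or taggers. The sketch only counts tokens, distinct forms, and distinct tags at each level.

```python
# Toy corpus: each word is a list of (morpheme, tag) pairs.
# Transliterations and tag names are illustrative assumptions,
# not the data or tag set used in the paper.
corpus = [
    [("wa", "CONJ"), ("kitab", "NOUN")],
    [("bi", "PREP"), ("kitab", "NOUN")],
    [("wa", "CONJ"), ("bi", "PREP"), ("qalam", "NOUN")],
    [("qalam", "NOUN")],
]


def word_level(corpus):
    """One token per word, carrying a composite tag such as CONJ+NOUN."""
    return [
        ("".join(m for m, _ in word), "+".join(t for _, t in word))
        for word in corpus
    ]


def morpheme_level(corpus):
    """One token per morpheme, each carrying a simple tag."""
    return [pair for word in corpus for pair in word]


def summarize(tokens):
    """Count tokens, distinct surface forms, and distinct tags."""
    return {
        "tokens": len(tokens),
        "forms": len({form for form, _ in tokens}),
        "tags": len({tag for _, tag in tokens}),
    }


word_stats = summarize(word_level(corpus))
morph_stats = summarize(morpheme_level(corpus))
print("word-based:    ", word_stats)
print("morpheme-based:", morph_stats)
```

In this toy corpus every word-level form occurs only once and the composite tag set is larger, while at the morpheme level the same four forms are each seen twice with a smaller tag set. This mirrors, on a very small scale, the sparseness argument made in the abstract.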

 

Keywords: Arabic natural language processing, POS tagging, segmentation levels


How to Cite

Ba-Alwi, Fadl Mutaher, Mohammed Albared, and Tareq Al-Moslmi. 2017. “Choosing the Optimal Segmentation Level for POS Tagging of the Quranic Arabic”. Current Journal of Applied Science and Technology 19 (1):1-10. https://doi.org/10.9734/BJAST/2017/29754.
