Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence


Makara Mao, Sony Peng, Yixuan Yang, and Doo-Soon Park, Journal of Information Processing Systems Vol. 18, No. 4, pp. 549-561, Aug. 2022  

10.3745/JIPS.04.0250
Keywords: Bi-directional Maximal Matching, Khmer Language, Natural Language Processing, Word Corpus, Word Segmentation

Abstract

Khmer script, the official script of Cambodia, is written from left to right without spaces between words, which makes it complicated to analyze and calls for further study. Because there are no clear standard guidelines, spaces in Khmer text are used inconsistently and informally to separate words in sentences. A segmentation method is therefore needed, in combination with future Khmer natural language processing (NLP), to define appropriate rules for Khmer sentences; the capacity of NLP for large-scale language analysis makes it well suited to this scenario. One of the essential tasks in Khmer language processing is splitting a sentence into a series of words and counting the words it contains. Currently, Microsoft Word cannot count Khmer words correctly. This study presents a library that segments Khmer phrases using the bi-directional maximal matching (BiMM) method to address these constraints. The BiMM algorithm combines forward maximal matching (FMM) and backward maximal matching (BMM) to improve word segmentation accuracy. A digital tree or prefix tree, also known as a trie, supports the segmentation procedure by finding the children of each word's parent node. The accuracy of BiMM is higher than that of FMM or BMM used independently; moreover, the proposed approach improves the dictionary structure and reduces the number of errors. Using 94,807 Khmer words, the approach reduces the error by 8.57% compared with the FMM and BMM algorithms.
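The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' library: the names `Trie`, `forward_mm`, `backward_mm`, and `bimm` are hypothetical, the toy Latin-letter dictionary stands in for a Khmer word list, and the tie-breaking rule (fewer words, then fewer single-character words, otherwise the backward result) is a common BiMM heuristic assumed here because the abstract does not state the paper's exact rule.

```python
class Trie:
    """Prefix tree over dictionary words (nested dicts, one level per character)."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True  # end-of-word marker; "" can never be a character key

    def longest_match(self, text, start):
        """Return the length of the longest word starting at text[start], or 0."""
        node, best = self.root, 0
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "" in node:
                best = i - start + 1
        return best


def forward_mm(text, trie):
    """Forward maximal matching: greedily take the longest word from the left."""
    words, i = [], 0
    while i < len(text):
        n = trie.longest_match(text, i) or 1  # assumption: unknown char is its own token
        words.append(text[i:i + n])
        i += n
    return words


def backward_mm(text, reversed_trie):
    """Backward maximal matching, done as forward matching on the reversed text
    against a trie built from reversed dictionary words."""
    rev = forward_mm(text[::-1], reversed_trie)
    return [w[::-1] for w in reversed(rev)]


def bimm(text, dictionary):
    """Run both directions and keep the segmentation with the lower cost."""
    words = list(dictionary)
    fwd = forward_mm(text, Trie(words))
    bwd = backward_mm(text, Trie(w[::-1] for w in words))
    if fwd == bwd:
        return fwd

    def cost(seg):
        # Assumed heuristic: fewer words first, then fewer single-character words.
        return (len(seg), sum(len(w) == 1 for w in seg))

    return fwd if cost(fwd) < cost(bwd) else bwd  # ties go to the backward result


toy_dictionary = ["ab", "abc", "cd", "d"]
print(forward_mm("abcd", Trie(toy_dictionary)))  # ['abc', 'd']
print(bimm("abcd", toy_dictionary))              # ['ab', 'cd']
```

In this toy case the forward pass greedily takes "abc" and leaves a stray single character, while the backward pass finds two full words, so the combined result prefers the backward segmentation. Real Khmer input would also need care with combining marks and subscript consonants, which this sketch treats as ordinary characters.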


Cite this article
[APA Style]
Makara Mao, Sony Peng, Yixuan Yang, & Doo-Soon Park (2022). Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence. Journal of Information Processing Systems, 18(4), 549-561. DOI: 10.3745/JIPS.04.0250.

[IEEE Style]
M. Mao, S. Peng, Y. Yang, and D. Park, "Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence," Journal of Information Processing Systems, vol. 18, no. 4, pp. 549-561, 2022. DOI: 10.3745/JIPS.04.0250.

[ACM Style]
Makara Mao, Sony Peng, Yixuan Yang, and Doo-Soon Park. 2022. Bi-directional Maximal Matching Algorithm to Segment Khmer Words in Sentence. Journal of Information Processing Systems, 18, 4, (2022), 549-561. DOI: 10.3745/JIPS.04.0250.