Part-of-Speech Tagger for the Malay Language

By Dr. Rosyzie Anna Apong


Part-of-Speech (POS) tagging is a step in the Natural Language Processing (NLP) pipeline that assigns a grammatical category to each word; for example, ‘computer’ is labelled with the tag ‘noun’. It is a basic task, but a necessary one, because its output is a vital input for other NLP applications such as Named-Entity Recognition (NER) and Machine Translation.
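To make the input and output of this step concrete, the following is a minimal sketch of a lookup-based tagger. It is not the tagger proposed in this work: the function `tag_sentence`, the tiny lexicon, and the tag names (`PRON`, `VERB`, `NOUN`, `ADJ`, `UNK`) are all assumptions chosen for illustration.

```python
# Toy lexicon mapping lowercase words to tags. Both the entries and the
# tag names are illustrative assumptions, not the tagset used in this work.
TOY_LEXICON = {
    "saya": "PRON",      # "I"
    "membeli": "VERB",   # "buy"
    "komputer": "NOUN",  # "computer"
    "baharu": "ADJ",     # "new"
}


def tag_sentence(sentence, lexicon=TOY_LEXICON, unknown_tag="UNK"):
    """Return (word, tag) pairs for a whitespace-tokenized sentence."""
    return [
        (word, lexicon.get(word.lower(), unknown_tag))
        for word in sentence.split()
    ]


tagged = tag_sentence("Saya membeli komputer baharu")
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

A pure lookup like this cannot resolve words whose tag depends on context, which is one reason statistical and neural taggers are used in practice.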

Existing POS taggers are mostly available for resource-rich languages such as English. Linguistic resources for the Malay language are still lacking, so less research has been done on tasks such as POS tagging. Although Indonesian POS taggers have been developed, lexical differences still exist between Malay (as used in Brunei, Malaysia, and Singapore) and Indonesian Malay. This leads to inaccurate tagging when these taggers are applied to Malay. Previous studies have successfully used traditional methods such as Hidden Markov Models (HMM), rule-based approaches, and Maximum Entropy (MaxEnt), reaching 67%-95% accuracy on Malay.

Deep neural networks (DNN) have been gaining popularity owing to recent advances in their ability to learn and recognize complex patterns. Various areas have incorporated deep learning, including sequence labeling and speech recognition. Some researchers have developed POS taggers for languages such as Marathi (an Indian language) and Chinese, reaching an average accuracy of 97%. Therefore, this work proposes to develop a Malay POS tagger that can tag Malay words accurately. However, because linguistic resources for Malay are insufficient, the corpus is built manually by extracting words from Malay texts such as news articles.
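As a rough illustration of the word-extraction step, the sketch below pulls word tokens out of a piece of text and prints them as a sheet with an empty tag column for a human annotator. It is only one possible approach, not the procedure used in this work: the sample text, the function `extract_words`, and the tab-separated layout are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical article text, made up for illustration only.
article = "Saya membeli komputer baharu. Komputer baharu itu mahal."


def extract_words(text):
    """Lowercase the text and return alphabetic tokens.

    Hyphens between letters are kept so that reduplicated forms
    such as "buku-buku" stay together as one token.
    """
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


word_counts = Counter(extract_words(article))

# One line per unique word: word, frequency, and an empty slot
# where an annotator would write the POS tag.
for word, count in sorted(word_counts.items()):
    print(f"{word}\t{count}\t")
```

A word list like this discards sentence context, so an annotation scheme for context-dependent tags would need to keep each word inside its original sentence.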