Filipino Corpus & Natural Language Processing

Jun 2021 | Research Papers

Introduction

Natural Language Processing (NLP), the field of applying machine learning to data represented in human languages, is one of the most well-developed areas of machine learning and a core competency of Neural Mechanics (NM). However, most learning algorithms in the field have so far been developed on corpora in English or other widely used languages such as Spanish or Arabic. Most of NM's clientele for NLP methods want to analyze Philippine social media data, a large majority of which is written in Filipino. This presents an opportunity for NM to develop new methods in this area, improving internal and academic NLP knowledge and boosting the analytical capabilities we offer our clients.

Project Overview

The overall project can be divided into two parts:

Building Filipino Language Corpora
The first step will be to build a large internal data set of Filipino language tweets. Although this has been done for specific projects before, the data already collected can now be consolidated for R&D use.

Testing NLP on Filipino Language
The second step will be to test and design various NLP methods (in particular, stemmers and lemmatizers) on the collected data. This will include testing existing English-language or language-agnostic methods, as well as designing methods that can process mixed-language text.

Project Deliverables

The project aims to:

Collect a large data set of Filipino language documents.
Design a set of NLP methods that can analyze these documents.
Present possible business applications of the analyzed Filipino corpus data.

Related Work

In [1], a linguistics-based Tagalog stemmer using iterative affix removal was presented, but the study's evaluation did not compare its performance against any other stemmer in any language. There currently seems to be no available lemmatizer for Tagalog, and [2] in particular recommends developing one to improve document classification performance.

A multilingual stemmer based on a statistical approach was presented in [3], and a data-driven multilingual lemmatizer was built in [4] using the Helsinki Finite-State Technology (HFST) library and Wiktionary entries as inputs. The lemmatizer performs better than the lexicon-based TreeTagger, but only when HFST supports the language. This might be a useful approach should we be able to find or build HFST support for Tagalog, but a ground-up lexicon-based lemmatizer might also be more effective given our focus specifically on English and Tagalog.

Project Plan

Natural Language Processing (NLP) methods are among the top techniques in modern data analytics and a backbone of Neural Mechanics' capabilities. However, most methods so far have been developed only for English or other commonly spoken languages. This necessitates developing Filipino language techniques to improve NM's analytics capabilities within its business and societal context.

Building Filipino Language Corpora
To help facilitate the development of Filipino language NLP methods, Filipino language data must be collected and organized into corpora to serve as an input to learning algorithms. Collecting these corpora will occur in two phases, described below:

Acquiring Existing Corpora
The first step will be to look for pre-existing Filipino corpora used in previous academic studies. Such studies will not only have collected documents in the Filipino language, but may also have labeled them for sentiment and other variables relevant to the original study.
These corpora are best suited for fast testing of the existing NLP methods already in use at NM, as the data will already be labeled and the data sets should be small enough for easy ETL. At this stage, the existing algorithms can also be modified to improve their Filipino language performance.
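As a rough sketch of what this fast-testing phase could look like, the snippet below runs a simple sentiment baseline over a hypothetical pre-labeled corpus using off-the-shelf scikit-learn components. The file name and column names are placeholders, not an actual NM data set:

```python
# Quick sentiment baseline over a pre-labeled Filipino corpus.
# "existing_filipino_corpus.csv" and its columns are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed columns: "text" (the document) and "sentiment" (the study's label).
df = pd.read_csv("existing_filipino_corpus.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"], test_size=0.2, random_state=42
)

# Bag-of-ngrams features feeding a linear classifier: a deliberately
# simple pipeline whose score serves as a sanity-check baseline.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)
print("baseline accuracy:", accuracy_score(y_test, clf.predict(vec.transform(X_test))))
```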

Assembling Internal Corpus of Scraped Data
Although existing data is useful for initial algorithm development and testing, most such data sets used in previous studies will not be at the scale of "big data", which is necessary to develop serious machine learning algorithms. The second step will be to assemble an internal corpus by scraping Philippine-based social media accounts on Facebook, Twitter, and Instagram. A large set of data has already been collected for other projects, so it will be sufficient to reorganize the existing data into new corpora, scraping more only when necessary for this or another project.
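A minimal sketch of this consolidation step, assuming the per-project exports sit in a directory of CSV files sharing a `text` column (both details are hypothetical, not the actual NM storage layout):

```python
# Merge per-project scrape exports into a single deduplicated corpus.
from pathlib import Path

import pandas as pd

def build_corpus(export_dir: str) -> pd.DataFrame:
    """Concatenate every CSV export in export_dir into one corpus."""
    frames = [pd.read_csv(path) for path in Path(export_dir).glob("*.csv")]
    corpus = pd.concat(frames, ignore_index=True)
    # The same post often appears in several project exports; keep one copy.
    return corpus.drop_duplicates(subset="text").reset_index(drop=True)

corpus = build_corpus("scraped_exports/")
print(f"{len(corpus)} unique documents")
```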

The collected data will mainly be used as unlabeled data to help develop Filipino NLP tools. However, should other projects require labeling this data, those labels can also serve R&D of additional supervised learning techniques. Should it prove useful or necessary, crowdsourcing can quickly label large amounts of data.

Testing NLP on Filipino Language
With the corpora collected above, we can then proceed to develop Filipino language NLP as follows:

Testing English-validated Methods
The first step will be to directly apply the existing English language methods already in use to tag the acquired Filipino corpora. This will include both English language methods and language-agnostic NLP methods. We expect that English language methods, even stemmers and lemmatizers, will perform only slightly worse, given that a significant chunk of Filipino social media is written with English grammar and vocabulary. The performance of these models will be recorded and used both as a comparison between Filipino and English corpora and as a benchmark for future development. We plan to tweak the parameters of these models to optimize their performance before working on Filipino-specific models, as the additional benefit of bespoke models over tweaked ones may be small enough to justify reduced development effort.
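To make the expected degradation concrete, here is a small sketch that runs NLTK's off-the-shelf English Snowball stemmer over invented Taglish examples (the tweets are illustrative, not drawn from any actual corpus):

```python
# Apply an English stemmer to Taglish text to see what carries over.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

tweets = [
    "grabe the traffic today kasi everyone was driving",
    "nagluto ako of adobo and it was satisfying talaga",
]

for tweet in tweets:
    print([stemmer.stem(tok) for tok in tweet.split()])
# English affixes are handled ("driving" -> "drive", "satisfying" ->
# "satisfi"), but Tagalog affixes ("nagluto" -> "luto") pass through
# untouched -- the gap the recorded benchmarks should quantify.
```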

Building Tagalog NLP Methods
The second step will be to build Tagalog NLP methods. A subset of the collected corpus that is purely in Tagalog (as opposed to mixed Filipino) will be extracted to develop various parsing methods.

The project will focus on two particular NLP tools that simplify raw text into forms that can be tokenized and used as input for token-based NLP methods: stemmers and lemmatizers. Both are similar in that they attempt to reduce words in the text to their root forms, making the overall semantics of the text clearer to ML algorithms. The difference is that stemmers do this by finding and deleting common prefixes and suffixes of the language from each word, while lemmatizers do this by conducting a full morphological analysis of each word to extract its lemma, or root word. Since constructing a lemmatizer may require more advanced linguistic knowledge than a data scientist may have, partnering with a computational linguistics researcher may be essential in building these tools.
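As a sketch of the affix-removal idea, the toy stemmer below iteratively strips a handful of common Tagalog affixes, loosely in the spirit of the iterative approach in [1]. The affix lists are a tiny illustrative subset; real Tagalog morphology (infixes such as -um- and -in-, reduplication, vowel alternation) needs far more careful handling:

```python
# Toy iterative affix-removal stemmer for Tagalog (illustrative only).
PREFIXES = ("nakaka", "naka", "pina", "nag", "mag")
SUFFIXES = ("han", "hin", "an", "in")

def stem_tagalog(word: str) -> str:
    word = word.lower()
    changed = True
    while changed:  # keep stripping until no affix matches
        changed = False
        for prefix in PREFIXES:
            if word.startswith(prefix) and len(word) - len(prefix) >= 3:
                word, changed = word[len(prefix):], True
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word, changed = word[:-len(suffix)], True
    return word

print(stem_tagalog("nagluto"))     # -> "luto"
print(stem_tagalog("nakakatawa"))  # -> "tawa"
print(stem_tagalog("lutuin"))      # -> "lutu": the o/u vowel change is
                                   # what a lemmatizer's full morphological
                                   # analysis would recover
```

The last example shows why the lemmatizer is worth the extra linguistic effort: affix stripping alone cannot map "lutuin" back to its lemma "luto".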

Building Filipino NLP Methods
The third and final step will be to build methods that can process mixed-language documents, reflecting the diversity of the Filipino language. This is the main goal of the overall project, as virtually all Philippine-based documents on social media are actually written in Taglish; Taglish and other Philippine languages shape the modern Filipino language.

An issue with the Filipino language as used on social media is that it is almost entirely mixed-language: Filipinos frequently use English grammar with Tagalog vocabulary, Tagalog grammar with English vocabulary, or even loan words from Spanish and other Philippine languages. This will require developing stemmers / lemmatizers that can handle a wide variety of language mixtures.
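One plausible design, sketched below under loose assumptions, is per-token dispatch: guess each token's language and route it to the matching monolingual stemmer. Membership in NLTK's English word list stands in for a real language-identification model, and the Tagalog branch is a cut-down stand-in for the stemmer developed in the previous step:

```python
# Per-token language dispatch for Taglish (hypothetical design sketch).
from nltk.corpus import words as en_words  # requires nltk.download("words")
from nltk.stem.snowball import SnowballStemmer

english_stemmer = SnowballStemmer("english")
EN_VOCAB = {w.lower() for w in en_words.words()}

def stem_tagalog(word: str) -> str:
    # Cut-down stand-in for the Tagalog affix-removal stemmer above.
    for prefix in ("nakaka", "nag", "mag"):
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            return word[len(prefix):]
    return word

def stem_mixed(token: str) -> str:
    tok = token.lower()
    # Tokens recognized as English go to the English stemmer;
    # everything else falls through to the Tagalog rules.
    return english_stemmer.stem(tok) if tok in EN_VOCAB else stem_tagalog(tok)

print([stem_mixed(t) for t in "nagluto ako ng dinner kasi walang delivery".split()])
```

Dispatching per token would let the mixed-language tool reuse whatever monolingual stemmers / lemmatizers come out of the earlier steps, which is one reason this step is scheduled last.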

Note that this mixed-language stemmer / lemmatizer will be developed after the Tagalog-specific methods, since a likely methodology will build on what we learn from developing the English and Tagalog stemmers / lemmatizers. Furthermore, this approach opens up the possibility of developing mixed-language methods for more languages.