PhilHealth Fraud Blue Paper

Jun 2021 | Research Papers


Real-world databases now store huge amounts of data, and that volume continues to grow quickly. Traditional methods of detecting health care fraud and abuse are time-consuming and inefficient. Statistical data mining methods, by contrast, can exploit big data to identify the structures (variables) with appropriate predictive power, yielding reliable and robust large-scale statistical models and analyses [5].

The net effect of fraudulent claims is excessive billing amounts, higher per-patient costs, an excessive number of patients per doctor, more tests per patient, and so on. These excesses can be identified using special analytical tools.

Provider statistics include: total number of patients, total amount billed, total number of patient visits, per-patient average number of visits, per-patient average billing amount, per-patient average medical test cost, per-patient average number of medical tests, per-patient average prescription ratios (of specially monitored drugs), and many more [7].
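As an illustration of how such provider statistics might be derived from raw claims, the following is a minimal sketch in Python. The record layout (the field names `provider`, `patient`, `visit`, `billed`, `tests`) and the sample values are assumptions made for this example only; they are not the actual PhilHealth schema or data.

```python
from collections import defaultdict

# Hypothetical claim records; field names and values are illustrative only.
claims = [
    {"provider": "D1", "patient": "P1", "visit": "V1", "billed": 1000.0, "tests": 2},
    {"provider": "D1", "patient": "P1", "visit": "V2", "billed": 500.0, "tests": 1},
    {"provider": "D1", "patient": "P2", "visit": "V3", "billed": 1500.0, "tests": 0},
    {"provider": "D2", "patient": "P3", "visit": "V4", "billed": 9000.0, "tests": 6},
]


def provider_statistics(claims):
    """Aggregate per-provider totals and per-patient averages."""
    acc = defaultdict(
        lambda: {"patients": set(), "visits": set(), "billed": 0.0, "tests": 0}
    )
    for c in claims:
        a = acc[c["provider"]]
        a["patients"].add(c["patient"])  # distinct patients seen
        a["visits"].add(c["visit"])      # distinct visits billed
        a["billed"] += c["billed"]
        a["tests"] += c["tests"]

    stats = {}
    for provider, a in acc.items():
        n_patients = len(a["patients"])
        n_visits = len(a["visits"])
        stats[provider] = {
            "total_patients": n_patients,
            "total_billed": a["billed"],
            "total_visits": n_visits,
            "visits_per_patient": n_visits / n_patients,
            "billed_per_patient": a["billed"] / n_patients,
            "tests_per_patient": a["tests"] / n_patients,
        }
    return stats


stats = provider_statistics(claims)
for provider, row in stats.items():
    print(provider, row)
```

Providers whose per-patient averages sit far from those of their peers would then be candidates for closer review; how far is "far" is a modelling decision this sketch does not address.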

Several approaches to this problem are possible. Since the structure of the dataset is not yet known, various methods will be proposed and deployed for exploration, with the aim of finding those best suited to PhilHealth's needs.

Related Literature

Industry and Academic Benchmarks


Anomaly and fraud detection has long been a problem for PhilHealth: fraud robs the agency of billions of pesos in much-needed funds and ultimately harms its long-term sustainability. The main objective of this proposal is to identify fraud and fraud indicators in the PhilHealth claims dataset in a way that successfully distinguishes between fraudulent and non-fraudulent claims. Since there is no apparent structure in the dataset, different methods will be applied.


Supervised learning is the most common learning technique: the model is trained using pre-defined class labels. In the context of health insurance fraud detection, the class labels may be “legitimate” and “fraudulent” claims. Unsupervised learning has no class labels; it finds “natural” groupings of instances in unlabeled data and focuses on the instances that show unusual behavior. Unsupervised techniques can therefore discover both old and new types of fraud, since they are not restricted to fraud patterns that already have pre-defined class labels, as supervised techniques are. Compared to supervised health care fraud detection methods, which have centered on MLP neural networks, unsupervised methods vary widely, ranging from self-organizing maps, association rules, and clustering to rule-based unsupervised methods.

A major drawback of supervised techniques is that they cannot classify claims for an unknown disease, while unsupervised techniques cannot detect outliers when duplicate claims (i.e., claims filed with different dates) are submitted. We therefore propose a hybrid method for detecting health insurance fraud [15]. We choose the Evolving Clustering Method (ECM) for clustering, because the data is dynamic and new data is generated continuously, and a Support Vector Machine (SVM) for classification. In this approach, the claims are first clustered according to disease type and then classified to detect any duplicate claims. ECM and SVM are explained in the following sections.

Evolving Clustering Method (ECM)

ECM is used to cluster dynamic data, that is, data that keeps changing over time. As each new data point arrives, ECM clusters it by modifying the position and size of a cluster. Each cluster has an associated parameter, the radius, that determines its boundary. Initially, the cluster radius is set to zero, and it increases as more data points are added to the cluster. A second parameter, the distance threshold, determines when new clusters are added [17]. If the threshold value is small, there will be a larger number of small clusters; if it is large, there will be a smaller number of large clusters. The choice of threshold depends on heuristics about the data points. Figure 2 shows the flowchart of ECM.
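The behaviour described above can be sketched in a few lines of Python. This paper does not spell out the update rules, so the sketch follows the commonly cited formulation of ECM: a point already inside a cluster changes nothing; otherwise the cluster minimising distance plus radius is enlarged, unless that sum exceeds twice the distance threshold, in which case a new cluster is created. These details, the function name `ecm`, and the sample points are assumptions to be checked against [17].

```python
import math


def ecm(points, dthr):
    """One-pass evolving clustering; returns a list of {"center", "radius"} dicts.

    dthr is the distance threshold: a cluster's radius never grows beyond it.
    """
    clusters = []
    for x in points:
        if not clusters:
            # The first point becomes the first cluster, with radius zero.
            clusters.append({"center": list(x), "radius": 0.0})
            continue

        dists = [math.dist(x, c["center"]) for c in clusters]

        # The point already lies inside some cluster: nothing changes.
        if any(d <= c["radius"] for d, c in zip(dists, clusters)):
            continue

        # Otherwise pick the cluster that minimises distance + radius.
        s_vals = [d + c["radius"] for d, c in zip(dists, clusters)]
        a = min(range(len(clusters)), key=s_vals.__getitem__)

        if s_vals[a] > 2 * dthr:
            # Too far from every cluster: start a new one at this point.
            clusters.append({"center": list(x), "radius": 0.0})
        else:
            # Enlarge cluster a; the new center lies on the line between x
            # and the old center, at distance new_radius from x.
            new_radius = s_vals[a] / 2
            t = new_radius / dists[a]
            old = clusters[a]["center"]
            clusters[a]["center"] = [xi + t * (oi - xi) for xi, oi in zip(x, old)]
            clusters[a]["radius"] = new_radius
    return clusters


# Hypothetical 2-D points standing in for numeric claim features.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.5, 0.0)]
coarse = ecm(points, dthr=1.0)
fine = ecm(points, dthr=0.1)
print(len(coarse), len(fine))
```

Running the same points with a smaller threshold yields more clusters, each with a smaller radius, which matches the trade-off described in the text.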

Support Vector Machine (SVM)

A support vector machine is a supervised learning technique used for classification. It has an initial training phase in which data that has already been classified is fed to the algorithm. After the training phase is finished, the SVM can predict which class new incoming data will fall into [16].

SVM Steps:

  1. Training (Preprocessing Step):
    • Define two class labels, viz. “legitimate” and “fraudulent”.
    • Classify claims into two classes using the training data set.
    • Choose support vectors and find the marginal hyperplane that separates the claims into two classes.
  2. Classification:
    • Assign each new incoming claim to either the “legitimate” or the “fraudulent” class.
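The two steps can be illustrated with a small, dependency-free sketch. It trains a linear soft-margin SVM by subgradient descent on the regularised hinge loss, which is a simplification chosen to keep the example self-contained; in practice an established SVM library would normally be used, and a kernel may be needed if the classes are not linearly separable. The two features, their scaling to the range 0 to 1, the hyperparameters, and all sample values are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=1000):
    """Fit (w, b) by subgradient descent on the L2-regularised hinge loss.

    X is a list of feature vectors; y holds labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regulariser shrinks w slightly on every step.
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1.0:
                # Point is misclassified or inside the margin: move the
                # hyperplane away from it.
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def predict(w, b, x):
    """Assign a claim to a class from the sign of the decision function."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return "fraudulent" if score >= 0 else "legitimate"


# Step 1: training on labelled claims (-1 = legitimate, +1 = fraudulent).
# Hypothetical features: (scaled billed amount, scaled tests per visit).
X_train = [
    (0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2),  # legitimate
    (0.8, 0.8), (0.9, 0.7), (0.8, 0.9), (0.9, 0.9),  # fraudulent
]
y_train = [-1, -1, -1, -1, 1, 1, 1, 1]
w, b = train_linear_svm(X_train, y_train)

# Step 2: classification of new incoming claims.
print(predict(w, b, (0.15, 0.15)))
print(predict(w, b, (0.85, 0.85)))
```

The training points closest to the separating hyperplane play the role of the support vectors mentioned in step 1; the sketch does not extract them explicitly.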

Considering ECM and SVM, Figure 3 shows the block diagram for the hybrid model of fraud detection.


Steps in Hybrid Model Construction:

  • The doctor bills patients for the services/equipment given to them during their treatment.
  • The patient files claims with PhilHealth.
  • Claims are submitted to the Hybrid Framework, wherein clustering (ECM) is followed by classification (SVM) to detect fraudulent claims.
  • An expert flags the fraudulent claims for further investigation with PhilHealth.
  • The legitimate claims are passed on to PhilHealth, and those claims are paid to the patients.


Pseudo Code for the Hybrid Approach:

  • For each incoming PhilHealth claim, apply ECM to form clusters according to the disease type.
  • Apply SVM to each of the clusters to classify those claims into “legitimate” and “fraudulent” classes.
  • Go back to the clustering step to cluster new claims, and repeat.
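The control flow of these steps might look like the following Python sketch. Only the ordering (cluster first, then classify within each cluster) is taken from the pseudo code; the clustering and classification steps are passed in as functions, and the stand-ins used here are deliberately trivial. They are not ECM or SVM, and the disease names, amounts, and the threshold rule are hypothetical.

```python
from collections import defaultdict


def hybrid_detect(claims, assign_cluster, classify):
    """Cluster each claim, then classify it within its cluster.

    assign_cluster(claim) -> cluster id            (the ECM step in the proposal)
    classify(cluster_id, claim) -> class label     (the per-cluster SVM step)
    """
    clusters = defaultdict(list)
    for claim in claims:                          # step 1: clustering
        clusters[assign_cluster(claim)].append(claim)

    flagged, cleared = [], []
    for cluster_id, members in clusters.items():  # step 2: classification
        for claim in members:
            if classify(cluster_id, claim) == "fraudulent":
                flagged.append(claim)             # sent to an expert for review
            else:
                cleared.append(claim)             # passed on for payment
    return flagged, cleared


# Trivial stand-ins so the sketch runs end to end (NOT ECM or SVM).
def by_disease(claim):
    return claim["disease"]


typical_amount = {"dengue": 1000.0, "pneumonia": 3000.0}  # hypothetical values


def amount_rule(cluster_id, claim):
    limit = 3 * typical_amount[cluster_id]
    return "fraudulent" if claim["amount"] > limit else "legitimate"


batch = [
    {"id": 1, "disease": "dengue", "amount": 900.0},
    {"id": 2, "disease": "dengue", "amount": 5000.0},
    {"id": 3, "disease": "pneumonia", "amount": 3500.0},
]
flagged, cleared = hybrid_detect(batch, by_disease, amount_rule)
print([c["id"] for c in flagged], [c["id"] for c in cleared])
```

Each new batch of claims can be passed through the same function, which corresponds to the "repeat" step; with a real ECM implementation the cluster state would be carried over between batches.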


Figure 3 illustrates a Natural Language Processing (NLP) approach for fraud detection. Claim documents are taken as input. To detect fraudulent claims/reports, documents of both classes (fraudulent and non-fraudulent) need to be available.

Using NLP techniques, information about the occurrences of each word in the documents will be retained and used as features for training a classifier. The model from this step represents each document as a vector of counts of the words that appear in it. The vector associated with each document is compared with the typical vector associated with a given class (fraud or non-fraud).
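A minimal sketch of this word-count representation follows. The whitespace tokenizer, the use of the per-class mean vector as the "typical" vector, and cosine similarity as the comparison are all assumptions made for illustration; the sample documents are invented.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocabulary(docs):
    """Sorted list of every distinct word seen in the documents."""
    return sorted({tok for d in docs for tok in tokenize(d)})


def count_vector(doc, vocab):
    """Represent a document as word counts aligned with the vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]


def mean_vector(vectors):
    """Component-wise mean, used here as the 'typical' vector of a class."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented example documents for each class.
fraud_docs = ["duplicate billing duplicate invoice", "duplicate claim billing"]
legit_docs = ["routine checkup visit", "routine visit vaccination"]

vocab = build_vocabulary(fraud_docs + legit_docs)
typical = {
    "fraud": mean_vector([count_vector(d, vocab) for d in fraud_docs]),
    "non-fraud": mean_vector([count_vector(d, vocab) for d in legit_docs]),
}

new_vec = count_vector("duplicate billing claim", vocab)
closest = max(typical, key=lambda label: cosine(new_vec, typical[label]))
print(closest)
```

The same count vectors can serve directly as the feature vectors passed to a classifier.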

The generated document vectors will be used in an SVM (Support Vector Machine) to classify documents as fraud or non-fraud. Given a set of training examples, each marked as belonging to one of two categories, the SVM training algorithm builds a model that assigns new examples to one category or the other. The quality of the classification will be evaluated using measures such as accuracy, precision, recall (sensitivity in binary classification), F-measure, and purity.
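For reference, the evaluation measures named above can be computed as in the sketch below. It uses the standard definitions, takes "fraudulent" as the positive class, and computes the F-measure as the balanced F1 score; purity is computed by treating each predicted class as a group and counting its majority true label. The label lists are invented for illustration.

```python
from collections import Counter, defaultdict


def binary_metrics(y_true, y_pred, positive="fraudulent"):
    """Accuracy, precision, recall and F1 for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def purity(y_true, groups):
    """Fraction of items that carry the majority true label of their group."""
    members = defaultdict(list)
    for t, g in zip(y_true, groups):
        members[g].append(t)
    majority = sum(Counter(ts).most_common(1)[0][1] for ts in members.values())
    return majority / len(y_true)


# Invented labels: 3 fraudulent and 5 legitimate claims.
F, L = "fraudulent", "legitimate"
y_true = [F, F, F, L, L, L, L, L]
y_pred = [F, F, L, F, L, L, L, L]

metrics = binary_metrics(y_true, y_pred)
print(metrics, purity(y_true, y_pred))
```

Because fraudulent claims are usually a small minority, accuracy alone can look high even when most fraud is missed, so precision and recall on the fraudulent class are worth reporting alongside it.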