A single customer view (SCV) is an aggregation of all the data a business has on its customers. Through an SCV, a business can get an overview of every action performed by its customers. It allows for better customer experience, more efficient internal processes and improved decision-making.
However, creating an SCV is very challenging. A business usually maintains several databases used by different departments, which leads to duplicated records and inconsistent data entry methods. The bigger the business, the more complicated it is to consolidate its data. In addition, databases contain noise due to human error, such as misspellings, inconsistent names and missing values in some columns.
Applying the traditional rules-based approach to data cleaning and aggregation discards the noisy data and ultimately produces an SCV from only a small portion of the data. In this POC, we aim to create an SCV using a machine learning approach that does not discard noisy data.
Related Literature Review
Listed below are some of the techniques being used in cleaning and aggregating data:
Fuzzy Matching Using Edit Distance
Fuzzy matching is used to find strings that approximately match a pattern. It is usually utilized in search queries, as it can handle misspellings. It involves a metric called edit distance, which counts how many edits it would take for the pattern to exactly match a string. The smaller the edit distance between a string and the pattern, the greater the likelihood of a match.
However, fuzzy matching has a major drawback: it is known to produce a significant number of false positives, especially in a language like English, where words average about five letters. To minimize this error, it is recommended to allow an edit distance of at most 2 as the matching condition.
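As a minimal sketch, the edit distance and the recommended cutoff of 2 can be implemented as follows (the function names and sample strings here are illustrative, not from the POC):

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string `a` into string `b` (Levenshtein distance)."""
    # `prev` holds the distances between a[:i-1] and every prefix of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

def is_fuzzy_match(a: str, b: str, max_dist: int = 2) -> bool:
    """Treat two strings as a match when their edit distance is at
    most 2, the cutoff recommended above to limit false positives."""
    return edit_distance(a.lower(), b.lower()) <= max_dist

# A transposed surname is still matched, an unrelated one is not.
is_fuzzy_match("Santos", "Santso")  # two substitutions -> match
is_fuzzy_match("Garcia", "Reyes")   # far apart -> no match
```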
Graph-Based Model
One of the main issues in aggregating data is resolving the multiple entities that belong to one customer. A graph-based approach can solve this problem by treating each entity as a vertex and linking entities that share the same information type (surname, address city, birth month, etc.). To determine which group of entities pertains to one customer, a graph-based clustering algorithm can be applied.
One of the most commonly used algorithms is connected component analysis. It identifies maximal sets of nodes that are connected to each other by tracing their paths of connections. In an ideal scenario, entities of different customers should not share the same information, so they should not share links. Tracing connections on this type of graph should then yield small sets of paths from one vertex to another.
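A minimal sketch of connected component analysis on such an entity graph, using plain breadth-first search (the record IDs and links below are a made-up example, not POC data):

```python
from collections import defaultdict, deque

def connected_components(edges, nodes):
    """Group nodes into maximal sets in which every node can reach
    every other node by tracing a path of links."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # BFS from each unvisited node collects one full component.
        comp, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# Records 1-3 share links (e.g. same surname and city), record 4
# shares nothing, so the four records resolve to two customers.
links = [(1, 2), (2, 3)]
groups = connected_components(links, nodes=[1, 2, 3, 4])
# groups: [{1, 2, 3}, {4}]
```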
A graph is a flexible mathematical representation of objects. Numerous graphs can be built from a single dataset, and different algorithms can be applied to generate insights. However, graph techniques should be used carefully, as their metrics carry many caveats and have different implications on different types of graphs.
- Create a unified database across various customer and transaction databases
- Generate a unique list of customers by identifying and grouping record duplicates
- Create a model that can handle noisy data and millions of records
We used several transaction and customer databases, which include customer details such as birthday, name, contact number and address. In total, 12 tables and around 19 million records were used.
The general idea of the method is to subject the customer records to a filtering process to find likely similar record pairs, vectorize them and calculate the similarity score of each record pair. Filtering is necessary to lessen the computational time and resources needed. Vectorizing record details transforms these records into a mathematical representation that describes their structure, providing the ability to calculate how similar two records are even when there are misspellings and other noise in the data.
In the POC, we used motif detection, a graph-based approach, to determine the list of similar record pairs, and tf-idf to get the vector representation of the record details. The following steps were done:
- Databases were cleaned. Record details were standardized and normalized.
- A unified customer database from transactions and customers tables was created.
- First filter: exact-duplicate records were removed.
- Second filter: possible duplicate pairs were generated via graph motifs. That is, if two records shared the same fuzzy surname, city and province, they were listed as a possible duplicate pair.
- Strings of names and addresses were transformed into tf-idf vectors separately so the similarity between customer names and addresses could be calculated. The features used were character trigram counts.
- Cosine similarity between tf-idf vectors was calculated. Thus, each record pair has two cosine similarity scores: one for names and one for addresses.
- Third filter: a similarity graph of records was created by thresholding the cosine similarities. To get the best groupings, all possible thresholds from 0.5 to 1 should be exhausted.
- Groups of records pertaining to the same entity were obtained via graph connected components.
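The vectorization and scoring steps above can be sketched in plain Python. This is our simplified illustration, not the POC code: the sample names are made up, and a production run would use a library tf-idf implementation rather than the hand-rolled one below.

```python
import math
from collections import Counter

def trigrams(s):
    """Split a string into overlapping character trigrams."""
    s = s.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

def tfidf_vectors(strings):
    """Build tf-idf vectors over character trigram counts, one per string."""
    counts = [Counter(trigrams(s)) for s in strings]
    n = len(strings)
    df = Counter()                       # document frequency per trigram
    for c in counts:
        df.update(c.keys())
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

names = ["juan dela cruz", "juan dela crus", "maria santos"]
vecs = tfidf_vectors(names)

# A misspelled variant still scores high because most trigrams are
# shared; an unrelated name shares few or no trigrams and scores low.
# Pairs above a chosen threshold would become edges in the
# similarity graph, which is then split into connected components.
sim_typo = cosine(vecs[0], vecs[1])      # high
sim_diff = cosine(vecs[0], vecs[2])      # near zero
```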
To validate the model, we primarily need validation data: a list of record pairs labeled according to whether each pair pertains to the same person or not. As of now, we do not have a validation set at hand. Nonetheless, once we have the validation data, we could compute the precision and recall scores of the model at different thresholds.
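Once such labeled pairs exist, the evaluation at a given threshold could be computed as follows (the pair labels here are hypothetical, purely to show the calculation):

```python
def precision_recall(pred_pairs, true_pairs):
    """Compare model-predicted duplicate pairs against labeled true
    duplicate pairs from a validation set."""
    pred, true = set(pred_pairs), set(true_pairs)
    tp = len(pred & true)                      # correctly flagged pairs
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall

# Hypothetical run at one threshold: the model flags three pairs,
# the validation set contains three true pairs, and they overlap on two.
pred = [(1, 2), (3, 4), (5, 6)]
true = [(1, 2), (3, 4), (7, 8)]
p, r = precision_recall(pred, true)   # both 2/3
```

Sweeping the similarity threshold and recomputing these scores would trace out the precision-recall trade-off the text refers to.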
Extra: Other insights
One advantage of using graph-based methods is that data can be represented and combined in different ways to infer new underlying insights. In this case, one way is to combine the graph used for detecting motifs with the final similarity graph of records. The resulting graph can identify groups of customers that reflect social relationships, such as families and colleagues. This would generate new information, such as how these customers behave as a group.
Figure 1 shows the number of records at every step of the POC. There were initially around 19 million records to be processed. After the first filtering process, the number of records decreased to around 16 million, meaning there were around 2 million full-row duplicate records in the database. Due to time constraints, only the records belonging to the top 50 fuzzy surnames were used for the POC.
Figure 3. Number of customers identified at three different matching thresholds.
Figure 3 shows the number of customers identified at three different matching thresholds: loose (0.8), strict (0.9) and exact (1.0).
Since this was a POC project, it was done within a short period of time. Some issues were left unresolved and some parts could have been further improved. Discussed below are the major issues that should be addressed if this project is restarted as an R&D project.
First, the model built for the POC was not validated since no validation dataset was available; thus, we cannot determine the performance of the model. It is likely that other companies also do not have this type of data. In that case, we have to 1) find a way to collect validation data while the model is deployed and 2) find a way to let the model learn, preferably by itself, using the continuously generated validation data. These two functionalities are complicated enough that considerable development time would be needed.
Second, tf-idf is the simplest method that can be used, and many false positives were found in a sampled output. We need a more sophisticated method. It should still vectorize the customer details so that cosine similarity can be calculated, and it should also be able to handle other data noise and nicknames. It is recommended to start by looking into methods similar to Siamese networks and word2vec.