Improving data quality with AIdressIT

May 2021 | Research Papers

Introduction

AIdressIT is an AI address solution for parsing and normalizing unstructured, freeform addresses using NLP and open-source GIS data. The objective of this project is to improve data quality and minimize errors from manual encoding.

With clean, structured addresses, transactions can be analyzed at different geographic levels (barangay, city, etc.), giving customers a clearer view of the traffic in each location. This supports downstream analytic processes such as customer insight generation, logistics planning, and behavior analysis.

Industry and Academic Benchmarks

  1. usaddress – Python library for parsing unstructured address strings into standardized USA address components.
  2. Google Maps Geocoding API – Converts input text into geo-coordinates and tags it by Google Maps standards.
  3. Libpostal – C library (with Python bindings) which uses OpenStreetMap tags to parse input addresses.

Project Overview

The end-client, LBC, is reviewing its business processes to make them more efficient and cost-effective. As a market leader in remittances, deliveries, and logistics, LBC considers improving its delivery system key to optimizing its operations. To minimize errors and simplify its delivery system, LBC must ensure that correct and accurate address information is provided for each transaction. The task of manually reviewing and correcting each address in the delivery system is time-consuming and resource-intensive. LBC is therefore seeking a partner that will enable it to clean and improve its database more efficiently and effectively. Through Onshoring, Neural Mechanics proposes the project AIddressIT to address this business need.

The proposed project, AIddressIT, offers a data quality system that uses AI to improve the accuracy and correctness of the information in LBC’s delivery system. AIddressIT aims to augment the customer data sets, specifically the addresses, with external data to improve the delivery system database. This will help Onshoring solve LBC’s logistics and billing process problems by providing an alternative to an otherwise heavily manual database cleanup.

Dataset Description

LBC provided customer transactions as POC data to validate the cleaning performance of AIdressIT. The POC data set contains 2,500 transactions. Due to the limited time given, only the recipient’s address is used as input for AIdressIT. 

Note: Some transactions were excluded because their address information was not filled in (e.g., bills payment, E-LOAD). In total, 1,849 addresses were processed.

Further breakdown of the data shows (1) incomplete data in certain fields (e.g., no barangay, no city) and (2) data misplaced across fields (e.g., the full address entered in the street field).
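The two data issues above can be surfaced with a simple profiling pass. The following is a minimal sketch, not the project's actual code: the field names and the comma-counting heuristic for a misplaced full address are assumptions made for illustration, since the POC schema is not published here.

```python
from collections import Counter

# Hypothetical field names; the POC schema is not given in the source.
FIELDS = ("street", "barangay", "city", "province")


def profile_record(record):
    """Return a list of data quality flags for one address record."""
    flags = []
    for field in FIELDS:
        if not (record.get(field) or "").strip():
            flags.append(f"missing_{field}")
    # Assumed heuristic: a street value with two or more commas probably
    # holds a full address rather than just a street.
    street = record.get("street") or ""
    if street.count(",") >= 2:
        flags.append("full_address_in_street")
    return flags


def profile_dataset(records):
    """Count how often each flag occurs across the data set."""
    counts = Counter()
    for record in records:
        counts.update(profile_record(record))
    return counts


sample = [
    {"street": "12 Mabini St", "barangay": "", "city": "", "province": "Cebu"},
    {"street": "5 Rizal Ave, Brgy. San Roque, Marikina, Metro Manila",
     "barangay": "", "city": "", "province": ""},
]
print(profile_dataset(sample))
```

A profile like this helps decide which rules are worth writing first, since the most frequent flags point to the most common encoding problems.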

Architecture Overview

The architecture of AIddressIT consists of three major parts: the ETL/CEP Engine, the Rules Engine, and the Analytics Engine.

The Extract, Transform, Load (ETL) and Complex Events Processor (CEP) Engine is a high-speed processor which takes events from multiple data sources and prepares the data for AIddressIT to use. Data preparation could involve extracting, parsing (analyzing a set of textual elements), cleansing, and transforming the data.
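As an illustration of the cleansing and transforming step, the sketch below normalizes a raw address string before it reaches the rules. It is an assumption about what such preparation might look like, not the engine's actual logic; the abbreviation map is hypothetical and deliberately small.

```python
import re

# Hypothetical abbreviation map; a real deployment would maintain a larger one.
ABBREVIATIONS = {
    "st": "street",
    "st.": "street",
    "ave": "avenue",
    "ave.": "avenue",
    "brgy": "barangay",
    "brgy.": "barangay",
    "bgy": "barangay",
}


def prepare_address(raw):
    """Lowercase, collapse whitespace, tidy commas, and expand abbreviations."""
    text = raw.lower().strip()
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    text = re.sub(r"\s*,\s*", ", ", text)  # exactly one space after each comma
    tokens = []
    for token in text.split(" "):
        trailing = "," if token.endswith(",") else ""
        core = token.rstrip(",")
        tokens.append(ABBREVIATIONS.get(core, core) + trailing)
    return " ".join(tokens)


print(prepare_address("  12 Mabini St.,Brgy  San Roque ,  Marikina "))
```

Blind abbreviation expansion has limits: "St." can also stand for "Saint", so a production version would need context-aware handling.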

The Rules Engine stores and implements the various rules that have been defined by Neural Mechanics. This engine is also responsible for triggering actions when certain conditions are met.  Finally, the Rules Engine will be responsible for cleaning up LBC’s addresses.
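One common way to structure such an engine is as a list of condition/action pairs applied in order. The sketch below assumes that design for illustration; the `Rule` class, the two sample rules, and the field names are hypothetical and do not come from Neural Mechanics' rule set.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


@dataclass
class Rule:
    name: str
    condition: Callable[[Record], bool]  # when should the rule fire?
    action: Callable[[Record], Record]   # what does it change?


# Hypothetical pattern: "<street>, Brgy. <barangay>" typed into the street field.
BARANGAY_IN_STREET = re.compile(
    r"^(?P<street>.*?),\s*(?:barangay|brgy\.?)\s+(?P<barangay>[^,]+)$",
    re.IGNORECASE,
)


def barangay_misplaced(record: Record) -> bool:
    street = record.get("street") or ""
    return not record.get("barangay") and bool(BARANGAY_IN_STREET.match(street))


def move_barangay(record: Record) -> Record:
    match = BARANGAY_IN_STREET.match(record["street"])
    updated = dict(record)
    updated["street"] = match.group("street").strip()
    updated["barangay"] = match.group("barangay").strip()
    return updated


def city_needs_casing(record: Record) -> bool:
    city = record.get("city") or ""
    return bool(city) and city != city.title()


def title_case_city(record: Record) -> Record:
    updated = dict(record)
    updated["city"] = record["city"].title()
    return updated


RULES: List[Rule] = [
    Rule("move_barangay_out_of_street", barangay_misplaced, move_barangay),
    Rule("title_case_city", city_needs_casing, title_case_city),
]


def apply_rules(record: Record, rules: List[Rule]) -> Tuple[Record, List[str]]:
    """Apply each matching rule in order; return the result and the rules fired."""
    current = dict(record)
    fired = []
    for rule in rules:
        if rule.condition(current):
            current = rule.action(current)
            fired.append(rule.name)
    return current, fired


cleaned, fired = apply_rules(
    {"street": "12 Mabini St, Brgy. San Roque", "barangay": "", "city": "marikina"},
    RULES,
)
print(cleaned, fired)
```

Keeping rules as data rather than hard-coded branches makes it straightforward to add new rules as more address patterns appear, and recording which rules fired gives an audit trail for each cleaned record.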

The Analytics Engine discovers, interprets, and communicates meaningful patterns in data. This engine is especially valuable in areas rich with recorded information, and relies on the simultaneous application of statistics, computer programming, and operations research to quantify performance and prediction. The Analytics Engine will be responsible for AIddressIT’s address prediction to ensure that all succeeding encoded addresses are of the correct format.

Process Flow

Step 1.) LBC pushes data to NM’s API, where it enters the ETL/CEP Engine. The ETL/CEP Engine prepares the data for the Rules Engine’s consumption.

Step 2.) The Rules Engine first identifies the data it receives and applies the corresponding rules. These rules are maintained by NM, as seen in 5a. It is expected that the more data LBC sends to the API, the more rules will be needed to generate the address syntax required by LBC.

Step 3 & 3a.) NM will store the cleaned data from the Rules Engine in a MySQL database.

Step 3b.) A cleaned and corrected CSV file will be returned to LBC for their consumption. LBC will specify where the CSV file is to be dropped.
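The CSV hand-off in Step 3b could look like the sketch below, which uses Python's standard `csv` module. The column names are assumptions for illustration; the actual layout would follow the address syntax LBC defines.

```python
import csv
import io

# Hypothetical output columns; the real layout is defined by LBC.
COLUMNS = ["transaction_id", "street", "barangay", "city", "province"]


def write_cleaned_csv(records, handle):
    """Write cleaned records to any text file-like object as CSV."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing fields are written as empty strings, never dropped.
        writer.writerow({column: record.get(column, "") for column in COLUMNS})


buffer = io.StringIO()
write_cleaned_csv(
    [{"transaction_id": "T-001", "street": "12 Mabini St",
      "barangay": "San Roque", "city": "Marikina", "province": "Metro Manila"}],
    buffer,
)
print(buffer.getvalue())
```

In practice the handle would be a file opened at the drop location LBC specifies, for example `open(path, "w", newline="", encoding="utf-8")`.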

Step 4.) Cleaned data from the Rules Engine and MySQL database can be used to predict addresses when LBC uses the API for encoding new addresses. This means that the API can suggest the address the encoder is typing, which helps normalize LBC’s data in the future.
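To make the suggestion idea concrete, the sketch below offers completions from previously cleaned addresses using a plain prefix lookup over a sorted list. This is a deliberately simple stand-in, not the predictive model described in this paper; the class name and sample addresses are hypothetical.

```python
import bisect


class AddressSuggester:
    """Suggest previously cleaned addresses that start with the typed text."""

    def __init__(self, cleaned_addresses):
        # Map lowercase keys to the original display form, dropping duplicates.
        self._display = {}
        for address in cleaned_addresses:
            self._display.setdefault(address.lower(), address)
        self._keys = sorted(self._display)

    def suggest(self, typed, limit=5):
        prefix = typed.lower().lstrip()
        if not prefix:
            return []
        # Binary search for the first key that is >= the prefix.
        start = bisect.bisect_left(self._keys, prefix)
        results = []
        for key in self._keys[start:]:
            if not key.startswith(prefix) or len(results) >= limit:
                break
            results.append(self._display[key])
        return results


suggester = AddressSuggester([
    "San Roque, Marikina",
    "San Isidro, Makati",
    "Santa Cruz, Manila",
    "Poblacion, Makati",
])
print(suggester.suggest("san"))
print(suggester.suggest("san r"))
```

Because suggestions come only from cleaned records, encoders are nudged toward the standardized form, which is the normalization benefit described in Step 4.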

Step 5, 5a, 5b.) Apart from developing the API to manage and clean up LBC’s addresses, NM will also need to augment the Rules Engine with new rules as addresses pour in from LBC’s databases. NM will also need to maintain and enhance the predictive model to improve its suggestions during address input. The prediction capability can be introduced later in the project, when NM has more historical data from LBC to support the prediction model.

Results

AIdressIT achieved a 76.8% cleanup rate for completely filled addresses. The main difficulties encountered by the solution were misspelled address details and misplaced address information (e.g., a barangay entered in the street field).
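Misspelled address details are often handled with fuzzy matching against a list of known place names. The sketch below shows one such approach using `difflib` from the Python standard library; it is an illustrative technique and not necessarily the one AIdressIT uses. The name list and the 0.8 cutoff are assumptions.

```python
import difflib

# Hypothetical excerpt of a reference list of barangay names.
KNOWN_BARANGAYS = ["San Roque", "Santa Cruz", "Poblacion", "Bagumbayan"]


def correct_name(value, known_names, cutoff=0.8):
    """Return the closest known name, or None if nothing is similar enough."""
    lookup = {name.lower(): name for name in known_names}
    matches = difflib.get_close_matches(
        value.lower().strip(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


print(correct_name("Poblasion", KNOWN_BARANGAYS))
print(correct_name("Bagumbyan", KNOWN_BARANGAYS))
print(correct_name("Xyz", KNOWN_BARANGAYS))
```

The cutoff trades recall for safety: a lower value corrects more typos but risks mapping a valid, unlisted name onto the wrong place, so returning `None` and leaving the value for review is the safer default.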

For incomplete data, AIdressIT achieved a 61.8% cleanup rate for addresses with street and barangay only. This means that the solution can fill in the missing portions of an address using only street and barangay information. Relying on street alone, the cleanup rate was 54.8%.

Results show that even with incomplete address information, which is expected for manual address encoding, the solution can still standardize addresses by using certain fields like street and barangay to predict higher-level geographic units.
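Inferring higher-level units from lower-level ones can be pictured as a lookup against a gazetteer of known barangay, city, and province combinations. The sketch below is an assumption about how such a fill-in could work, using only the barangay (and the city, when present, to disambiguate). The gazetteer rows are illustrative, and the source does not describe how the street field contributes to the prediction.

```python
# Illustrative gazetteer rows: (barangay, city, province).
GAZETTEER = [
    ("San Roque", "Marikina", "Metro Manila"),
    ("San Roque", "Antipolo", "Rizal"),
    ("Bagumbayan", "Taguig", "Metro Manila"),
]


def fill_higher_levels(record, gazetteer=GAZETTEER):
    """Fill in city and province when the barangay identifies them uniquely."""
    barangay = (record.get("barangay") or "").strip().lower()
    candidates = [row for row in gazetteer if row[0].lower() == barangay]
    # If a city is already present, use it to narrow the candidates.
    city = (record.get("city") or "").strip().lower()
    if city:
        candidates = [row for row in candidates if row[1].lower() == city]
    if len(candidates) != 1:
        return dict(record)  # unknown or ambiguous: leave the record as-is
    _, city_name, province = candidates[0]
    filled = dict(record)
    filled["city"] = filled.get("city") or city_name
    filled["province"] = filled.get("province") or province
    return filled


print(fill_higher_levels({"street": "1 A St", "barangay": "Bagumbayan"}))
print(fill_higher_levels({"street": "2 B St", "barangay": "San Roque"}))
```

The second call leaves the record unchanged because the barangay name occurs in more than one city. Ambiguity of this kind is one plausible reason cleanup rates fall as fewer fields are available.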

Visualization Examples

Visualization is done via the Google Maps API.

Visualization of traffic using province information from cleaned recipient addresses

Visualization of “hotspots” using city information from cleaned recipient addresses

Project Deliverables

The development of AIddressIT aims to deliver the following:

Project Infrastructure and Requirements

In order for the models built by Neural Mechanics to run properly, the Client must acquire the Oracle Cloud Infrastructure: Block Storage Classic for Oracle Public Cloud with 2TB.

Oracle Cloud is a cloud computing service offered by Oracle Corporation providing servers, storage, network, applications, and services through a global network of Oracle Corporation managed data centers. The company allows these services to be provisioned on demand over the Internet.

Oracle Cloud provides Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), and Data as a Service (DaaS). These services are used to build, deploy, integrate, and extend applications in the cloud. This platform supports numerous open standards (SQL, HTML5, REST, etc.), open-source solutions (Kubernetes, Hadoop, Kafka, etc.), and a variety of programming languages, databases, tools, and frameworks including Oracle-specific, Open Source, and third-party software and systems.

Timeline

The estimated timeline for the development of the system is shown in the table below. 

The proposed project will be implemented for a duration of around thirteen (13) weeks.