Automation for ADB Concept Papers

May 2021 | Research Papers

Introduction

Concept papers are developed by ADB to improve its organization. These documents aid in conceptualizing a project that, once approved by the director, provides a developmental loan or grant to the host country. Generating a concept paper can take anywhere from a few weeks to a month. We aim to help ADB automate the generation of concept papers. The product will be based on past and present knowledge found in ADB documents, as well as in external media outlets.

Concept paper generation

ADB creates concept papers by manually curating ADB-based documents and online materials (see Figure X). Concept papers and problem trees are generated after the sources are selected, and take 2 to 4 weeks and 3 to 5 days, respectively. We aim to reduce the generation time to about 30 minutes by utilizing a knowledge tree, a database that will learn from the ADB-based documents and online materials.

Related Literature Review

Industry and Academic Benchmarks

Various companies offer AI solutions for content creation; some of these include:

  • Articoolo : Creates texts by analyzing the user's input topic, extracting the sentiment and important keywords related to the topic, and reconstructing everything into one text.

(website: http://articoolo.com/)

  • AX Semantics : Generates texts from data such as fact sheets, spreadsheet data and weather data.

(website: https://www.ax-semantics.com/en/home.html)

  • Phrasetech : Generates content for eCommerce products.

(website: https://www.phrasetech.com/content/)

  • Article Forge : Generates article content given a topic. It can also add a title, links, videos, and images to the article.

(website: http://www.articleforge.com/)

  • WordAi : Automatically rewrites sentences and paragraphs.

(website: https://wordai.com/)

Objectives

The automation system seeks to help ADB improve and hasten the development of concept papers. We achieve this objective by automating the creation of the problem tree, the monitoring framework, and the concept paper itself. In our earlier proof of concept (POC), we focused on automating the generation of problem trees.

Methodology

Dataset Description

Six (6) types of documents were used as datasets: five (5) ADB document types and Xinhua news articles. All datasets involved China-based projects. The ADB documents were all in .pdf format; thus, we only worked with a few of each document type. The project documents were obtained from the public website of ADB.

The breakdown of data used is as follows:

  • 6 Concept Papers with problem trees (focused on road construction)
  • 6 Completed Projects
  • 16 Recommendation Reports of the President
  • 3 Xinhua news articles
  • 1 Country Operations Business Plan (for 2018-2020)
  • 1 Country Partnership Strategy (for 2016-2020)

Information Extraction

The Digital Age has transformed how information is acquired in today's society. Online documents and web-based news outlets have become target sources for computational language research. Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources [1]. Techniques from Natural Language Processing (NLP) have been used to extract information from unstructured sources such as paragraphs from a document or news article. These sources are converted and stored in databases in such a way that entities and the relationships between them are easily identified.

Open Information Extraction (OpenIE) by Stanford CoreNLP is the tool used in this project. OpenIE extracts open-domain relation triples, representing a subject, a relation, and the object of the relation [2]. Consider an excerpt from the document “Jiangxi Ji’an Sustainable Urban Transport Project”:

Jiangxi Ji’an is a prefectural level city in Jiangxi Province about 250 kilometers (km) south of the provincial capital of Nanchang. The city center is about 50 square km (km2) and has a population of about 520,000. The master plan for the center city envisions an expanded area of 150 to 200 km2 and a population of 1.5 to 2.0 million by 2030. 

OpenIE yields triplets, which depict the relationships between each element of the paragraph (see Figure X). Each triplet follows the format: source → connector → target.
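The source → connector → target format can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the `Triple` type and `build_graph` helper are hypothetical names, and the triples are hand-written paraphrases of the excerpt rather than real OpenIE output.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    source: str     # subject of the relation
    connector: str  # relation phrase
    target: str     # object of the relation


# Hand-written triples in the source -> connector -> target format.
# They paraphrase the excerpt for illustration and are NOT actual OpenIE output.
triples = [
    Triple("Jiangxi Ji'an", "is", "prefectural level city"),
    Triple("Jiangxi Ji'an", "is in", "Jiangxi Province"),
    Triple("city center", "has", "population of about 520,000"),
]


def build_graph(items):
    """Group triples by source so each entity lists its outgoing relations."""
    graph = defaultdict(list)
    for t in items:
        graph[t.source].append((t.connector, t.target))
    return dict(graph)


graph = build_graph(triples)
for source, edges in graph.items():
    for connector, target in edges:
        print(f"{source} -> {connector} -> {target}")
```

Grouping by source keeps all relations of one entity together, which is one simple way to store triplets so that entities and their relationships are easy to look up.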

Doc2Vec

The goal of doc2vec is to create a numeric representation of a document, regardless of its length. Unlike words, however, documents do not come in discrete logical units, so another method has to be found.

The concept that Mikolov and Le used is simple yet clever: they took the word2vec model and added another vector (the Paragraph ID).

This additional vector acts as a memory that remembers what is missing from the current context, or as the topic of the paragraph. While the word vectors represent the concept of a word, the document vector intends to represent the concept of a document.
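The role of the extra paragraph vector can be sketched as a single forward pass of the distributed-memory variant of doc2vec. This is a simplified illustration under stated assumptions: the vocabulary, vector size, and names such as `predict_next_word` are made up, the weights are random and untrained, and the vectors are averaged (the original model may also concatenate them). It does not show how the project itself used doc2vec.

```python
import math
import random

random.seed(0)

vocab = ["city", "center", "has", "a", "population"]  # toy vocabulary
dim = 4  # embedding size, arbitrary for this sketch


def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]


word_vecs = {w: rand_vec(dim) for w in vocab}    # one vector per word
doc_vecs = {"doc_0": rand_vec(dim)}              # one vector per paragraph ID
out_weights = {w: rand_vec(dim) for w in vocab}  # output (softmax) layer


def predict_next_word(doc_id, context_words):
    """Average the paragraph vector with the context word vectors,
    then turn the scores of all vocabulary words into probabilities."""
    vectors = [doc_vecs[doc_id]] + [word_vecs[w] for w in context_words]
    hidden = [sum(col) / len(vectors) for col in zip(*vectors)]
    scores = {w: sum(h * o for h, o in zip(hidden, out_weights[w])) for w in vocab}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


probs = predict_next_word("doc_0", ["city", "center", "has"])
```

During training, both the word vectors and the paragraph vector would be updated to make the true next word more likely; the trained paragraph vector is then used as the document's numeric representation.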

Validation Method

In the problem tree generator task, a model is considered good if it satisfies these conditions:

  • the generated problems are related to the given input
  • the generated problems are indeed problems
  • the generated causes and effects are indeed causes and effects of the problems 

All of these conditions can be evaluated once we have datasets containing combinations of word phrases and tags indicating how these phrases are related: cause, not cause, effect, not effect, similar, dissimilar.
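Given such a tagged dataset, the evaluation could be as simple as comparing predicted tags against the human-assigned ones. The sketch below assumes a hypothetical layout of (phrase, phrase, tag) tuples; the phrase pairs, the predictions, and the `accuracy_by_tag` helper are all made up for illustration and are not results from this project.

```python
from collections import defaultdict

# Hypothetical labeled pairs: (phrase_a, phrase_b, tag). Made up for illustration.
labeled = [
    ("air pollution", "vehicle emissions", "cause"),
    ("air pollution", "respiratory illness", "effect"),
    ("air pollution", "smog", "similar"),
    ("air pollution", "road length", "dissimilar"),
]

# Hypothetical model predictions for the same pairs, in the same order.
predicted = ["cause", "effect", "similar", "similar"]


def accuracy_by_tag(gold, preds):
    """Return overall accuracy and per-tag accuracy against the gold tags."""
    hits, totals = defaultdict(int), defaultdict(int)
    for (_, _, tag), pred in zip(gold, preds):
        totals[tag] += 1
        hits[tag] += int(tag == pred)
    per_tag = {tag: hits[tag] / totals[tag] for tag in totals}
    overall = sum(hits.values()) / len(gold)
    return overall, per_tag


overall, per_tag = accuracy_by_tag(labeled, predicted)
print(overall, per_tag)
```

Reporting accuracy per tag, rather than only overall, shows which of the three conditions (relatedness, problem identification, cause and effect) a model fails on.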

Solution Architecture

Results

Figure X. Filtered problem output with “Beijing air pollution” as input. Green-colored phrases are human-identified problems.

Figure X. Filtered causes output with “Beijing air pollution” as input. Green-colored phrases are human-identified problems.

Next steps

The solution was developed over a one-month period, since it was only a POC. Thus, some issues were not addressed. Below is the list of tasks to be done once this project is resumed with sufficient development time:

  • Improve the algorithm for creating a problem tree.
    • Find a way to tie together the concepts of knowledge graph and doc2vec.
    • Find a way to refine the generated problem tree texts.
    • Find a way to identify the connections within problem trees.
    • Add functionality for learning from user input.
  • Build a solution for generating concept papers.
  • Build a solution for generating the monitoring framework.

Potential new method to try: 

  1. Retrieve causal relations from texts
  2. Identify a problem phrase using a sentiment analysis approach
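The two steps above could start from something as simple as cue-phrase matching and a negative-word lexicon. The sketch below is only one possible starting point, not a method that was implemented: the cue phrases, the `NEGATIVE_WORDS` lexicon, and the function names are all assumptions chosen for illustration.

```python
import re

# Assumed cue phrases; a real system would need a much larger set.
CAUSE_FIRST = re.compile(
    r"^(?P<cause>.+?)\s+(?:causes|leads to|results in)\s+(?P<effect>.+?)\.?$", re.I
)
EFFECT_FIRST = re.compile(
    r"^(?P<effect>.+?)\s+(?:is caused by|because of|due to)\s+(?P<cause>.+?)\.?$", re.I
)

# Assumed toy lexicon of negatively charged words.
NEGATIVE_WORDS = {"pollution", "congestion", "poor", "lack", "unsafe", "inadequate"}


def extract_causal(sentence):
    """Return (cause, effect) if the sentence matches a cue phrase, else None."""
    text = sentence.strip()
    for pattern in (EFFECT_FIRST, CAUSE_FIRST):
        match = pattern.match(text)
        if match:
            return match.group("cause"), match.group("effect")
    return None


def is_problem_phrase(phrase):
    """Flag a phrase as a problem if it contains any negative lexicon word."""
    return any(word in NEGATIVE_WORDS for word in phrase.lower().split())


print(extract_causal("Heavy traffic leads to air pollution."))
print(is_problem_phrase("air pollution"))
```

A trained sentiment model could later replace the lexicon lookup without changing how the extracted cause and effect phrases are consumed.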

Preprocessing

  1. Copy-paste texts from PDF to CSV (sources: COBP, completion reports, CPS, RRP, news articles).
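One way to keep the pasted text well-formed is to write it with Python's csv module, which quotes commas and line breaks inside a cell. The column layout, file name, and row content below are hypothetical, since the structure of the original CSV is not specified.

```python
import csv
import io

# Hypothetical column layout; the original CSV structure is not specified.
FIELDS = ["source_type", "document", "text"]

rows = [
    {
        "source_type": "rrp",
        "document": "example_rrp.pdf",  # placeholder file name
        "text": "Pasted paragraph, with commas, goes here.",
    },
]

buffer = io.StringIO()  # in-memory stand-in for a .csv file
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

# Read the data back to confirm that the text survives the round trip.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[0]["text"])
```

Keeping a source_type column makes it possible to filter the text by document type later without reopening the PDFs.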