
Building a government data visualizer using Node.js

Please note that HP IDOL OnDemand is now HPE Haven OnDemand. The API endpoints have changed to Haven OnDemand. Please see the API documentation for more details.

 

---

 

In February 2015 I haphazardly assembled a team of college students at Tufts’s International Development Hackathon. The goal? In about 18 hours, to create an app that causes social impact in the developing world using cutting-edge technologies.

 

Watch this video to learn more about the background and motivations for this app...

 

 

We decided to work with the World Bank to create an app that would help Tanzanian citizen journalists expose corruption in their government. Their government publishes budget data in dirty scanned PDFs, which are impossible to parse and make sense of. We wanted to empower citizens to convert these into machine-readable CSV (comma-separated values) format, which makes data visualization and number crunching easy. We aimed to help non-technical users build graphs, charts, tables, and – ultimately – stories out of the data buried in the PDFs. Given what we were looking to do, we decided to use HP’s IDOL OnDemand API, a free API that provides a suite of powerful machine learning, image processing, and data processing tools. This led us to create Transparent, which earned us first place overall.

 

[Image: scanned PDF of Tanzanian budget]

 

Building your own version of Transparent

 

In this blog post, I’m going to walk you through building your own government data visualizer web app that uses many of the same technologies and techniques that my team did in building Transparent. We’re going to work through:

 

  • Scaffolding the app
  • Setting up basic functionality
  • Integrating APIs like HP IDOL OnDemand
  • Storing and extracting text from PDF files using IDOL OnDemand Store Object and OCR APIs

 

We’re going to focus on the most interesting and complex parts of creating the app, so some parts will be left out. You can check out Transparent’s source code to follow along and fill in the gaps. We’ll do things a little differently in this tutorial than in Transparent, but the big ideas will be the same.

 

Feature brainstorm

 

The first thing we need to do is set out what we want the app to do. Here were our visions for Transparent:

 

  • Let users upload and store Tanzanian budget PDFs
  • Convert these PDFs into text using IDOL OnDemand’s OCR (Optical Character Recognition) API
  • Convert the raw text into CSVs
  • Build in some basic data visualization so users can see graphs of the data in the CSVs
  • Allow users to download the CSVs to do their own visualization with a spreadsheet app
  • Let users download the CSV version of a PDF just by clicking on it

 

Your app will have similar features, but you can very easily adapt it to work with other types of PDFs – perhaps information published by your country. The beauty of this type of design is that you just have to swap one small algorithm to work with an entirely different type of PDF.
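To make the "swap one small algorithm" idea concrete, here is a minimal sketch of the text-to-CSV step with the layout-specific part isolated in one function. The function names and the "label followed by an amount" line layout are assumptions made for illustration; they are not taken from Transparent's source, and real budget PDFs will need their own parser.

```javascript
// Hypothetical helpers for illustration; not Transparent's actual code.

// Quote a single CSV field if it contains a comma, a quote, or a line break.
function csvField(value) {
    var s = String(value);
    if (/[",\n\r]/.test(s)) {
        return '"' + s.replace(/"/g, '""') + '"';
    }
    return s;
}

// Row parser for ONE assumed layout: a label followed by an amount,
// e.g. "Ministry of Health   1,250,000".
// Returns [label, amount] or null when the line does not match.
function parseLabelAmountLine(line) {
    var match = /^(.+?)\s+([\d,]+(?:\.\d+)?)$/.exec(line.trim());
    if (!match) {
        return null;
    }
    // Strip thousands separators so spreadsheets read the amount as a number.
    return [match[1], match[2].replace(/,/g, '')];
}

// Convert raw OCR text to CSV using whichever row parser fits the PDF layout.
function textToCsv(text, header, parseLine) {
    var rows = text.split(/\r?\n/)
        .map(parseLine)
        .filter(function (row) { return row !== null; });
    return [header].concat(rows)
        .map(function (row) { return row.map(csvField).join(','); })
        .join('\n');
}
```

Supporting a different kind of PDF would then mean writing a new function in place of parseLabelAmountLine and passing it to textToCsv; the CSV quoting and assembly stay the same.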

 

What you’ll need

 

You’ll want to be familiar with the following:

 

  • Frontend web development (designing webpages) with HTML, CSS, and JavaScript
  • Backend web development (building server apps) with Node.js and Express
  • Using Git for version control
  • Working with the terminal

 

And you’ll want to have the following:

 

  • A laptop running OS X or Linux (if you have Windows, you can run Ubuntu in a virtual machine)
  • A free GitHub account
  • A text editor. I recommend GitHub’s Atom editor.

 

Getting started

 

Setting up GitHub

 

GitHub is a very popular home for open-source projects. For this project, you’ll store the source code that makes up your app here so that your co-authors and the world can see it, adapt it, and improve it. More generally, it’s a great place to contribute to cool software projects or even lead some yourself.

Let’s make a repository, or home for your project’s code, on GitHub. Create a new public repository. Choose an open-source license (I recommend MIT), don’t specify a .gitignore, and boom – you have a repository. Your project will grow here!

 

Getting the code

 

Your repository only lives online now – let’s grab a copy of the code for your own laptop so that you can actually work on it. On your project’s GitHub page, you’ll see a text area that contains something called an ‘HTTPS Clone URL’. Copy that text. Now open up your terminal (also called your command line or console) and use cd to navigate to the folder where you want your project to live.

 

We’re going to start entering commands on the terminal. For the purposes of this guide, all terminal commands will be in a styled box and have a $ in front. Don’t actually type the $; that’s just to indicate you’re at the terminal.

 

Anyway, run this command:

 

$ git clone *[HTTPS Clone URL]*

 

Replace the stuff in the brackets with what you just copied. For instance, to get a copy of the repository my team used, you would do

 

$ git clone https://github.com/Team-Transparent/IDHacks.git

 

Getting more tools

 

We’re going to need some free tools to build this web app. We’ll use the Node Package Manager (NPM) to install the first few. Run the following:

 

$ npm install -g yo bower grunt-cli express

 

Yeoman (nicknamed “yo”) is a powerful tool for scaffolding out web apps (setting up the files and folders you’ll need), Bower helps you manage CSS and JavaScript libraries, and Grunt is useful for running your code on your machine before you deploy it elsewhere. Express makes it easy to run a web server using Node.js.

 

We’re ready to go!

 

Scaffolding

 

Generating the structure with Yeoman

 

Yeoman makes it incredibly simple to generate all the files, folders, and configuration you’ll need for your web app, which saves you lots of time and energy. cd into your project folder and run:

 

$ npm install -g generator-express
$ yo express

 

This runs a Yeoman “generator”, which generates the files and folders you’ll need. You’ll be prompted to choose several settings.

 

Choose:

  • New directory: no
  • Version: Basic
  • View engine: Handlebars
  • Preprocessor: None
  • Build tool: Grunt

Once it’s done you’ll see a bunch of stuff in your folder, including:

  • /public/ – this is where your frontend CSS and JavaScript code will live.
  • /views/ – this is where your pages’ HTML will live. (Actually, it’s a templating language called Handlebars, which compiles into HTML.)
  • /routes/ – this is where your server-side JavaScript code will live.

Don’t worry too much about the rest.

 

Installing necessary packages

We’ll be using a bunch of tools called packages to help build our app. Node.js provides useful packages for our server. Get some with:

 

$ npm install --save underscore restler

 

Now we can get some frontend packages with Bower, which is itself a Node.js package. Run this:

 

$ bower install --save bootstrap fontawesome jquery

 

Try it out

 

Run

 

$ grunt

 

And open up localhost:3000 in your browser. You’ll see a basic webpage.

 

Congrats! It’s still very boring, but we’re going to change that.

 

Converting PDFs to text using IDOL OnDemand

 

Let’s tackle the first interesting challenge, which is using IDOL OnDemand’s OCR (Optical Character Recognition) to extract raw text out of PDFs.

 

Before we get to the code to actually extract the text, we’ll have to do some miscellaneous setup.

 

Uploading PDFs

 

We need some PDFs to get text out of, of course. Make a new folder public/pdf and put some interesting PDFs there. Give them numerical names like 1.pdf, 2.pdf, and so on.

 

Feel free to use team Transparent’s Tanzanian budget PDFs.

 

A new view

 

A view is basically a template that your app will throw data into and serve to the user. Let’s make a very simple view that, given some raw text, will just show it to the user. In the views folder, create a new file called dump.handlebars and put the following in it:

 

<h1>Text dump</h1>
<pre>{{text}}</pre>

See that “text” inside the double curly braces? That’s a variable. Your app will send a variable called text to this template, and this template will replace {{text}} with whatever the variable’s value was and serve up that HTML page to the user. It’s a very simple way of sending some data from the server to the user.
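If you are curious what that replacement step amounts to, here is a deliberately simplified stand-in written in plain JavaScript. It is not Handlebars and does not use its API; the function names are made up for this sketch, and it only handles simple {{name}} placeholders. It does mirror two behaviors worth knowing about: values are HTML-escaped before being inserted, and a placeholder with no matching variable renders as nothing.

```javascript
// Simplified illustration of template substitution; NOT Handlebars itself.

// Escape characters that would otherwise be interpreted as HTML.
function escapeHtml(value) {
    return String(value)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}

// Replace every {{name}} in the template with the escaped value of context[name].
// Placeholders with no matching variable become an empty string.
function renderTemplate(template, context) {
    return template.replace(/\{\{\s*(\w+)\s*\}\}/g, function (whole, name) {
        if (!Object.prototype.hasOwnProperty.call(context, name)) {
            return '';
        }
        var value = context[name];
        return value === undefined || value === null ? '' : escapeHtml(value);
    });
}
```

One practical consequence: the variable name sent from the server has to match the name in the template exactly, otherwise the page renders with an empty spot and no error.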

 

Routing

 

Let’s set up some routing – telling the app to do something when the user visits a certain URL. Usually this involves taking some action and rendering some page. Open up routes/index.js and you’ll see this:

 

var express = require('express');
var router = express.Router();

/* GET home page. */

router.get('/', function(req, res) {
  res.render('index', { title: 'Express' });
});

module.exports = router;

 

Let’s add a new test route that, given a certain URL, will spit out some text using the “dump” template we just created. Insert the following code right after the first router.get(...) block of code.

 

router.get('/test', function(req, res) {
    res.render('dump', { text: "Hi there!" });
});

 

This means that whenever the user visits localhost:3000/test, the dump.handlebars view will get rendered with the text “Hi there!”. That means the user will see this HTML:

 

<h1>Text dump</h1>
<pre>Hi there!</pre>

 

Open up localhost:3000/test to see this for yourself.

 

Introducing the HP IDOL OnDemand API

 

We’re ready to start using HP’s IDOL OnDemand API, which contains a bunch of useful big data tools like face/image recognition, sentiment analysis (determining the tone of a piece of text), machine learning, and OCR (optical character recognition). We’ll be using the last of these, OCR, to extract text out of a PDF. Check out the OCR demo.

 

I’d used IDOL OnDemand at a past hackathon because it was a powerful tool for making sense of complicated data like PDFs and decided to use it again here for the same reason. It provides some incredibly powerful tools that can help you create really cool projects – plus, it’s all free.

To start using this API, sign up for a developer account. Once you’ve logged in, you can find your API Key here – you’ll need it to use the API.
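Since your repository is public, you may not want the API key committed alongside your code. One common approach, sketched here under assumptions, is to read it from an environment variable when the server starts. The variable name IDOL_API_KEY and the helper loadApiKey are hypothetical names chosen for this example.

```javascript
// Read the API key from the environment instead of hardcoding it.
// IDOL_API_KEY is a hypothetical variable name chosen for this sketch.
function loadApiKey(env) {
    var key = (env.IDOL_API_KEY || '').trim();
    if (!key) {
        // Fail early with a clear message rather than sending requests without a key.
        throw new Error('Set the IDOL_API_KEY environment variable before starting the app.');
    }
    return key;
}
```

In routes/index.js you could then write var API_KEY = loadApiKey(process.env); and start the server with the variable set in your shell.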

 

OCR with IDOL OnDemand

 

Let’s actually write a function that, given the path to a PDF we uploaded, extracts the text from it using IDOL OnDemand. Put this toward the top of routes/index.js, right below the var router... line. It’s a lot of code, so read through it to make sure you understand! Basically, we’re just calling the APIs to store the PDF and extract the text from the stored file.

 

// package to read files
var fs = require('fs');
// package to make HTTP requests from the server
var restler = require('restler');
// utility functions
var _ = require('underscore');

// your IDOL OnDemand api key that you just generated
var API_KEY = "[YOUR API KEY HERE]";

/**
    Given the path to a PDF file, stores it with IDOL OnDemand and runs it through the OCR API to extract the text from it.
    If the text is successfully extracted, calls the success callback with the text.
    If anything fails, calls the failure callback.
*/
var convertPdf = function(filename, success, failure) {
    // store the PDF on IDOL OnDemand's servers for later processing
    // look up the file's size, which restler needs for the multipart upload
    fs.stat(filename, function(err, stats) {
        if (err) {
            // the file is missing or unreadable
            if (failure) {
                failure();
            }
            return;
        }
        // store the file
        restler.post("https://api.idolondemand.com/1/api/sync/storeobject/v1", {
            multipart: true,
            data: {
                apikey: API_KEY,
                file: restler.file(filename, null, stats.size, null, "application/pdf")
            }
        }).on("complete", function(data) {
            // IDOL returns a reference (a unique identifier) to the uploaded file
            // data = { reference : string }
            if (data && data.reference) {
                // run the PDF through the OCR api
                restler.post("https://api.idolondemand.com/1/api/sync/ocrdocument/v1", {
                    data: {
                        apikey: API_KEY,
                        reference: data.reference,
                        mode: "document_scan"
                    }
                }).on("complete", function(data) {
                    if (data && data.text_block) {
                        // success! the PDF's text is encoded in `data`;
                        // now just grab the text from the raw data
                        var text = _(data.text_block).pluck('text').join('\n');
                        if (success) {
                            success(text);
                        }
                    }
                    else {
                        // something went wrong with OCR!
                        if (failure) {
                            failure();
                        }
                    }
                });
            }
            else {
                // something went wrong storing the file!
                if (failure) {
                    failure();
                }
            }
        });
    });
}
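 

The part of that code most likely to surprise you is the shape of the OCR response. As a rough sketch, the extraction step can be pulled into its own small function that tolerates an unexpected response. It assumes only what the code above relies on: a text_block array whose entries carry a text field. The helper name extractOcrText is made up for this example, and treating an Error object as a failed response is an assumption about how the request library reports errors.

```javascript
// Pull the recognized text out of an OCR response.
// Returns the joined text, or null if the response doesn't look like OCR output
// (for example, when the request failed and `data` is an Error or is missing).
function extractOcrText(data) {
    if (!data || !Array.isArray(data.text_block)) {
        return null;
    }
    return data.text_block
        .map(function (block) {
            // Skip over malformed entries instead of crashing on them.
            return block && typeof block.text === 'string' ? block.text : '';
        })
        .join('\n');
}
```

Inside convertPdf, the innermost callback could call extractOcrText(data) and pass any non-null result to success. convertPdf itself is still called the same way.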

 

Now we can use this function! Add this new route to routes/index.js below the first two:

 

// Extract text from the PDF with the given id and show it to the user.
router.get('/text/:id', function(req, res) {
    // pdfId = the :id route parameter from the URL path
    var pdfId = req.params.id;
    var filename = "public/pdf/" + pdfId + ".pdf";
    convertPdf(filename, function success(text) {
        // just dump the extracted text
        res.render('dump', { text: text });
    }, function failure(){
        // show that there was an error
        // the dump view only displays a variable called `text`
        res.render('dump', { text: "Error parsing PDF!" });
    });
});

 

Now if you visit localhost:3000/text/12, you’ll see the text contained in public/pdf/12.pdf! (Naturally, what URLs you use depends on what numbers you gave your PDFs.)
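Because the id in the URL ends up inside a file path, it is worth checking that it really is one of the numbers you gave your PDFs before touching the disk. Here is a minimal sketch of such a check. The helper name pdfPathForId is hypothetical, and the digits-only rule assumes you followed the numerical naming scheme from earlier.

```javascript
var path = require('path');

// Return the path to the PDF for a numeric id, or null if the id is not
// a plain run of digits (which rejects values such as "../secret").
function pdfPathForId(id, pdfDir) {
    if (typeof id !== 'string' || !/^\d+$/.test(id)) {
        return null;
    }
    return path.join(pdfDir, id + '.pdf');
}
```

In the route, you could compute var filename = pdfPathForId(req.params.id, 'public/pdf'); and render the error message right away when it comes back null.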

 

Success!

 

Awesome – we’ve set up a web app that uses Node.js and IDOL OnDemand’s OCR API to extract text out of any PDF you’ve uploaded.

 

 

~ Neel Mehta, Harvard ’18 (hathix.com)
