How to Extract Online Data for Research

18 November 2015

A large part of our work at the JustJobs Network is to address the challenges arising in research processes, one of which is the lack of relevant, usable data.

The lack of data has been a recurrent problem for me while researching skill development and vocational training in India. This is one reason why we engage in primary data collection on these issues.

While a fair bit of data is actually available online, it's often not in a machine-readable format. The data sits either in .pdf files or inside tables spread across multiple web pages. Data available in these formats is difficult to analyze, as it usually cannot be read by statistical software.

Data presented in .pdf files is particularly challenging to work with. However, it is possible to extract data from tables in .html files, such as the data on Industrial Training Institutes (ITIs) in India.

A useful way of extracting such data is Scrapy, a Python-based data scraping framework that can be set up using pip or easy_install. A good introduction to Scrapy can be found here: http://doc.scrapy.org/en/latest/intro/tutorial.html.
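For instance, with pip, installation is a single command:

    pip install scrapy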

In some cases, you will need to interact with the JavaScript of the page before extracting data. For these situations, Splash – a lightweight browser developed by Scrapinghub – can be used.
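As a sketch of that setup: Splash runs as a separate service, and the scrapy-splash plugin connects it to Scrapy. The port, Docker command, and middleware values below are that plugin's documented defaults, not details from this post:

    # run the Splash service itself, for example via Docker:
    #   docker run -p 8050:8050 scrapinghub/splash

    # settings.py -- wire the scrapy-splash plugin into the project
    SPLASH_URL = 'http://localhost:8050'
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
    }
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }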

Once installed, using Scrapy to extract data is quite easy. First, you will need to get the URLs for each ITI. This can be done in a few different ways. The easiest is to use the following URL format: https://ncvtmis.gov.in/Pages/ITI/Detail.aspx?code=iti_code
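As a minimal sketch, the per-ITI URLs can be built by substituting each code into that template (the codes file name here, iti_codes.txt, is hypothetical):

    # build one URL per ITI from a plain-text file of codes, one per line
    with open('iti_codes.txt') as f:
        start_urls = [
            'https://ncvtmis.gov.in/Pages/ITI/Detail.aspx?code=' + code.strip()
            for code in f if code.strip()
        ]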

The website also lists the URLs in the following format: ncvtmis.gov.in/Pages/ITI/Detail.aspx?ITI=xxxxx, where the URL is completed by what looks like a combination of state and district codes, but I found the ITI codes easier to work with.

Once you have the codes, you can use the standard Scrapy classes and functions to gather data on each ITI. First, set the URLs from above as the start_urls. Then, as an optional precaution, insert a wait time in the Splash code to make sure the .html loads before the scraping begins. Finally, insert the xpaths for the data you want to extract into your parse function. You will also need to make the corresponding changes in the items.py file. Now run the spider you have just written and save the data to a JSON file.
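Put together, the spider might look roughly like the sketch below. The item fields, xpaths, and file names are illustrative placeholders, not the ones used for the actual ITI pages:

    # items.py -- declare one Field per piece of data you extract
    import scrapy

    class ItiItem(scrapy.Item):
        name = scrapy.Field()      # hypothetical field
        courses = scrapy.Field()   # hypothetical field

    # spiders/iti_spider.py
    import scrapy
    from scrapy_splash import SplashRequest
    from myproject.items import ItiItem  # 'myproject' is a placeholder

    class ItiSpider(scrapy.Spider):
        name = 'itis'
        # per-ITI URLs built from the codes, as above
        start_urls = [
            'https://ncvtmis.gov.in/Pages/ITI/Detail.aspx?code=' + code.strip()
            for code in open('iti_codes.txt') if code.strip()
        ]

        def start_requests(self):
            for url in self.start_urls:
                # optional wait so the .html loads before scraping begins
                yield SplashRequest(url, self.parse, args={'wait': 2})

        def parse(self, response):
            item = ItiItem()
            # placeholder xpaths -- inspect the page source for the real ones
            item['name'] = response.xpath(
                '//span[@id="lblName"]/text()').extract_first()
            item['courses'] = response.xpath(
                '//table[@id="tblCourses"]//td/text()').extract()
            yield item

Running the spider and saving its output to JSON is then a single command:

    scrapy crawl itis -o itis.json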

Data from the JSON file is easy to extract – just read it into a pandas dataframe and then write it out to a .csv file.
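A minimal sketch, assuming the spider's output was saved as itis.json:

    import pandas as pd

    # Scrapy's JSON export is a list of records, which pandas
    # reads directly into a dataframe
    df = pd.read_json('itis.json')
    df.to_csv('itis.csv', index=False)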

While the process of data extraction using Scrapy may seem a bit complicated, it is well worth the effort. As the example above demonstrates, the process gives us a dataset of every ITI in India, along with data on the courses offered and enrollment at the course level.

Stay tuned for interesting findings from this data.

The full code can be found here: https://github.com/amanbirs/data-scraping-projects/tree/master/itis_list