How to Extract Online Data for Research

18 November 2015

A large part of our work at the JustJobs Network is to address the challenges arising in research processes, one of which is the lack of relevant, usable data.

The lack of data has been a recurrent problem for me while researching skill development and vocational training in India. This is one reason why we engage in primary data collection on these issues.

While a fair bit of data is actually available online, it's often not in a machine-readable format. The data sits either in .pdf files or inside tables spread across multiple web pages. Data available in these formats is difficult to analyze, as it usually cannot be read by statistical software.

Data presented in .pdf files is particularly challenging to work with. However, it is possible to extract data from tables in .html files, such as the data on Industrial Training Institutes (ITIs) in India.

A useful way of extracting such data is Scrapy, a Python-based data scraping framework that can be set up using pip or easy_install. A good introduction to Scrapy can be found here: http://doc.scrapy.org/en/latest/intro/tutorial.html.
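For instance, with pip, installation is a single command:

    pip install scrapy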

In some cases, you will need to interact with the JavaScript of the page before extracting data. For these situations, Splash – a lightweight browser developed by Scrapinghub – can be used.
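As a sketch of that setup: Splash runs as a separate service, and the scrapy-splash plugin connects it to Scrapy. The port, Docker command, and middleware values below are that plugin's documented defaults, not details from this post:

    # run the Splash service itself, for example via Docker:
    #   docker run -p 8050:8050 scrapinghub/splash

    # settings.py -- wire the scrapy-splash plugin into the project
    SPLASH_URL = 'http://localhost:8050'
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
    }
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }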

Once installed, using Scrapy to extract data is quite easy. First, you will need to get the URLs for each ITI. This can be done in a few different ways. The easiest is to use the following URL format: https://ncvtmis.gov.in/Pages/ITI/Detail.aspx?code=iti_code
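As a minimal sketch, the per-ITI URLs can be built by substituting each code into that template (the codes file name here, iti_codes.txt, is hypothetical):

    # build one URL per ITI from a plain-text file of codes, one per line
    with open('iti_codes.txt') as f:
        start_urls = [
            'https://ncvtmis.gov.in/Pages/ITI/Detail.aspx?code=' + code.strip()
            for code in f if code.strip()
        ]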

The website also lists the URLs in the following format: ncvtmis.gov.in/Pages/ITI/Detail.aspx?ITI=xxxxx, where the URL is completed by what looks like a combination of state and district codes, but I found the ITI codes easier to work with.

Once you have the codes, you can use the standard Scrapy classes and functions to gather data on each ITI. First, set the URLs from above as the start_urls. Then, as an optional precaution, insert a wait time in the Splash code to make sure the .html loads before the scraping begins. Finally, insert the xpaths for the data you want to extract into your parse function. You will also need to make the corresponding changes in the items.py file. Now run the spider you have just written and save the data to a JSON file.
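Put together, the spider might look roughly like the sketch below. The item fields, xpaths, and file names are illustrative placeholders, not the ones used for the actual ITI pages:

    # items.py -- declare one Field per piece of data you extract
    import scrapy

    class ItiItem(scrapy.Item):
        name = scrapy.Field()      # hypothetical field
        courses = scrapy.Field()   # hypothetical field

    # spiders/iti_spider.py
    import scrapy
    from scrapy_splash import SplashRequest
    from myproject.items import ItiItem  # 'myproject' is a placeholder

    class ItiSpider(scrapy.Spider):
        name = 'itis'
        # per-ITI URLs built from the codes, as above
        start_urls = [
            'https://ncvtmis.gov.in/Pages/ITI/Detail.aspx?code=' + code.strip()
            for code in open('iti_codes.txt') if code.strip()
        ]

        def start_requests(self):
            for url in self.start_urls:
                # optional wait so the .html loads before scraping begins
                yield SplashRequest(url, self.parse, args={'wait': 2})

        def parse(self, response):
            item = ItiItem()
            # placeholder xpaths -- inspect the page source for the real ones
            item['name'] = response.xpath(
                '//span[@id="lblName"]/text()').extract_first()
            item['courses'] = response.xpath(
                '//table[@id="tblCourses"]//td/text()').extract()
            yield item

Running the spider and saving its output to JSON is then a single command:

    scrapy crawl itis -o itis.json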

Data from the JSON file is easy to extract – just read it into a pandas dataframe and then write it out to a .csv file.
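A minimal sketch, assuming the spider's output was saved as itis.json:

    import pandas as pd

    # Scrapy's JSON export is a list of records, which pandas
    # reads directly into a dataframe
    df = pd.read_json('itis.json')
    df.to_csv('itis.csv', index=False)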

While the process of data extraction using Scrapy may seem a bit complicated, it is well worth the effort. As the example above demonstrates, the process gives us a dataset of every ITI in India, along with data on the courses offered and enrollment at the course level.

Stay tuned for interesting findings from this data.

The full code can be found here: https://github.com/amanbirs/data-scraping-projects/tree/master/itis_list