Case Study
 

THE OVERVIEW

Derivedata is a robust platform, powered by the hub, that continuously monitors different kinds of websites (HTML, RSS, AJAX, Angular, React, etc.) and delivers structured information through an API.

For more information visit www.thehub.ai
 
 

Users receive value-added information in the form of content metadata within the API.

The platform has a simple dashboard that lets system administrators receive email alerts and view statistics as reports and charts.

 

The Process

 

THE REQUIREMENT

To design a page monitoring and extraction system that can monitor and scrape data from websites and provide useful information.

The platform should be able to deliver information as soon as it is published on the websites.

The system should be able to notify administrators with email alerts when a configured threshold is exceeded.
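
The alerting requirement can be illustrated with a minimal sketch. The case study does not say which metric the threshold applies to, so this example assumes a hypothetical count of failed extractions per website. The names should_alert and build_alert are illustrative, the addresses are placeholders, and the message is only built here, not sent.

```python
from email.message import EmailMessage


def should_alert(failed_extractions: int, threshold: int) -> bool:
    """Return True only when the observed count exceeds the configured threshold."""
    return failed_extractions > threshold


def build_alert(site: str, failed_extractions: int, threshold: int, admin: str) -> EmailMessage:
    """Build (but do not send) an alert email for a system administrator."""
    msg = EmailMessage()
    msg["To"] = admin
    msg["Subject"] = f"Threshold exceeded for {site}"
    msg.set_content(
        f"{site}: {failed_extractions} failed extractions (threshold {threshold})."
    )
    return msg


# Assumed metric: failed extractions for one monitored website.
if should_alert(failed_extractions=6, threshold=5):
    alert = build_alert("example.com", 6, 5, "admin@example.com")
```

Sending the message (for example over SMTP) is left out, since the case study does not describe the mail setup.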

THE CHALLENGE

  • Creating a separate queue for every section in RabbitMQ and checking the queues at the configured time interval for extraction.
  • Collecting the initial spiders over socket connections.
  • Working with AJAX, onclick and authentication-protected sites.
  • Checking index pages every 10 minutes and discovering new spiders.
  • Extracting PDF/DOC/XML content from regular URLs and download links.
  • Ensuring timely delivery of content without losing data, despite website configuration changes.
  • Identifying the website threshold limit to alert system administrators.
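
The interval-based checks in the list above come down to deciding which sections are due at a given moment. The sketch below shows that decision only, under stated assumptions: it keeps sections in memory rather than in RabbitMQ queues, and Section, due_sections and mark_checked are hypothetical names, not part of the platform.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str                             # hypothetical section identifier
    interval_seconds: int                 # how often this section should be checked
    last_checked: float = float("-inf")   # never checked yet


def due_sections(sections, now):
    """Return the sections whose check interval has elapsed at time `now`."""
    return [s for s in sections if now - s.last_checked >= s.interval_seconds]


def mark_checked(section, now):
    """Record that a section was checked at time `now`."""
    section.last_checked = now


# An index page checked every 10 minutes, and a slower section checked hourly.
index = Section("index", 600)
reports = Section("reports", 3600)

for section in due_sections([index, reports], now=0.0):
    mark_checked(section, 0.0)  # both are due on the first pass
```

In a real deployment each due section would presumably result in a message on that section's queue; that step is omitted here.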
 
 

THE SUCCESS

Our page monitoring and extraction solution relies on RabbitMQ and Redis queue systems to identify newly added or updated content on the websites. It then extracts this content into a structured API with metadata such as initial revision, current revision and previous revision.
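
One way to picture the revision metadata is a content fingerprint that is compared on every check. This is a sketch under assumptions, not the platform's actual method: it assumes a revision is identified by a SHA-256 hash of the extracted content, and RevisionRecord, fingerprint and update_revisions are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RevisionRecord:
    initial_revision: str                     # fingerprint of the first version seen
    current_revision: str                     # fingerprint of the latest version
    previous_revision: Optional[str] = None   # fingerprint before the latest change


def fingerprint(content: str) -> str:
    """Identify a version of the content by its SHA-256 digest (an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def update_revisions(
    record: Optional[RevisionRecord], content: str
) -> Tuple[RevisionRecord, bool]:
    """Return the up-to-date record and whether the content is new or changed."""
    digest = fingerprint(content)
    if record is None:
        # Newly added content: initial and current revision are the same.
        return RevisionRecord(digest, digest), True
    if digest == record.current_revision:
        # Unchanged content: keep the record as it is.
        return record, False
    # Updated content: the old current revision becomes the previous one.
    return RevisionRecord(record.initial_revision, digest, record.current_revision), True


record, changed = update_revisions(None, "first version of the page")
record, changed = update_revisions(record, "edited version of the page")
```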

As soon as a website is added to our Node.js platform with the required configuration, the extractor system crawls it carefully to identify all the potential links of the website, called spiders, and stores them in Elasticsearch.
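
The link discovery step can be sketched as follows. The example is written in Python for brevity, although the platform itself is described as Node.js, and it assumes that a spider is any same-host link found in an anchor tag. LinkCollector, discover_links and new_links are hypothetical names, and the discovered links are returned as a set rather than stored in Elasticsearch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.add(urljoin(self.base_url, value))


def discover_links(html, base_url):
    """Return the links on a page that stay on the same host as base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return {url for url in collector.links if urlparse(url).netloc == host}


def new_links(discovered, known):
    """Return the links that have not been seen before."""
    return discovered - known


page = '<a href="/a">A</a><a href="b.html">B</a><a href="https://other.example/x">X</a>'
found = discover_links(page, "https://example.com/news/")
fresh = new_links(found, known={"https://example.com/a"})
```

The same set difference would also cover the periodic index-page check, where only links not already known need to be queued.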

Based on the configured intervals, the extractor system looks for new or updated content on the website, scrapes it and updates the API, which can be embedded into any platform such as mobile or web.