
THE CHALLENGE

  • Creating a separate queue in RabbitMQ for every site section and polling those queues on a time interval to schedule extraction (see the queue sketch after this list).
  • Collecting the initial spider data over raw socket connections.
  • Working with sites that rely on AJAX, onclick navigation, and authentication.
  • Checking index pages every 10 minutes and discovering pages that need new spiders.
  • Extracting PDF/DOC/XML content from both regular URL links and download links (see the extraction sketch below).
  • Ensuring timely delivery of content without losing data, despite changes to website configuration.
  • Identifying each website's threshold limit so that system administrators can be alerted (see the alerting sketch below).
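The queue-per-section approach could look roughly like the sketch below, using the pika client. The section names, queue naming scheme, and 10-minute interval are assumptions for illustration, not details taken from the project.

    # A minimal sketch, assuming RabbitMQ runs on localhost and that the
    # section names below are placeholders for the real site sections.
    import time
    import pika

    SECTIONS = ["news", "reports", "archives"]   # hypothetical section names
    POLL_INTERVAL_SECONDS = 600                  # assumed 10-minute polling window

    def declare_section_queues(channel):
        """Create one durable queue per site section."""
        for section in SECTIONS:
            channel.queue_declare(queue=f"extract.{section}", durable=True)

    def drain_queues(channel):
        """Pull every pending URL from each section queue and hand it to the extractor."""
        for section in SECTIONS:
            while True:
                method, properties, body = channel.basic_get(queue=f"extract.{section}")
                if method is None:               # queue is empty for this interval
                    break
                extract(section, body.decode())
                channel.basic_ack(method.delivery_tag)

    def extract(section, url):
        # Placeholder for the real extraction pipeline (spider or parser call).
        print(f"extracting {url} for section {section}")

    if __name__ == "__main__":
        connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = connection.channel()
        declare_section_queues(channel)
        while True:
            drain_queues(channel)
            time.sleep(POLL_INTERVAL_SECONDS)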
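Extracting text from linked documents can be sketched as a simple content-type dispatch. The example below assumes requests for fetching and pdfminer.six for PDF text; the URL is a placeholder, and DOC conversion would need an additional tool not shown here.

    # A minimal sketch, assuming requests for fetching and pdfminer.six for PDF
    # text; the URL is a placeholder, and DOC conversion is left out.
    import io
    import requests
    from pdfminer.high_level import extract_text

    def fetch_document_text(url):
        """Download a linked document and return its plain text based on content type."""
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        content_type = response.headers.get("Content-Type", "").lower()
        if "pdf" in content_type or url.lower().endswith(".pdf"):
            return extract_text(io.BytesIO(response.content))
        # XML and other text responses are returned as-is for downstream parsing.
        return response.text

    if __name__ == "__main__":
        print(fetch_document_text("https://example.com/sample.pdf")[:500])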
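Threshold alerting can be illustrated as a per-site request counter that notifies administrators once a budget is reached. The threshold value, email addresses, and SMTP host below are placeholders, not values from the project.

    # A minimal sketch of threshold alerting; the request budget, addresses,
    # and SMTP host are placeholders.
    import smtplib
    from collections import Counter
    from email.message import EmailMessage

    REQUEST_THRESHOLD = 5000            # assumed per-site request budget
    ADMIN_EMAIL = "admin@example.com"   # placeholder address

    request_counts = Counter()

    def record_request(site):
        """Count a request against the site and alert once the threshold is reached."""
        request_counts[site] += 1
        if request_counts[site] == REQUEST_THRESHOLD:
            alert_admin(site)

    def alert_admin(site):
        """Email the administrators that the site has hit its request budget."""
        message = EmailMessage()
        message["Subject"] = f"Crawl threshold reached for {site}"
        message["From"] = "crawler@example.com"
        message["To"] = ADMIN_EMAIL
        message.set_content(f"{site} has hit {REQUEST_THRESHOLD} requests; throttling may be needed.")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(message)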
Development services: