Web Crawling at Scale with Nokogiri and IronWorker


Web crawling is at the core of many web businesses. Whether you call it web crawling, web or screen scraping, or mining the web, it involves going to sites on the web, grabbing pages, and parsing them to pull out links, images, text, prices, titles, email addresses, and numerous other page attributes.

Extracting items from pages is just the start, though. After this data is collected, it often needs to be processed, transformed, aggregated, posted, or otherwise manipulated – for example, to pull out the right image to display alongside an article blurb, or to create derivative values for use with content or opinion mining.

Crawling and processing can happen over a handful of sites that are regularly monitored or it could be done across thousands of sites. The combination of IronWorker, IronMQ, and IronCache is well suited for scheduling, orchestrating, and scaling out this type of work.


Nokogiri at a Glance

Nokogiri is the principal gem that Ruby developers use for parsing HTML pages. The reasons are pretty simple – it provides a comprehensive offering with many features, supports a wide variety of HTML and XML parsing strategies (DOM, SAX, Reader and Pull), has an enthusiastic community around it (which means easier support and greater enhancements), and offers good performance.
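
Most of the examples in this post use the DOM strategy, where the whole page is parsed into a searchable tree. For very large documents, the SAX strategy avoids holding the entire tree in memory by streaming the page through callbacks. Here's a minimal sketch of the two approaches side by side (the LinkCounter class is our own illustration, not part of the crawling example that follows):


	require 'nokogiri'

	#DOM-style: parse the whole document into a tree, then query it
	doc = Nokogiri::HTML("<html><body><a href='/about'>About</a></body></html>")
	puts doc.css("a").first["href"]   #=> "/about"

	#SAX-style: stream the document through callbacks instead of
	#building a tree - useful when pages are very large
	class LinkCounter < Nokogiri::XML::SAX::Document
	  attr_reader :count
	  def initialize
	    @count = 0
	  end
	  def start_element(name, attrs = [])
	    @count += 1 if name == "a"
	  end
	end

	handler = LinkCounter.new
	Nokogiri::HTML::SAX::Parser.new(handler).parse("<html><body><a href='/about'>About</a></body></html>")
	puts handler.count                #=> 1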

Using Nokogiri within an application or worker is pretty simple. Just download the gem, require it in a worker file, and then create a Nokogiri instance. Included below is a code sample. This is part of a page processing worker within the larger web crawling example that you can find on GitHub at the link below.

The Page Processor worker below gets passed a set of page URLs and then goes through them in sequence to pull out a number of page attributes. Nokogiri takes an IO object or a String object at instantiation. In this example, we use 'open-uri' to fetch each URL and pass the resulting IO object to Nokogiri.


	page_processor.rb


	require 'open-uri'
	require 'nokogiri'

	def process_page(url)
	  #open-uri patches open() to fetch the URL; Nokogiri parses the result
	  doc = Nokogiri(open(url))
	  images, largest_image, list_of_images = process_images(doc)
	  links = process_links(doc)
	  css = process_css(doc)
	  words_stat = process_words(doc)
	end

Above is the main procedure for extracting items in a page. Below are the procedures to pull out the various page attributes using Nokogiri calls:


	def process_images(doc)
	  #get all images
	  images = doc.css("img")
	  #get the image with the largest height attribute on the page
	  largest_image = doc.search("img").max_by { |img| img["height"].to_i }
	  largest_image = largest_image ? largest_image['src'] : 'none'
	  #get the src of every image
	  list_of_images = doc.search("img").map { |img| img["src"] }
	  return images, largest_image, list_of_images
	end

	def process_links(doc)
	  #get all links
	  links = doc.css("a")
	end

	def process_css(doc)
	  #find all css includes
	  css = doc.search("[@type='text/css']")
	end

	def process_words(doc)
	  #converting to plain text and removing tags
	  text = doc.text
	  #splitting into words
	  words = text.split(/[^a-zA-Z]/)
	  #removing empty strings
	  words.delete_if { |e| e.empty? }
	  #tallying up word frequencies
	  freqs = Hash.new(0)
	  words.each { |word| freqs[word] += 1 }
	  #returning [word, count] pairs, most frequent first
	  freqs.sort_by { |_word, count| -count }
	end

Nokogiri - Deeper Dive

The full example on GitHub has a bit more sophistication to it (using IronCache to store the items collected for use by other workers). For a deeper dive on Nokogiri, there are a number of good tutorials around the web, as well as a great RailsCast.

Crawling at Scale With IronWorker

Having a code library for crawling and parsing pages, however, is only one part of the equation for doing any significant amount of work. To crawl web pages at scale, there needs to be some effort behind the scenes both to gain enough compute power to handle the load and to smoothly orchestrate all the parts within the process.

This is where Iron.io and IronWorker come in. By adding an elastic task queue to the mix, developers get the ability to scale page processing up and down without having to deal with servers, job queues, or other infrastructure concerns. A scalable task queue also lets developers distribute the work across thousands of cores. This not only provides compute power that developers can't easily achieve on their own, but it also significantly reduces the duration of crawls (in some cases by 10x or even 100x).

Two Stage Process: Web Crawling ➞ Page Processing

Web crawling at scale means structuring your code to handle specific tasks in the operation. This makes it easier to distribute the work across multiple servers as well as create an agile framework for adding and expanding functionality. The base structure is to create two types of workers – one to crawl sites to extract the page links and another to process the pages to pull out the page attributes. By separating these tasks, you can focus attention on the resources and logic around site crawling workers separate from the resources and logic around page processing.

[Diagram: Separate Workers – Separate Functions]

The goal of the design is for each worker to operate as independently and statelessly as possible. Each worker should be specific to a task as well as designed to process a finite amount of data. That way, you'll be able to distribute the workload by activity as well as by portion. For site crawling, workers can start with the head of a site and then spawn off other workers to handle sections or page-level depths as needed to distribute the load.

For page processing, workers can be designed to handle a certain number of pages. The reason is so that you can scale the work horizontally without any additional effort just by queuing up new tasks and passing in a list of pages. (In the full example and in part 2 of the blog post, we show how a message queue can help orchestrate the effort).
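
For instance, a control script (or a master worker) might split the collected links into fixed-size batches and queue one page-processing task per batch. Here's a quick sketch using the same tasks.create call covered in the next section – the "PageProcessor" name and the :urls payload key are illustrative, not from the example code:


	#queue one page-processing task per batch of 100 URLs
	batch_size = 100
	page_urls.each_slice(batch_size) do |batch|
	  @iron_worker_client.tasks.create("PageProcessor", :urls => batch)
	end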

Queuing Up Workers

Once the general structure of the workers has been created, it's a relatively simple matter to put web crawling at scale into motion. With an elastic task queue, it's primarily a case of queuing up workers to run. Processing the tasks and distributing the workload across servers is handled automatically by IronWorker.


	@iron_worker_client.tasks.create("WebCrawler", p)

The components you include when queuing a worker are the name of the worker and a payload. Optional items such as priority, timeout, and delay can also be included. An example might be to run a WebCrawler task at a higher priority (p1) and with a 30-minute timeout:


	@iron_worker_client.tasks.create("WebCrawler", p, {:priority => 1, :timeout => 1800})

One thing to note is that you need to upload your workers to IronWorker first. Once the code is in the cloud, apps can spin up as many tasks as they need. (Upload once, run many.)
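
As a rough sketch, the .worker build file for the crawler might look something like the following – the gem list here is illustrative, and iron_worker_ng derives the worker's name ("WebCrawler") from the file name:


	#web_crawler.worker
	runtime "ruby"

	#gems bundled into the worker's cloud environment
	gem "nokogiri"
	gem "iron_mq"
	gem "iron_cache"

	#entry point executed each time a task runs
	exec "web_crawler.rb"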

[Screenshot: API Spec for Queuing a Task]

Managing the Workload: Task Duration and Visibility

The amount of processing each worker handles should be long enough to amortize the setup cost of the worker but not so long as to limit the ability to distribute the work. We generally encourage web crawling and page processing tasks that run in minutes as opposed to seconds. Somewhere on the order of 5-20 minutes appears to be the sweet spot based on the data we've been able to collect. (IronWorker has a 60-minute task limit.)

[Chart: web crawling task duration]

One factor that plays into optimal task duration is the human element of monitoring tasks. Having to scroll through tons of 4-second tasks can be difficult; the same goes for digging through the log files of long-running tasks. There's no hard and fast rule, but if you construct workers to handle varying amounts of work, you can find the right fit that lets you distribute the workload efficiently as well as easily monitor the set of tasks. Below is a view of the IronWorker dashboard and how it can help you keep track of everything that's happening with a particular project.

[Screenshot: IronWorker Dashboard]

More Extensive Web Crawling

To extend the approach, you can build in master or control workers as well as workers to do even more detailed processing on the extracted data. You can also add in workers to aggregate some of the details after a crawling cycle has completed.

The advantage of task-specific workers is that they don't necessarily impact the operation of any of the other workers. This allows a system to be extended much more quickly and easily than something that's tightly bound up within an application framework.

[Diagram: Add Additional Workers to Extend Capabilities]

The process of orchestrating these tasks is made easier by other IronWorker features such as scheduling and max_concurrency as well as by using IronMQ to create the right flow between processes and IronCache to provide easy ways to share state and store global data.
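
For example, a recurring crawl can be set up once with the scheduling interface rather than queued by hand each time. A sketch, assuming the iron_worker_ng client's schedules API – the payload and the six-hour interval are arbitrary choices:


	#kick off a WebCrawler task every 6 hours (:run_every is in seconds)
	@iron_worker_client.schedules.create("WebCrawler",
	                                     {:url => "http://example.com", :depth => 3},
	                                     {:run_every => 6 * 60 * 60})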

Nokogiri, one of the top HTML parsers, and IronWorker, an elastic task queue, make a great combination for getting a lot of crawling done. Add in the other capabilities above and you have a platform that would have been out of reach for most teams just a few years ago.

Breaking Work Into Discrete Tasks and Workload Units

The core part of performing any type of work at scale is to break the work into discrete tasks and workload units. This allows the workload to be distributed easily across a set of servers. Monolithic architectures and tight bindings between processes make it difficult to move the work to elastic distributed environments.

In the case of web crawling, breaking the work up involves creating workers for specific tasks – in the base case, one worker to crawl and collect pages from sites and another worker to pull out specific page attributes from the pages.

[Diagram: Separate Workers – Separate Functions]

The other part of the process – the distribution of work across a cloud infrastructure – is where a task queue or worker system comes in. A task queue can either slot into an application as a scale component or it can be a stand-alone middleware layer within a distributed cloud-based system. In the case of the latter, different applications or triggers are queuing up tasks instead of a single application. A task queue – like Iron.io’s IronWorker service – manages the provisioning of resources and monitoring and distribution of jobs across a large set of servers. Developers just need to worry about creating and initiating tasks and the task queue performs the overhead of distributing and executing the tasks.

Tightly Coupled Workers ➞ Reduced Agility

In a model where one worker performs one task and then passes work off to another, it’s certainly possible to have the first worker queue one or more instances of the second worker. This is a simple way to handle things and is generally fine for limited flows and modest workloads.
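
In code, direct chaining might look like the sketch below – the "PageProcessor" name and :urls payload are illustrative. Notice that the crawler has to know the processor's name and payload format:


	#inside the crawler worker: directly queuing the next worker
	def hand_off(page_urls)
	  @iron_worker_client.tasks.create("PageProcessor", :urls => page_urls)
	end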

The problem with directly chaining workers together, however, is that it doesn't scale all that well. For example, some resources might be rate-limited or have certain thresholds, so a large number of concurrent workers will run up against these limits. In these situations, the workers might need the intelligence to manage the state of the operation themselves. This means additional overhead and a more brittle architecture. Also, given that worker tasks should ideally be finite in nature (there's a 60-minute time limit for tasks in IronWorker), this means added overhead to maintain state.

Max Concurrency: Setting Limits on Number of Parallel Workers

Within IronWorker, there is a way to limit the number of concurrent workers. The feature is called Max Concurrency and it lets you set the maximum number of workers that can run at any one time, per worker type. The purpose is to let developers effectively manage rate limits and external resource issues. You can still queue up as many jobs as you want, but they'll remain in the queue until the number of currently running tasks of that type falls below the concurrency limit.

When queuing from the CLI, you include the --max-concurrency argument with a value as part of the upload. So if your worker is called WebCrawlerWorker and its .worker file is called web_crawler_worker.worker, you would upload it like this:


	iron_worker upload web_crawler_worker --max-concurrency 50

But even with things like max concurrency in place, a direct connection between tasks still ends up being somewhat of an anti-pattern. The first worker has to know a great deal about the second worker, including the name of the worker and the parameters it takes. Adding a second or even a third worker to do additional processing of page attributes means then having to bring in these dependencies as well.

Using Message Queues and Data Caches to Orchestrate Many Tasks

A better way to orchestrate and coordinate large sets of connected tasks is to use message queues and key/value data caches. A message queue is used as a layer or broker between tasks. One process puts a message on a queue and another takes it off, making each worker independent of any other worker. This structure also lets each part of the processing scale independently. The tasks perform their work, put messages on a queue for other workers, and then can expire, all without causing any conflict with other parts of the process. (See the blog post on Spikability for more information on how applications can better handle unknown and/or inconsistent load.)

Adding a new task or step within the process is as simple as adding a new queue and then putting messages on it. A new worker can then take the messages off the queue and do the processing that it needs to do. An analogy might be to a Unix pipe, where the results of one command are piped to another command – each command with a defined interface, independent of any other command.
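
As a sketch of how lightweight that extension can be, assuming the iron_mq gem's named-queue interface (the queue name and message fields are our own illustration):


	#adding a processing step is just a new queue plus a new consumer
	aggregation_queue = @iron_mq_client.queue("aggregation")

	#producer side: an existing worker drops results onto the new queue
	aggregation_queue.post({:url => url, :word_count => words.size}.to_json)

	#consumer side: a new worker drains the queue, knowing nothing
	#about which worker produced the messages
	while msg = aggregation_queue.get
	  data = JSON.parse(msg.body)
	  #aggregate the stats from data here...
	  msg.delete
	end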

A key/value data cache is used to share data between tasks, store temporary processing results, and maintain global counters and variables. It simplifies the process by reducing the need to use a database for temporary or transient data. Instead of having to open and maintain database connections and wrestle with schemas, developers can just make simple HTTP requests to post key/value pairs for other workers to access. This is akin to a shared memory space, except that instead of living within a single machine, it operates at the cloud level, accessible via HTTP requests and supporting many concurrent processes across thousands of independent cores.

Putting Messages on a Queue

In the web crawling example we've created, the site crawling worker grabs each page link, puts it in a message, and then places it on a queue in IronMQ. Worker processes on the receiving end take the messages off the queue and process the page links. In this Nokogiri crawling example, the page processing worker will process 100 pages at a time.

In the web crawling worker, when a page is encountered, the link gets put in a cache in IronCache as well as placed on the message queue. The process on the receiving end of the queue will perform the page processing. (Note that in this example, other workers are also created in order to parse other portions of the site.)


	web_crawler.rb


	def process_page(url)
	  puts "Processing page #{url}"
	  #adding url to cache
	  @iron_cache_client.items.put(CGI::escape(url), {:status => "found"}.to_json)
	  #pushing url to iron_mq to process page
	  result = @iron_mq_client.messages.post(CGI::escape(url))
	  puts "Message put in queue #{result}"
	end

	def crawl_domain(url, depth)
	  #…
	      #getting page from cache
	      page_from_cache = @iron_cache_client.items.get(CGI::escape(page_url))

	      if page_from_cache.nil?
	        #page not processed yet so lets process it and queue worker if possible
	        process_page(page_url) if open_url(page_url)
	        queue_worker(depth, page_url) if depth > 1
	      else
	        puts "Link #{page_url} already processed, bypassing"
	        #page_from_cache.delete
	      end
	  #…
	end

Getting and Deleting Messages

The page processor worker gets multiple messages from the queue (get operations in IronMQ can retrieve many messages at once). These are then processed in a loop.

The message delete occurs after the message is fully processed. This is done instead of deleting the messages right after the ‘get’ operation so that if the worker fails for some reason, any unprocessed messages will automatically be put back on the queue for processing by another task.

Note that we use a cache to check if the URL has been processed (more details below).


	page_processor.rb


	def get_list_of_messages
	  #100 pages per worker at max
	  max_number_of_urls = 100
	  puts "Getting messages from IronMQ"
	  messages = @iron_mq_client.messages.get(:n => max_number_of_urls, :timeout => 100)
	  puts "Got messages from queue - #{messages.count}"
	  messages
	end

	#getting list of urls
	messages = get_list_of_messages

	#processing each url
	messages.each do |message|
	  url = CGI::unescape(message.body)
	  #checking the cache to see if the page has already been processed
	  cache_item = @iron_cache_client.items.get(CGI::escape(url))
	  if cache_item && JSON.parse(cache_item)["status"] == "processed"
	    #already handled by an earlier task, so just bump its counter
	    increment_counter(url, cache_item)
	  else
	    process_page(url)
	  end
	  #deleting the message only after processing completes
	  message.delete
	end

Using a Key/Value Data Cache

A key/value data cache is equally as valuable as a message queue in terms of separating work and orchestrating an asynchronous flow. A flexible data cache lets asynchronous processes leave data for others to act on, as well as exchange state between a set of stateless processes.

A key/value data cache differs from a database in that it is meant less as a permanent data store (although data can persist indefinitely) and more as a store for transitional activity. A data cache is particularly well-suited to storing, retrieving, and deleting data, but it doesn’t have the searching and scanning functions of a SQL or even a NoSQL database.

In the case of web crawling, a cache can be used to mark links as found or processed (so duplicate work is avoided), to store the attributes extracted from each page, and to maintain counters and other global values across tasks.

In a more expanded version of the page parsing loop, the data that’s collected gets placed in the cache for further processing. The choice of key – the URL – is arbitrary here. Other keys could be used, and even multiple items could be stored – one for each value, for example.


	page_processor.rb


	def process_page(url)
	  puts "Processing page #{url}"
	  doc = Nokogiri(open(url))
	  images, largest_image, list_of_images = process_images(doc)
	  #processing links and making them absolute
	  links = process_links(doc).map { |link| make_absolute(link['href'], url) }.compact
	  css = process_css(doc)
	  words_stat = process_words(doc)
	  puts "Number of images on page:#{images.count}"
	  puts "Number of css on page:#{css.count}"
	  puts "Number of links on page:#{links.count}"
	  puts "Largest image on page:#{largest_image}"
	  puts "Words frequency:#{words_stat.inspect}"
	  #putting all in cache
	  @iron_cache_client.items.put(CGI::escape(url), {:status => "processed",
	                                  :number_of_images => images.count,
	                                  :largest_image => CGI::escape(largest_image),
	                                  :number_of_css => css.count,
	                                  :number_of_links => links.count,
	                                  :list_of_images => list_of_images,
	                                  :words_stat => words_stat,
	                                  :timestamp => Time.now,
	                                  :processed_counter => 1}.to_json)
	end

Here we’re using IronCache to store a counter of the number of times a page might have been bypassed because it was already processed.


	def increment_counter(url, cache_item)
	  puts "Page already processed, so bypassing it and incrementing counter"
	  item = JSON.parse(cache_item)
	  #initialize the counter if it's missing, then increment
	  item["processed_counter"] = item["processed_counter"].to_i + 1
	  @iron_cache_client.items.put(CGI::escape(url), item.to_json)
	end

Note: With IronCache, the expiration of a key/value pair can be set on a per-item basis depending on the need. For paid accounts, items can be cached indefinitely. Items can also be removed from the cache manually. After a set of crawling operations, the set of links can be flushed for the next time around, or a timestamp can be included as part of the values so the key can persist from crawl to crawl.
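
As a sketch of the flush-by-expiration option, assuming the iron_cache client passes an :expires_in option (in seconds) through to the API:


	#cache a link for 24 hours so each day's crawl starts fresh
	@iron_cache_client.items.put(CGI::escape(url),
	                             {:status => "found"}.to_json,
	                             :expires_in => 24 * 60 * 60)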

Extending the Example

In this Nokogiri example, there’s one queue and one type of page processing worker. Using multiple IronMQ queues and one or more IronCache caches, it’s easy to extend the work and do even more sophisticated processing flows. For example, if a page needed to be processed in two different ways, two queues could be used, each with different workers operating on messages in the queue. If more extensive things need to happen to the data that’s collected, another queue could be added in the chain. The chain could extend as long as necessary to perform the work that’s needed.
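
A sketch of the two-queue case, again assuming the named-queue interface (the queue names are illustrative):


	#fan-out: the same URL goes onto two queues, and two different
	#worker types consume the copies independently
	text_queue  = @iron_mq_client.queue("text_analysis")
	image_queue = @iron_mq_client.queue("image_analysis")

	[text_queue, image_queue].each do |q|
	  q.post(CGI::escape(page_url))
	end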

The beauty of this approach is that developers can focus on creating simple, focused tasks that do a few things well and then pass the work off to other tasks. Log files are isolated and task-specific. Any changes made in one task are less likely to spill over to other parts of the chain. Any errors that occur therefore take place within a small range of possibilities, making them easier to isolate.

All in all, the system is easier to build, easier to scale, easier to monitor, and easier to maintain – all by adding task queues, message queues, data caches, and job scheduling that work in a distributed cloud environment.

With these types of elastic cloud components, it’s easy to break the chains of monolithic apps that make use of a limited set of servers and take advantage of reliable distributed processing at scale. What’s not to love?


