Class SpiderInstance
In: lib/spider/spider_instance.rb
Parent: Object

Methods

Public Instance methods

Add a predicate that determines whether to continue down this URL's path. All predicates must be true in order for a URL to proceed.

Takes a block that takes a string and returns a true or false value. For example, this will ensure that the URL starts with ‘http://mike-burns.com’:

 add_url_check { |a_url| a_url =~ %r{^http://mike-burns.com.*} }
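
The "all predicates must be true" rule can be pictured with a small stand-alone sketch. UrlCheckList and allowed? are hypothetical names made up for illustration; this is not Spider's own code, only one way such a check could behave.

```ruby
# Hypothetical stand-in for the spider's list of URL predicates.
# Not part of Spider; it only illustrates the "all must be true" rule.
class UrlCheckList
  def initialize
    @checks = []
  end

  # Register a block that takes a URL string and returns a truthy or falsy value.
  def add_url_check(&block)
    @checks << block
    self
  end

  # A URL proceeds only if every registered predicate accepts it.
  def allowed?(a_url)
    @checks.all? { |check| check.call(a_url) }
  end
end

checks = UrlCheckList.new
checks.add_url_check { |a_url| a_url =~ %r{^http://mike-burns\.com} }
checks.add_url_check { |a_url| a_url !~ /\.pdf$/ }
checks.allowed?('http://mike-burns.com/about')
```

Each extra predicate can only narrow the set of URLs that are followed, never widen it.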

The Web is a graph; to avoid cycles we store the nodes (URLs) already visited. The Web is a really, really, really big graph; as such, this list of visited nodes grows really, really, really big.

Use this method to change the object that stores the seen nodes. The default is an instance of Array. Spider also ships with a wrapper around memcached.

You can implement a custom class for this; any object passed to check_already_seen_with must understand just << and include? .

 # default
 check_already_seen_with Array.new

 # memcached
 require 'spider/included_in_memcached'
 check_already_seen_with IncludedInMemcached.new('localhost:11211')
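
A custom store might look like the sketch below. SeenDigests is a hypothetical class, not something shipped with Spider, and the choice to keep SHA-1 digests instead of full URLs is only an assumption about what could save memory. It mirrors the two methods Array itself provides, << and include?.

```ruby
require 'set'
require 'digest/sha1'

# Hypothetical seen-URL store: remembers a fixed-size digest of each URL
# in a Set, so lookups do not scan a growing Array.
class SeenDigests
  def initialize
    @digests = Set.new
  end

  # Record a URL as visited.
  def <<(a_url)
    @digests << Digest::SHA1.digest(a_url.to_s)
    self
  end

  # Has this URL been recorded before?
  def include?(a_url)
    @digests.include?(Digest::SHA1.digest(a_url.to_s))
  end
end
```

Under those assumptions it would be passed in the same way as the stores above, for example check_already_seen_with SeenDigests.new.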

Reset the headers hash.

Use like a hash:

 headers['Cookie'] = 'user_id=1;password=btrross3'

Add a response handler. A response handler's trigger can be :every, :success, :failure, or any HTTP status code. The handler itself can be either a Proc or a block.

The arguments to the block are: the URL as a string, an instance of Net::HTTPResponse, and the prior URL as a string.

For example:

 on 404 do |a_url, resp, prior_url|
   puts "URL not found: #{a_url}"
 end

 on :success do |a_url, resp, prior_url|
   puts a_url
   puts resp.body
 end

 on :every do |a_url, resp, prior_url|
   puts "Given this code: #{resp.code}"
 end
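
To picture how several triggers can apply to one response, here is a stand-alone sketch. HandlerTable and handlers_for are hypothetical names, and the rule that :success means a 2xx status while everything else counts as :failure is an assumption, not something this page states about Spider.

```ruby
# Hypothetical handler registry, not Spider's implementation.
class HandlerTable
  def initialize
    @handlers = Hash.new { |hash, trigger| hash[trigger] = [] }
  end

  # Accepts either a Proc argument or a block, as described above.
  def on(trigger, handler = nil, &block)
    @handlers[trigger] << (handler || block)
    self
  end

  # Handlers that would fire for a numeric status code:
  # :every, then :success or :failure (ASSUMED: 2xx is success),
  # then any handler registered for that exact code.
  def handlers_for(code)
    outcome = (200..299).cover?(code) ? :success : :failure
    [:every, outcome, code].flat_map { |trigger| @handlers.fetch(trigger, []) }
  end
end

table = HandlerTable.new
table.on(404) { |a_url, resp, prior_url| puts "URL not found: #{a_url}" }
table.on(:every) { |a_url, resp, prior_url| puts "Visited #{a_url}" }
table.handlers_for(404).each { |h| h.call('http://mike-burns.com/missing', nil, nil) }
```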

Run before the HTTP request. Given the URL as a string.

 setup do |a_url|
   headers['Cookie'] = 'user_id=1;admin=true'
 end

The Web is a really, really, really big graph; as such, this list of nodes to visit grows really, really, really big.

Use this method to change the object that stores the nodes we have yet to walk. The default is an instance of Array. Spider also ships with a wrapper around Amazon SQS.

You can implement a custom class for this; any object passed to store_next_urls_with must understand just push and pop .

 # default
 store_next_urls_with Array.new

 # AmazonSQS
 require 'spider/next_urls_in_sqs'
 store_next_urls_with NextUrlsInSQS.new(AWS_ACCESS_KEY, AWS_SECRET_ACCESS_KEY, queue_name)
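
As an illustration of the push/pop contract, the sketch below is a first-in, first-out store. FifoUrlStore is a hypothetical class, not part of Spider. The default Array hands back the most recently pushed item from pop, so a store like this would visit older entries first; whether Spider relies on any method beyond push and pop has not been checked here.

```ruby
# Hypothetical next-URL store: pop returns the oldest pushed item
# instead of the newest, giving queue (FIFO) order.
class FifoUrlStore
  def initialize
    @items = []
  end

  # Add whatever the spider hands over to the back of the queue.
  def push(item)
    @items.push(item)
    self
  end

  # Remove and return the oldest item, or nil when nothing is left.
  def pop
    @items.shift
  end
end
```

Under those assumptions it would be installed with store_next_urls_with FifoUrlStore.new.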

Run last, once for each page. Given the URL as a string.
