Class: SpiderInstance
In: lib/spider/spider_instance.rb
Parent: Object
add_url_check adds a predicate that determines whether to continue down this URL's path. All predicates must be true in order for a URL to proceed.
Takes a block that takes a string and produces a boolean. For example, this will ensure that the URL starts with 'http://mike-burns.com':
  add_url_check { |a_url| a_url =~ %r{^http://mike-burns.com.*} }
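The "all predicates must be true" rule can be illustrated outside of Spider with plain lambdas. This is only a sketch of the semantics, not Spider's internal implementation; the names URL_CHECKS and follow? are hypothetical and exist purely for illustration.

```ruby
# Hypothetical stand-in for the blocks registered through add_url_check.
URL_CHECKS = [
  ->(a_url) { a_url.start_with?('http://mike-burns.com') },
  ->(a_url) { !a_url.end_with?('.pdf') }
].freeze

# A URL proceeds only if every predicate returns a truthy value.
def follow?(checks, a_url)
  checks.all? { |check| check.call(a_url) }
end
```

Under these assumptions, a URL on the right host is still rejected if any single check fails, for example because it ends in .pdf.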
The Web is a graph; to avoid cycles we store the nodes (URLs) already visited. The Web is a really, really, really big graph; as such, this list of visited nodes grows really, really, really big.
Use check_already_seen_with to change the object used to store these seen nodes. The default object is an instance of Array. Spider also ships with a wrapper around memcached.
You can implement a custom class for this; any object passed to check_already_seen_with must understand just << and include? .
  # default
  check_already_seen_with Array.new

  # memcached
  require 'spider/included_in_memcached'
  check_already_seen_with IncludedInMemcached.new('localhost:11211')
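A custom store could look like the following minimal sketch. SeenInSet is a hypothetical class, not part of Spider, and it assumes the membership test is Ruby's usual include? (the method the default Array responds to). Backing the store with a Set keeps lookups fast as the list of visited URLs grows.

```ruby
require 'set'

# Hypothetical custom store (not shipped with Spider): remembers visited
# URLs in a Set so that membership checks do not scan the whole list.
class SeenInSet
  def initialize
    @urls = Set.new
  end

  # Record a visited URL; returns self so calls can be chained like Array#<<.
  def <<(a_url)
    @urls << a_url
    self
  end

  # True if the URL has been recorded before.
  def include?(a_url)
    @urls.include?(a_url)
  end
end
```

An instance would then be passed in the same way as the examples above, that is, check_already_seen_with SeenInSet.new.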
Add a response handler. A response handler's trigger can be :every, :success, :failure, or any HTTP status code. The handler itself can be either a Proc or a block.
The arguments to the block are: the URL as a string, an instance of Net::HTTPResponse, and the prior URL as a string.
For example:
  on 404 do |a_url, resp, prior_url|
    puts "URL not found: #{a_url}"
  end

  on :success do |a_url, resp, prior_url|
    puts a_url
    puts resp.body
  end

  on :every do |a_url, resp, prior_url|
    puts "Given this code: #{resp.code}"
  end
A setup block runs before each HTTP request. It is given the URL as a string.
  setup do |a_url|
    headers['Cookie'] = 'user_id=1;admin=true'
  end
The Web is a really, really, really big graph; as such, this list of nodes to visit grows really, really, really big.
Use store_next_urls_with to change the object used to store nodes we have yet to walk. The default object is an instance of Array. Spider also ships with a wrapper around Amazon SQS.
You can implement a custom class for this; any object passed to store_next_urls_with must understand just push and pop .
  # default
  store_next_urls_with Array.new

  # AmazonSQS
  require 'spider/next_urls_in_sqs'
  store_next_urls_with NextUrlsInSQS.new(AWS_ACCESS_KEY, AWS_SECRET_ACCESS_KEY, queue_name)
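As a rough sketch of a custom store, the hypothetical FifoUrlStore below (not part of Spider) implements only push and pop, on the assumption stated above that nothing else is required. Unlike Array, whose pop returns the most recently pushed item, this version returns the oldest item first, which would make the walk closer to breadth-first.

```ruby
# Hypothetical custom store (not shipped with Spider): a first-in,
# first-out queue exposing only push and pop.
class FifoUrlStore
  def initialize
    @items = []
  end

  # Add an item to the back of the queue; returns self for chaining.
  def push(item)
    @items.push(item)
    self
  end

  # Remove and return the oldest item, or nil when the queue is empty.
  def pop
    @items.shift
  end
end
```

It would be installed with store_next_urls_with FifoUrlStore.new, mirroring the default example.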