Problem

When you try to download resources from a CloudFlare-protected site with Scrapy, most of the time you'll get an HTTP 403 error. This is CloudFlare's so-called "I'm under attack" mode, in which downloading resources via "plain" access (like curl http://resource_url) is restricted.
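
For illustration, here is a minimal sketch of such a "plain" request in Python; the URL is just a placeholder, and against a CloudFlare-protected resource a request like this typically fails with a 403:

import urllib2

# No cookie, no referer, no browser-like user agent - a "plain" request.
try:
    urllib2.urlopen('http://example.com/some/image.jpg')
except urllib2.HTTPError as e:
    print e.code  # typically 403 for CloudFlare-protected resources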

Solution

Step 1: Cookie

First of all, you should get CloudFlare's cookie. They set it when you visit the website for the first time without such a cookie, and, by default, Scrapy uses it for all Request objects it produces in the same scrapy crawl session.

But if you, like me, don't use Request objects to download, for example, images, you must get this cookie yourself by overriding the parse_start_url() method of the Spider class.

def parse_start_url(self, response):
    # Keep only the "name=value" part of the first Set-Cookie header
    self.cookie = response.headers.get('Set-Cookie').split(';', 1)[0]
    return super(Spider, self).parse_start_url(response)

In my case it was the first and only cookie I was getting, but you should not rely on that. Instead, look for the cookie named __cfduid.
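
If the response carries several Set-Cookie headers, Scrapy's response.headers.getlist() returns all of them, so a slightly more defensive sketch of the same method could look like this:

def parse_start_url(self, response):
    # Look through every Set-Cookie header and keep the CloudFlare one
    for header in response.headers.getlist('Set-Cookie'):
        cookie = header.split(';', 1)[0]
        if cookie.startswith('__cfduid='):
            self.cookie = cookie
            break
    return super(Spider, self).parse_start_url(response)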

Step 2: Referer

Always set a proper Referer header; response.url is totally OK for this.
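
If you do download through Scrapy Request objects, the same thing can be done via the headers argument; a sketch of a spider method (save_image is a hypothetical callback, not part of the full example below):

from scrapy.http import Request
from scrapy.selector import Selector


def parse_item(self, response):
    image_url = Selector(response).xpath('//img/@src').extract()[0]
    # response.url is the page the image was found on - a natural Referer
    return Request(image_url,
                   headers={'Referer': response.url},
                   callback=self.save_image)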

Step 3: Use "real" User-Agent

CloudFlare doesn't like non-"browser-ish" user agents, so use one from your browser. You may want to check the What's My User Agent? website for this purpose.
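
For the requests Scrapy itself makes, the easiest place to set this is the USER_AGENT setting in settings.py; the string below is just an example taken from a desktop Chrome:

# settings.py
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36')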

Full example

In my case it'd be something like this:

import urllib2
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector


USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36')


def download(url, path, referer=None, cookie=None):
    req = urllib2.Request(url)
    if referer:
        req.add_header("Referer", referer)
    if cookie:
        req.add_header("Cookie", cookie)
    req.add_header("User-Agent", USER_AGENT)
    u = urllib2.urlopen(req)
    f = open(path, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print u"Downloading: %s Bytes: %s" % (path, file_size)

    file_size_dl = 0
    block_sz = 8192
    while True:
        _buffer = u.read(block_sz)
        if not _buffer:
            break

        file_size_dl += len(_buffer)
        f.write(_buffer)
        status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status += chr(8) * (len(status) + 1)
        print status,

    f.close()


class Spider(CrawlSpider):
    name = 'spider_name'
    start_urls = [
        'http://example.com'
    ]
    allowed_domains = ['example.com', ]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'page\.php\?id=\d+$',)), callback='parse_item'),
    )

    def parse_start_url(self, response):
        self.cookie = response.headers.get('Set-Cookie').split(';', 1)[0]
        return super(Spider, self).parse_start_url(response)

    def parse_item(self, response):
        sel = Selector(response)
        image_url = sel.xpath('//img/@src').extract()[0]
        # Name the downloaded file after the last part of the image URL
        path = image_url.split('/')[-1]
        return download(image_url, path, response.url, self.cookie)

Basically, that's it!

