python - How to use Scrapy to crawl a website which hides the URL as href="javascript:;" in the next button

- August 15, 2013

I have been learning Python and Scrapy lately. I googled and searched around for a few days, but I can't seem to find any instructions on how to crawl multiple pages on a website with hidden URLs (<a href="javascript:;"). Each page contains 20 listings, and each time you click the ">>" button it loads the next 20 items. I can't figure out how to find the actual URLs; the source code is below for reference. Any pointers are appreciated.

By visiting the site with a web browser and its web developer tools activated (the following screenshots were made with Firefox and the Firebug add-on), you should be able to analyze the network requests and responses. They show that the site's pagination buttons send requests like the following:

[Screenshot: Firebug network view of the request sent by a pagination button]

So the URL seems to be:

http://rent.591.com.hk/?m=home&c=search&a=rslist&type=1&shtype=list&p=2&searchtype=1
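To see what that URL is made of, it can be split into its query parameters. The following is a minimal sketch using only the standard library; the helper name page_url is hypothetical, and the assumption that p is the page counter is inferred from the p=2 in the captured request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# URL captured from the pagination request above.
url = ("http://rent.591.com.hk/?m=home&c=search&a=rslist"
       "&type=1&shtype=list&p=2&searchtype=1")


def page_url(base_url, page):
    """Return base_url with its `p` query parameter set to `page`.

    Hypothetical helper: assumes `p` is the page counter.
    """
    parts = urlsplit(base_url)
    params = parse_qs(parts.query)  # each value is a list of strings
    params["p"] = [str(page)]
    return parts._replace(query=urlencode(params, doseq=True)).geturl()


if __name__ == "__main__":
    # Print every query parameter, then the URL for the third page.
    for key, values in parse_qs(urlsplit(url).query).items():
        print(key, "=", values[0])
    print(page_url(url, 3))
```

Changing only p while leaving the other parameters untouched is presumably what the ">>" button does behind the scenes.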

But it's not a normal request; it's an XMLHttpRequest. This is indicated under the Headers tab. And the response is in JSON:

[Screenshot: Firebug view of the request headers and the JSON response]
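What marks a request as an XMLHttpRequest is an extra HTTP header. As a rough sketch outside of Scrapy, such a request object can be built with the standard library as shown here. The helper name build_xhr_request is hypothetical, the header values are the ones used in the spider in this answer, and nothing is sent over the network.

```python
from urllib.request import Request


def build_xhr_request(url):
    """Build (but do not send) a GET request that looks like an XMLHttpRequest."""
    headers = {
        "Referer": "http://rent.591.com.hk/?hl=en-us",
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(url, headers=headers)


if __name__ == "__main__":
    req = build_xhr_request(
        "http://rent.591.com.hk/?m=home&c=search&a=rslist&type=1&p=2&searchtype=1"
    )
    print(req.get_method())
    # urllib stores header names via str.capitalize(), e.g. "X-requested-with".
    print(req.header_items())
```

Whether the server actually requires this header is an assumption; sending it simply mimics what the browser does.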

So you don't need to grab the data out of complicated nested HTML structures; you can read it directly from the JSON dict.
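As an illustration of how little parsing that takes, here is a small sketch. The sample body is made up for the example; only the key names items, address and purpose are taken from the spider in this answer, and the real response most likely contains more fields.

```python
import json

# Hypothetical response body; the real one comes from the site.
sample_body = (
    b'{"items": ['
    b'{"address": "Example Road 1", "purpose": "Residential"},'
    b'{"address": "Example Road 2", "purpose": "Office"}'
    b']}'
)


def extract_listings(body):
    """Parse a JSON body and return the address and purpose of each item."""
    data = json.loads(body)
    return [
        {"address": item["address"], "purpose": item["purpose"]}
        for item in data.get("items", [])
    ]


if __name__ == "__main__":
    for listing in extract_listings(sample_body):
        print(listing["address"], "-", listing["purpose"])
```

No XPath or CSS selectors are involved; a missing items key simply yields an empty list.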

I ended up with this Scrapy code (with room for improvement):

import json

import scrapy


class RentObject(scrapy.Item):
    address = scrapy.Field()
    purpose = scrapy.Field()
    # Add more fields as needed


class ScrapeSpider(scrapy.Spider):

    name = "rent_hk"
    allowed_domains = ['591.com.hk']
    start_urls = ['http://rent.591.com.hk/?hl=en-us#list']

    page_number = 0
    page_num_max = 5  # for test purposes, grab only up to 5 pages

    def parse(self, response):

        # Only responses to our own paginated requests carry JSON data;
        # the start URL response is plain HTML and is skipped here.
        if 'page_number' in response.meta:
            result_dict = json.loads(response.body)  # get the data as a dict
            for entry in result_dict['items']:
                ro = RentObject()
                ro['address'] = entry['address']
                ro['purpose'] = entry['purpose']
                yield ro

        # Make a request for the (next page of) JSON data
        self.page_number += 1

        payload = {
            'm': 'home',
            'c': 'search',
            'a': 'rslist',
            'type': '1',
            'p': str(self.page_number),
            'searchtype': '1'
        }

        if self.page_number < self.page_num_max:
            # With method='GET', FormRequest appends formdata as the query string.
            request = scrapy.FormRequest(url='http://rent.591.com.hk/',
                                         method='GET',
                                         formdata=payload,
                                         headers={'Referer': 'http://rent.591.com.hk/?hl=en-us',
                                                  'X-Requested-With': 'XMLHttpRequest'},
                                         callback=self.parse)
            request.meta['page_number'] = self.page_number
            yield request

The site is not easy for a Scrapy beginner, so I compiled this detailed answer.
