python - How to use Scrapy to crawl a website which hides the URL as href="javascript:;" in the next button
I have been learning Python and Scrapy lately. I googled and searched around for a few days, but I can't seem to find any instructions on how to crawl multiple pages on a website with hidden URLs - <a href="javascript:;". Each page contains 20 listings; each time you click on the ">>" button, it loads the next 20 items. I can't figure out how to find the actual URLs. Below is the source code for reference. Any pointers are appreciated.
By visiting the site with a web browser and its web developer tools activated (the following screenshots were made with Firefox and the Firebug add-on), you should be able to analyze the network requests and responses. They show that the site's pagination buttons send requests like the following:
So the URL seems to be:
http://rent.591.com.hk/?m=home&c=search&a=rslist&type=1&shtype=list&p=2&searchtype=1
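To make the structure of that URL easier to see, here is a small sketch that splits the query string into its parameters and rebuilds the URL for another page. It assumes that `p` is the page number (which is how the spider further down uses it); the helper name `page_url` is made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# The URL captured from the browser's network panel.
url = ("http://rent.591.com.hk/?m=home&c=search&a=rslist"
       "&type=1&shtype=list&p=2&searchtype=1")

# Split the query string into a dict of parameters.
params = dict(parse_qsl(urlsplit(url).query))


def page_url(page):
    """Return the captured URL with only the page parameter `p` changed.

    Assumption: `p` selects the result page.
    """
    query = dict(params, p=str(page))
    return "http://rent.591.com.hk/?" + urlencode(query)


print(params)
print(page_url(3))
```

Seen this way, paging through the listings should only require changing `p` and leaving the other parameters alone.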
But it's not a normal request. It's an XMLHttpRequest. This is indicated under the Header tab. And the response is in JSON:
So you don't need to grab the data from complicated nested HTML structures; you can get it directly from a JSON dict.
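As a rough illustration of what that means, the sketch below parses a JSON body with the standard library and pulls out two fields per listing. The sample body is invented for the example: only the key names `items`, `address` and `purpose` are taken from the spider in this answer, and the real response very likely contains more fields and different values. The function name `extract_listings` is also hypothetical.

```python
import json

# Hypothetical response body, shaped like the keys the spider reads.
sample_body = """
{"items": [
    {"address": "Example Road 1", "purpose": "Residential"},
    {"address": "Example Road 2", "purpose": "Office"}
]}
"""


def extract_listings(body):
    """Parse a JSON body and keep only the address and purpose of each item."""
    data = json.loads(body)
    return [
        {"address": entry["address"], "purpose": entry["purpose"]}
        for entry in data["items"]
    ]


listings = extract_listings(sample_body)
for listing in listings:
    print(listing["address"], "-", listing["purpose"])
```

No XPath or CSS selectors are involved; once the body is loaded, each listing is just a plain dict.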
I ended up with this Scrapy code (with room for improvement):
import json

import scrapy


class RentObject(scrapy.Item):
    address = scrapy.Field()
    purpose = scrapy.Field()
    # add more fields as needed


class ScrapeSpider(scrapy.Spider):
    name = "rent_hk"
    allowed_domains = ['591.com.hk']
    start_urls = ['http://rent.591.com.hk/?hl=en-us#list']

    page_number = 0
    page_num_max = 5  # for test purposes, grab only 5 pages

    def parse(self, response):
        # Only responses to our own paginated requests carry JSON;
        # the initial start_urls response is plain HTML and is skipped.
        if 'page_number' in response.meta:
            result_dict = json.loads(response.body)  # the data is a dict
            for item in result_dict['items']:
                ro = RentObject()
                ro['address'] = item['address']
                ro['purpose'] = item['purpose']
                yield ro

        # make the request for the next page of JSON data
        self.page_number += 1
        payload = {
            'm': 'home',
            'c': 'search',
            'a': 'rslist',
            'type': '1',
            'p': str(self.page_number),
            'searchtype': '1',
        }

        if self.page_number < self.page_num_max:
            # With method='GET', FormRequest appends formdata to the URL
            # as a query string; the headers mimic the browser's XHR call.
            request = scrapy.FormRequest(
                url='http://rent.591.com.hk/',
                method='GET',
                formdata=payload,
                headers={'Referer': 'http://rent.591.com.hk/?hl=en-us',
                         'X-Requested-With': 'XMLHttpRequest'},
                callback=self.parse)
            request.meta['page_number'] = self.page_number
            yield request
The site is not easy for a Scrapy beginner, so I compiled this detailed answer.