Python urllib2 web scraping 401 error but reachable in browser


I am trying to scrape this and similar pages using Python:

url = "http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html" 

While I can navigate to the page in a browser, I'm getting a 401 authentication error from urllib2 and can't figure out why. To be clear, I understand the article is behind a paywall, but the things I'm interested in (title, authors, volume, references, etc.) are freely available, and I don't have a subscription.

from urllib2 import urlopen
urlopen("http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html")

I've tried changing the User-Agent, thinking the site is somehow detecting that I'm not using a browser:

request = urllib2.Request(url)
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
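A complete version of that attempt would look something like the following sketch (for this page it still comes back 401):

import urllib2

url = "http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html"
opener = urllib2.build_opener()
# addheaders is a list of (header, value) pairs sent with every request
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
html = opener.open(url).read()  # still raises HTTPError: 401 here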

As a non-web developer, it's not clear to me how I can solve this, or even figure out what the obstacle is.

If you use the developer tools in Chrome, it shows that in the browser, too, this particular page is giving a 401 Unauthorized response. Unfortunately, urllib2 raises an exception on an error response, which makes it harder to see the contents.
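The error response is still inspectable, though: urllib2's HTTPError doubles as a file-like response object, so something like this (a Python 2 sketch) will at least show the status and headers:

import urllib2

try:
    urllib2.urlopen("http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html")
except urllib2.HTTPError as e:
    print e.code    # 401
    print e.info()  # the response headers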

Complicating the case further is the fact that nature.com doesn't seem to set the Content-Encoding header to indicate that it has gzipped the response, even though it has.
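One quick way to confirm that (not part of the original answer): gzip data always begins with the magic bytes 0x1f 0x8b, so you can test the raw error body directly:

import urllib2

try:
    urllib2.urlopen("http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html")
except urllib2.HTTPError as e:
    body = e.read()
    # gzip streams always start with the magic bytes 0x1f 0x8b
    print body[:2] == '\x1f\x8b'  # True -> the 401 body is gzipped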

Try this:

import urllib2
import cStringIO as StringIO
import gzip

def getDataFromCompressedError(url):
    try:
        urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        # the HTTPError is also a file-like response object,
        # so the gzipped body can be read straight from it
        data = e.read()
        strFile = StringIO.StringIO(data)
        gz = gzip.GzipFile(fileobj=strFile)
        return gz.read()

if __name__ == "__main__":
    print getDataFromCompressedError("http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html")
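If you're on Python 3, urllib2 was split into urllib.request and urllib.error; a rough, untested equivalent of the same idea (illustrative helper name) would be:

import gzip
import io
import urllib.error
import urllib.request

def get_data_from_compressed_error(url):
    try:
        urllib.request.urlopen(url)
    except urllib.error.HTTPError as e:
        # the error is file-like here too; decompress its gzipped body
        return gzip.GzipFile(fileobj=io.BytesIO(e.read())).read()

if __name__ == "__main__":
    print(get_data_from_compressed_error(
        "http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html"))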
