Python urllib2 web scraping: 401 error but page reachable in browser
I am trying to scrape this and similar pages using Python:
url = "http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html"
While I can navigate to the page in a browser, I'm getting a 401 authentication error with urllib2 and can't figure out why. To be clear, I understand the article is behind a paywall, but I'm interested in things like the title, authors, volume, references, etc., which are freely available and don't require a subscription.
from urllib2 import urlopen
urlopen("http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html")
I've tried changing the user-agent, thinking the site is somehow detecting that I'm not using a browser:
import urllib2
request = urllib2.Request(url)
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
opener.open(request)
As a non-web developer, it's not clear to me how I can solve this, or even how to figure out what the obstacle is.
If you use the developer tools in Chrome, you'll see that even in a browser this particular page gives a 401 Unauthorized response. Unfortunately, urllib2 raises an exception on an error response, which makes it harder to see the contents.
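The exception object for an HTTP error is itself file-like, though, so you can still inspect the response. A minimal sketch:

import urllib2

url = "http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html"
try:
    urllib2.urlopen(url)
except urllib2.HTTPError as e:
    # HTTPError doubles as a response object
    print e.code               # 401
    print e.info()             # the response headers
    print repr(e.read()[:10])  # first few bytes of the raw body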
Complicating the case further is the fact that nature.com doesn't seem to set the Content-Encoding header to indicate that it has gzipped the response, even though it has.
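You can verify this yourself: gzip streams start with the magic bytes \x1f\x8b, so a quick sketch (same URL as above) shows a compressed body arriving with no Content-Encoding header:

import urllib2

url = "http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html"
try:
    urllib2.urlopen(url)
except urllib2.HTTPError as e:
    print e.info().getheader('Content-Encoding')  # expect None: header not set
    print e.read()[:2] == '\x1f\x8b'              # expect True: body is gzipped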
Try this:
import urllib2
import cStringIO as StringIO
import gzip

def getDataFromCompressedError(url):
    try:
        urllib2.urlopen(url)
    except urllib2.HTTPError as e:
        # the error response is file-like; read the (gzipped) body
        data = e.read()
        strfile = StringIO.StringIO(data)
        # wrap it in a GzipFile so we can decompress it manually
        gz = gzip.GzipFile(fileobj=strfile)
        return gz.read()

if __name__ == "__main__":
    print getDataFromCompressedError("http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html")
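If the goal is just metadata like the title, one lightweight route is the standard-library HTMLParser. A minimal sketch that pulls the <title> tag out of the HTML returned by getDataFromCompressedError above (TitleParser is a hypothetical helper; for authors, volume, and references you would extend it to whatever tags nature.com actually uses):

from HTMLParser import HTMLParser

class TitleParser(HTMLParser):
    # hypothetical helper: collects the text inside the <title> tag
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = getDataFromCompressedError("http://www.nature.com/nature/journal/v521/n7553/full/nature14410.html")
parser = TitleParser()
parser.feed(html)
print parser.title.strip()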