The task is simple: use Python to download all PDFs linked from:
https://www.electroimpact.com/Company/Patents.aspx
I am just a beginner in Python. I have read about Python crawlers, but the samples deal with plain HTML pages, not .aspx, and all I got was a blank downloaded file.
Following is my code:
import urllib2
import re

def saveFile(url, fileName):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    with open(fileName, 'wb') as handle:
        handle.write(response.read())

def main():
    base_url = 'https://www.electroimpact.com/Company/Patents/'
    page = 'https://www.electroimpact.com/Company/Patents.aspx'
    request = urllib2.Request(page)
    response = urllib2.urlopen(request)
    url_lst = re.findall('href.*(US.*\.pdf)', response.read())
    print url_lst
Result:
['US5201205.pdf', 'US5279024.pdf', 'US5339598.pdf', 'US9021688B2.pdf']
My regular expression only finds 4 PDFs, but the page actually links to many more. Why?
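For completeness, once the list is correct I plan to hand each filename to saveFile, roughly like this (a minimal sketch continuing main() above; it assumes every filename in url_lst resolves under base_url, which I have not verified for all patents):

    # inside main(), after url_lst has been built
    for name in url_lst:
        # e.g. https://www.electroimpact.com/Company/Patents/US5201205.pdf
        saveFile(base_url + name, name)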