
The task is simple: use Python to download all the PDFs linked from:

https://www.electroimpact.com/Company/Patents.aspx

I am a beginner with Python. I have read about Python crawlers, but the samples deal with .html pages, not .aspx, and all I got was a blank downloaded file.

Following is my code:

import urllib2
import re

def saveFile(url, fileName):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    with open(fileName,'wb') as handle:
        handle.write(response.read())

def main():
    base_url = 'https://www.electroimpact.com/Company/Patents/'
    page = 'https://www.electroimpact.com/Company/Patents.aspx'
    request = urllib2.Request(page)
    response = urllib2.urlopen(request)
    url_lst = re.findall(r'href.*(US.*\.pdf)', response.read())
    print url_lst

if __name__ == '__main__':
    main()

Result: 
    ['US5201205.pdf', 'US5279024.pdf', 'US5339598.pdf', 'US9021688B2.pdf']

Only four PDFs were found by my regular expression, but the page actually links many more. Why?

  • ASPX is still HTML; it's just a different file extension, like PHP. – Steve Jan 11 '17 at 16:04
  • Thanks for the hint. I am trying to solve this with urllib2 + re, but there must be something wrong with my regular expression: many items are missing. Can you help me find the error? – user7405020 Jan 12 '17 at 15:19
  • Unfortunately I'm not a Python programmer. – Steve Jan 12 '17 at 15:45
  • [You can't parse HTML with regex.](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) Use some library that actually _can_ parse HTML like BeautifulSoup 4 or `lxml.html`. – BlackJack Jan 25 '17 at 13:21
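BlackJack's comment points at the root cause of the fragility, but the specific reason so few names come back is most likely greed: `.*` consumes as much of a line as possible, and `.` does not cross newlines, so `re.findall` can yield at most one capture per line of HTML. A minimal sketch with a hypothetical line of markup (two links on one line, similar to the real page) contrasting the greedy pattern with an attribute-anchored one:

```python
import re

# Hypothetical line of HTML containing two patent links.
html_line = '<a href="Patents/US5201205.pdf">1</a> <a href="Patents/US5279024.pdf">2</a>'

# Greedy: .* swallows the rest of the line, so only one capture survives,
# and it is the last US...pdf on the line.
greedy = re.findall(r'href.*(US.*\.pdf)', html_line)

# Anchored to the href attribute and forbidden from crossing the closing
# quote: one capture per link.
precise = re.findall(r'href="[^"]*(US[^"]*\.pdf)"', html_line)

print(greedy)
print(precise)
```

This demonstrates the behavior, not a recommended fix: as the linked answer says, an HTML parser is the robust tool here.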

1 Answer


With `lxml.html` and cssselect instead of `re` you get all of the linked patent document paths:

#!/usr/bin/env python
# coding: utf8
from __future__ import absolute_import, division, print_function
import urllib2
from lxml import html


def main():
    url = 'https://www.electroimpact.com/Company/Patents.aspx'
    source = urllib2.urlopen(url).read()
    document = html.fromstring(source)
    patent_paths = [
        a.attrib['href'] for a in document.cssselect('div.PatentNumber a')
    ]
    print(patent_paths)


if __name__ == '__main__':
    main()
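The printed paths are relative to the page they came from, so they still need to be resolved into absolute URLs before they can be fetched. A sketch of that final step using only the standard library, under the assumption that the hrefs look like `Patents/US....pdf` (the example uses the Python 3 module names; on Python 2 the same functions live in `urlparse` and `urllib2`):

```python
from urllib.parse import urljoin    # Python 2: from urlparse import urljoin
from urllib.request import urlopen  # Python 2: urllib2.urlopen

PAGE_URL = 'https://www.electroimpact.com/Company/Patents.aspx'


def to_absolute(paths, page_url=PAGE_URL):
    """Resolve hrefs (relative or absolute) against the page they came from."""
    return [urljoin(page_url, path) for path in paths]


def download(url, file_name):
    """Fetch one PDF and write it to disk in binary mode."""
    with open(file_name, 'wb') as handle:
        handle.write(urlopen(url).read())


# Hypothetical paths, as printed by the script above:
paths = ['Patents/US5201205.pdf', 'Patents/US5279024.pdf']
print(to_absolute(paths))
```

`urljoin` handles the resolution the way a browser would, so it works whether the page emits relative paths or full URLs; each absolute URL can then be passed to `download()` with a file name of your choosing.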
BlackJack