
I'm trying to traverse a web page and find and download all PDFs. I have a solution ripped from another question for finding the links ending in .pdf using lxml (I found it much faster than my own code, which used mechanize), but I don't know how to use this to save those files to a folder. Can urlretrieve be used with lxml, and if so, how?

My code:

import lxml.html 
import urllib2 
import urlparse
from urllib import urlretrieve

base_url = 'http://www.example.html'
folder = r"C:\Users\Meelah\Desktop\test_pdfs"  # raw string so the backslashes aren't treated as escapes

response = urllib2.urlopen(base_url)

tree = lxml.html.fromstring(response.read())

ns = {'re': 'http://exslt.org/regular-expressions'}

for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    print urlparse.urljoin(base_url, node.attrib['href'])
    # code here to save each file
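
For reference, this is roughly what I imagine the saving step could look like, assuming the target folder already exists and that the last path segment of each link works as a local file name (both untested assumptions on my part):

import os

for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    pdf_url = urlparse.urljoin(base_url, node.attrib['href'])
    # take the last path segment of the URL as the local file name
    filename = os.path.basename(urlparse.urlsplit(pdf_url).path)
    # urlretrieve fetches the URL and writes it straight to the given path
    urlretrieve(pdf_url, os.path.join(folder, filename))

Is something along these lines the right way to wire urlretrieve into the lxml loop?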
Meelah
  • Please state exactly what is not working and what the error messages are... – Flavian Hautbois Oct 13 '14 at 15:40
  • I don't think your question has anything to do with xpath or lxml, assuming that your code already successfully returns the URLs pointing to the PDF files. What you need is a Python library to download a file specified by a URL. – Marcus Rickert Oct 13 '14 at 15:45
  • Apologies if I wasn't clear @Flav, the code as it stands is working fine and printing all the links ending in .pdf to the screen. The problem is just that I don't know where to go from here. I don't know how to call urlretrieve based on this code or which other library I should use. – Meelah Oct 13 '14 at 15:56
  • Duplicate of http://stackoverflow.com/questions/9751197/opening-pdf-urls-with-pypdf in my opinion, then. – Flavian Hautbois Oct 13 '14 at 15:59

0 Answers