I'm trying to traverse a web page and find and download all PDFs. I have a solution adapted from another question that finds the links ending in .pdf using lxml (I found it much faster than my own code using mechanize), but I don't know how to use it to save those files to a folder. Can urlretrieve be used with lxml, and if so, how?
My code:
import lxml.html
import urllib2
import urlparse
from urllib import urlretrieve
base_url = 'http://www.example.html'
folder = "C:\Users\Meelah\Desktop\test_pdfs"
response = urllib2.urlopen(base_url)
tree = lxml.html.fromstring(response.read())
ns = {'re': 'http://exslt.org/regular-expressions'}
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    print urlparse.urljoin(base_url, node.attrib['href'])
    # code here to save it
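For what it's worth, this is a minimal sketch of the saving step I have in mind, using urlretrieve and naming each local file after the last path component of the URL. It assumes base_url, tree, ns, and folder are defined as above and that the folder already exists; only the import of os is new:

import os

for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    pdf_url = urlparse.urljoin(base_url, node.attrib['href'])
    # use the last segment of the URL path as the local filename
    filename = os.path.basename(urlparse.urlsplit(pdf_url).path)
    # download the PDF into the target folder
    urlretrieve(pdf_url, os.path.join(folder, filename))
    print 'saved', pdf_url

Is this roughly the right approach, or is there a better way to pair urlretrieve with the lxml results?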