I'm writing a scraper that opens a CSV file, reads a list of links, extracts a specific HTML tag from each page (the speeches), and saves the content to a TXT file named after the day the speech was given.
Here is what I have so far:
#encoding:utf-8
import csv
import urllib
import lxml.html
import unicodedata

# Each row of links.csv holds one URL in its first column
objeto = csv.reader(open('links.csv', 'rU'), dialect=csv.excel_tab)
for link in objeto:
    connection = urllib.urlopen(link[0])
    dom = lxml.html.fromstring(connection.read())
    # Collect the text of every paragraph of the speech
    discurso = []
    for d in dom.xpath('//div[@id="content-core"]/div/p/text()'):
        discurso.append(d)
    d1 = " ".join(discurso)
    # Build the file name from the publication date shown on the page
    data = dom.xpath('//span[@class="documentPublished"]/text()[normalize-space()]')
    data1 = [date.strip() for date in data]
    make_string = "-".join(data1)
    file = open(make_string+'.txt', 'w')
    file.write(d1)
    file.close()
I was able to extract the date and the speech, but the final step is not working. When I try to save the speech to a TXT file, IDLE shows this message:
IOError: [Errno 2] No such file or directory: '17/12/2010 23h39,.txt'
I've tried opening the file with both 'w' and 'a', but it fails either way. What am I doing wrong?
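One detail that may be relevant: the name in the error message contains `/` characters, which the operating system treats as directory separators, so `open` ends up looking for a file inside directories named `17` and `12` that do not exist. Below is a minimal sketch of how the scraped date could be turned into a name without separators. `safe_filename` is a hypothetical helper that is not part of the script above, and replacing everything except letters and digits with hyphens is just one assumed convention.

```python
import re


def safe_filename(raw_date, extension=".txt"):
    """Turn a scraped date string into a file name with no path separators."""
    # Collapse every run of characters other than ASCII letters and digits
    # (slashes, spaces, commas, ...) into a single hyphen.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "-", raw_date)
    # Drop any hyphen left at the start or end, then add the extension.
    return cleaned.strip("-") + extension


# The date string taken from the error message above.
print(safe_filename("17/12/2010 23h39,"))
```

In the script, this would presumably mean calling something like `open(safe_filename(make_string), 'w')` instead of `open(make_string+'.txt', 'w')`.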