It's not pretty code, but I have some code that pulls values out of an HTML file and gives me five strings: author, title, date, length, and text. I have 2000+ HTML files and I want to go through all of them and write this data to a single CSV file. I know all of this will have to be wrapped in a for loop eventually, but before then I am having a hard time understanding how to go from getting these values to writing them to a CSV file. My thinking was to create a list or a tuple first and then write that to a line in a CSV file:
import csv
import re
from bs4 import BeautifulSoup as soup

the_file = "/Users/john/Code/tedtalks/test/transcript?language=en.0"
holding = soup(open(the_file).read(), "lxml")

# the <title> text looks like "Author: Title | ...", so slice on the separators
at = holding.find("title").text
author = at[0:at.find(':')]
title = at[at.find(":") + 1:at.find("|")]

date = re.sub('[^a-zA-Z0-9]', ' ', holding.select_one("span.meta__val").text)

# the last mm:ss timestamp in the transcript gives the talk's length in seconds
length_data = holding.find_all('data', {'class': 'talk-transcript__para__time'})
times = [x.get_text().strip("\n\r") for x in length_data
         if re.search(r"(?s)\d{2}:\d{2}", x.get_text().strip("\n\r"))]
(m, s) = times[-1].split(':')
length = int(m) * 60 + int(s)

# drop parenthesized cues like "(Laughter)", then strip stray characters
firstpass = re.sub(r'\([^)]*\)', '', holding.find('div', class_='talk-transcript__body').text)
text = re.sub(r'[^a-zA-Z\.\']', ' ', firstpass)

# this is the part I can't get right:
data = ([author].join() + [title] + [date] + [length] + [text])

with open("./output.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in data:
        writer.writerow(line)
I can't for the life of me figure out how to get Python to respect the fact that these are strings and should be stored as strings and not as lists of letters. (The .join() above is me trying to figure this out.)
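My best guess, and it is only a guess, is that writerow wants one flat list of field values per row, rather than being handed each string to iterate over, so the last few lines would become something like this:

data = [author, title, date, length, text]

with open("./output.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    # one call per row: writerow treats each list element as one field
    writer.writerow(data)

Is that the right idea?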
Looking ahead: is it better/more efficient to handle 2000 files this way, stripping them down to what I want and writing one line of the CSV at a time, or is it better to build a data frame in pandas and then write that to CSV? (All 2000 files total 160MB, so stripped down the eventual data can't be more than 100MB; no great size here, but looking forward, size may eventually become an issue.)
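For what it's worth, here is a rough sketch of what I imagine the pandas route would look like, where parse_file is a hypothetical wrapper around the extraction code above that returns the five values for one file:

import glob
import pandas as pd

rows = []
for the_file in glob.glob("/Users/john/Code/tedtalks/test/*"):
    # parse_file is hypothetical: it would wrap the extraction code above
    author, title, date, length, text = parse_file(the_file)
    rows.append({'author': author, 'title': title, 'date': date,
                 'length': length, 'text': text})

# build the whole frame in memory, then dump it all at once
df = pd.DataFrame(rows)
df.to_csv("./output.csv", index=False)

Would that be preferable to writing one row at a time, given the sizes above?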