
It's not pretty code, but I have some code that grabs a series of strings out of an HTML file: author, title, date, length, text. I have 2000+ HTML files and I want to go through all of them and write this data to a single CSV file. I know all of this will eventually have to be wrapped in a for loop, but before then I am having a hard time understanding how to go from getting these values to writing them to a CSV file. My thinking was to create a list or a tuple first and then write that to a line in a CSV file:

import re
import csv
from bs4 import BeautifulSoup as soup

the_file = "/Users/john/Code/tedtalks/test/transcript?language=en.0"
holding = soup(open(the_file).read(), "lxml")
at = holding.find("title").text
author = at[0:at.find(':')]
title = at[at.find(":") + 1:at.find("|")]
date = re.sub('[^a-zA-Z0-9]', ' ', holding.select_one("span.meta__val").text)
length_data = holding.find_all('data', {'class': 'talk-transcript__para__time'})
(m, s) = ([x.get_text().strip("\n\r")
           for x in length_data if re.search(r"(?s)\d{2}:\d{2}",
                                             x.get_text().strip("\n\r"))][-1]).split(':')
length = int(m) * 60 + int(s)
firstpass = re.sub(r'\([^)]*\)', '', holding.find('div', class_='talk-transcript__body').text)
text = re.sub('[^a-zA-Z\.\']', ' ', firstpass)
data = ([author].join() + [title] + [date] + [length] + [text])
with open("./output.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in data:
        writer.writerow(line)

I can't for the life of me figure out how to get Python to respect the fact that these are strings and should be stored as strings and not as lists of letters. (The .join() above is me trying to figure this out.)
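Here is a minimal demonstration of the character-splitting I'm seeing, using just the `csv` module on its own (not my actual data):

import csv
import io

buf = io.StringIO()
wr = csv.writer(buf)

# one row, five fields: pass a list of strings.
wr.writerow(["author", "title", "date", "140", "text"])
# pass a bare string and each character ends up in its own column.
wr.writerow("author")

print(buf.getvalue())
# author,title,date,140,text
# a,u,t,h,o,r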

Looking ahead: is it better/more efficient to handle 2000 files this way, stripping them down to what I want and writing one line of the CSV at a time, or is it better to build a data frame in pandas and then write that to CSV? (All 2000 files = 160 MB, so stripped down the eventual data can't be more than 100 MB; no great size issue here, but size may eventually become one.)
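If the pandas route is the better one, I imagine it would look roughly like this, with a hypothetical `parse()` that returns the five fields for one file:

import pandas as pd
from glob import iglob

# hypothetical: parse(path) returns (author, title, date, length, text) for one file.
rows = [parse(path) for path in iglob("./talks/*.html")]
df = pd.DataFrame(rows, columns=["author", "title", "date", "length", "text"])
df.to_csv("output.csv", index=False)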

John Laudun
  • If you just want a csv file then creating a dataframe first and then saving to a csv file is not going to be faster than just creating the csv. There seems to be an awful lot of regex in your code; if you can upload a couple of your files I can have a look at making the code a little easier to follow, which may make the writing a lot easier. Also `writerow` takes an iterable, so what you want to be passing is a list of data: `writerow([" ".join(author), title, date, length, text])` – Padraic Cunningham May 30 '16 at 23:48
  • That's **very** generous of you, @PadraicCunningham. The current work is here: https://github.com/johnlaudun/tedtalks. There's a `test` directory with 3 files in it, which, I hope, gives a partial explanation of why my regex/soup code is so ugly... – John Laudun May 30 '16 at 23:53
  • No worries, I see a few things we can tidy up from a brief look but I will have a proper look tomorrow when I get a bit of free time. – Padraic Cunningham May 31 '16 at 00:10

1 Answer


This will grab all the files and put the data into a CSV; you just need to pass the path to the folder that contains the HTML files and the name of your output file:

import re
import csv
from bs4 import BeautifulSoup
from glob import iglob


def parse(soup):
    # both the author and the title can be parsed from their own tags.
    author = soup.select_one("h4.h12.talk-link__speaker").text
    title = soup.select_one("h4.h9.m5").text
    # just need to strip the date string, no regex needed.
    date = soup.select_one("span.meta__val").text.strip()
    # we want the last timestamp, i.e. the talk-transcript__para__time previous to the footer.
    mn, sec = map(int, soup.select_one("footer.footer").find_previous("data", {
        "class": "talk-transcript__para__time"}).text.split(":"))
    length = mn * 60 + sec
    # to ignore times etc. we can pull from the actual text fragments and remove noise, i.e. (Applause).
    text = re.sub(r'\([^)]*\)', "", " ".join(d.text for d in soup.select("span.talk-transcript__fragment")))
    return author.strip(), title.strip(), date, length, re.sub('[^a-zA-Z\.\']', ' ', text)

def to_csv(patt, out):
    # open the file to write to.
    with open(out, "w") as out:
        # create the csv.writer.
        wr = csv.writer(out)
        # write our headers.
        wr.writerow(["author", "title", "date", "length", "text"])
        # get all our html files.
        for html in iglob(patt):
            with open(html) as f:
                # parse the file and write the data to a row.
                wr.writerow(parse(BeautifulSoup(f, "lxml")))

to_csv("./test/*.html", "output.csv")
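One note for Python 3: the `csv` module's docs recommend opening the output file with `newline=""` so the writer controls line endings itself; without it you can get blank rows between records on some platforms. A one-line tweak to the function above:

with open(out, "w", newline="") as out_file:
    wr = csv.writer(out_file)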
Padraic Cunningham
  • That is just lovely. Last night as I walked home, I realized it was time to try on some "big boy python pants" and move some of this into a function, so you've also given me a tutorial here on doing that. A couple of questions: first, I would like to credit you in the script. Is that okay? And how would you like to be credited? Second, I see you're encoding `author` and `title` explicitly, which is producing a `b'string'` in the csv. Why was that? Thank you again. – John Laudun May 31 '16 at 16:22
  • Not quite sure what's happening here. When I point it at the "big" directory, which has the exact same 3 files plus 2000 more, it throws an error `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte`. I'm tracking this down now... – John Laudun May 31 '16 at 16:41
  • Fascinating. Point it at `./test` and it works. Point it at `./talks` and it doesn't. – John Laudun May 31 '16 at 16:44
  • I did not realise you were using Python 3; you can remove the `.encode()` calls. I ran the code on all the files, but I was using Python 2. – Padraic Cunningham May 31 '16 at 17:07
  • I created a second test directory: this time with the first ten files. One of those files must have an odd character in it. If I assume that there will be other odd characters ... do you have any recommendations on how to clean it? (I should pose this as another question, I think?) – John Laudun May 31 '16 at 21:04
  • @JohnLaudun, the issue is how the files are encoded; try setting the encoding to `'cp1252'`, since `0x80` is a Euro sign in Windows-1252 (see the sketch after these comments). – Padraic Cunningham May 31 '16 at 21:22
  • I narrowed it down, in the first set of ten, to the fourth text, so one after the three in the test set. Where do you suggest setting the encoding? – John Laudun May 31 '16 at 21:53
  • Run the edited code and add the error output. Also, can you share a link to where you downloaded the actual data from? I just want to check something. – Padraic Cunningham May 31 '16 at 21:55
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/113463/discussion-between-john-laudun-and-padraic-cunningham). – John Laudun May 31 '16 at 22:09
  • I think I finally figured out how I was getting that error: since the script points at a directory and processes its entire contents, it also wants to process the hidden `.DS_Store` file. I'm re-using the script [Padraic](http://stackoverflow.com/users/2141635/padraic-cunningham) created -- I almost wrote "helped", but really you wrote it and I had to modify it to address new kinds of content -- but in the process the script is listing off files as it works, and I have no idea why. The error popped up, and there in the Jupyter notebook console is `.DS_Store`. – John Laudun Jun 03 '16 at 15:49
  • @JohnLaudun, cool, that makes sense. I made a little change to how we get the files from the directory: we can use `glob` to only match files ending in `.html`, so you can have whichever files you like in the same directory. The listing of the files was just a leftover print, which I also removed. – Padraic Cunningham Jun 03 '16 at 23:41
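A minimal sketch of the encoding fallback suggested in the comments above, assuming the stray files are Windows-1252 (cp1252) while the rest are UTF-8; the actual mix across the 2000 files is unverified, and `read_html` is a hypothetical helper:

def read_html(path):
    # try utf-8 first; fall back to cp1252 for files like the one with the 0x80 byte.
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except UnicodeDecodeError:
        with open(path, encoding="cp1252") as f:
            return f.read()

# usage inside to_csv's loop:
#     wr.writerow(parse(BeautifulSoup(read_html(html), "lxml")))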