
I'm trying to extract the contents of several tables from an HTML page without any of the HTML's indentation or tab formatting. I use get_text() to pull out the content I want, but the result prints with a lot of whitespace and tabs. I've tried .strip(), but it doesn't do what I want.

Here's the Python script I'm using:

import csv, simplejson, urllib
url="http://www.thecomedystudio.com/schedule.html"
response=urllib.urlopen(url)
from bs4 import BeautifulSoup
html = response
soup = BeautifulSoup(html.read())
text = soup.get_text()
print text

In the end, I'd like to create a CSV of the event calendar, but first I'd like to produce a .txt file or something similar that doesn't need much manual cleaning.

Any help appreciated.

alecxe
Huessy

1 Answer


You don't need to "clean up" the HTML in order to parse it with BeautifulSoup.

Just parse the dates and events into a csv file directly:

import csv
from urllib2 import urlopen

from bs4 import BeautifulSoup


url = "http://www.thecomedystudio.com/schedule.html"
soup = BeautifulSoup(urlopen(url))

with open('output.csv', 'wb') as f:
    writer = csv.writer(f)

    for item in soup.select('td div[align=center] > b'):
        date = ' '.join(el.strip() for el in item.find_all(text=True))
        event = item.parent.parent.find_next_sibling('td').get_text(strip=True)

        writer.writerow([date, event])

The contents of output.csv after running the script:

Fri 2.27.15,"Rick Canavan hosts with Christine An, Rachel Bloom, Dan Crohn, Wes Hazard, James Huessy, Kelly MacFarland, Peter Martin, Ted Pettingell."
Sat 2.28.15,"Rick Jenkins hosts Taylor Connelly, Lilian DeVane, Andrew Durso, Nate Johnson, Peter Martin, Andrew Mayer, Kofi Thomas, Tim Willis."
Sun 3.1.15,"Peter Martin hosts Sunday Funnies with Nonye Brown-West, Ryan Donahue, Joe Kozlowski, Casey Malone, Etrane Martinez, Kwasi Mensah, Anthony Zonfrelli, Christa Weiss and Sam Jay closing."
Tue 3.3.15,Mystery Lounge! The old-est and only-est magic show in New England! with guest comedian Ryan Donahue.
...
Thu 12.31.15,"New Year's Eve! with Rick Jenkins, Nathan Burke."
Fri 1.1.16,Rick Canavan hosts New Year's Day.
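
If a plain-text dump is still wanted as a first step, as the question mentions, note that .strip() only trims the two ends of a string, so the blank lines and tabs inside the get_text() output survive. Here is a minimal sketch of one way to squeeze them out, assuming Python 3 and using only the standard library. squeeze_whitespace is a made-up helper name, and the sample string is invented rather than taken from the page:

```python
def squeeze_whitespace(text):
    """Collapse runs of spaces/tabs inside each line and drop blank lines.

    Hypothetical helper, intended to be fed the string returned by
    soup.get_text().
    """
    # line.split() with no arguments splits on any run of whitespace,
    # so joining with a single space normalizes tabs and repeated spaces.
    cleaned = (" ".join(line.split()) for line in text.splitlines())
    # Lines that were only whitespace are now empty strings; skip them.
    return "\n".join(line for line in cleaned if line)


# Invented sample imitating messy get_text() output.
raw = "\n\n\t\tFri\t 2.27.15  \n   \n\t Rick Canavan hosts \n"
print(squeeze_whitespace(raw))
```

With BeautifulSoup itself, soup.get_text("\n", strip=True) gets much of the way there, since it strips each text fragment and drops the empty ones.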
alecxe
  • Tried running your very useful script and ran into this error: `C:\Python>python ComedyStudio.py Traceback (most recent call last): File "ComedyStudio.py", line 13, in for item in soup.select('td div[align=center] > b'): File "C:\Python27\lib\site-packages\bs4\element.py", line 1370, in select for candidate in _use_candidate_generator(tag): File "C:\Python27\lib\site-packages\bs4\element.py", line 1198, in descendants current = current.next_element AttributeError: 'NoneType' object has no attribute 'next_element'` Is my version of bs4 wrong somehow? – Huessy Mar 04 '15 at 20:16
  • @Huessy do you have the `lxml` or `html5lib` module installed? If not, please install one of them and rerun the script. – alecxe Mar 04 '15 at 20:22
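
The parser suggestion above can be made explicit by passing the parser name to BeautifulSoup rather than letting it pick whichever one is installed. Below is a rough Python 3 sketch of the answer's approach under that assumption. fetch_html, extract_rows, write_csv, and SAMPLE_HTML are names invented here, and the sample fragment is only a guess at the shape of the schedule table, not a copy of the real page. "html.parser" ships with Python; "lxml" or "html5lib" can be passed instead once installed.

```python
import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Invented fragment, assumed to resemble the schedule table's layout.
SAMPLE_HTML = """
<table>
  <tr>
    <td><div align="center"><b>Fri<br>2.27.15</b></div></td>
    <td>  Rick Canavan hosts with Christine An.  </td>
  </tr>
  <tr>
    <td><div align="center"><b>Sat<br>2.28.15</b></div></td>
    <td>  Rick Jenkins hosts Taylor Connelly.  </td>
  </tr>
</table>
"""


def fetch_html(url):
    """Download a page and return its raw bytes (not called in this sketch)."""
    with urlopen(url) as response:
        return response.read()


def extract_rows(html, parser="html.parser"):
    """Return [date, event] pairs using the selector from the answer."""
    soup = BeautifulSoup(html, parser)  # parser is named explicitly
    rows = []
    for item in soup.select("td div[align=center] > b"):
        # Join the text pieces of the <b> tag, e.g. "Fri" and "2.27.15".
        date = " ".join(item.stripped_strings)
        # The event text is assumed to sit in the next <td> of the same row.
        cell = item.find_parent("td").find_next_sibling("td")
        event = cell.get_text(strip=True) if cell is not None else ""
        rows.append([date, event])
    return rows


def write_csv(rows, path):
    """Write the rows to a CSV file; newline="" is what the csv module expects."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


print(extract_rows(SAMPLE_HTML))
```

Against the live page, this would be used as write_csv(extract_rows(fetch_html(url), "html5lib"), "output.csv"), with url set as in the answer and html5lib installed.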