
I'm doing some web scraping, and I now have a list of 911 links saved as follows (I've included 5 to demonstrate how they're stored):

every_link = ['http://www.millercenter.org/president/obama/speeches/speech-4427',
              'http://www.millercenter.org/president/obama/speeches/speech-4425',
              'http://www.millercenter.org/president/obama/speeches/speech-4424',
              'http://www.millercenter.org/president/obama/speeches/speech-4423',
              'http://www.millercenter.org/president/obama/speeches/speech-4453']

These URLs link to presidential speeches over time. I want to store each individual speech (so, 911 unique speeches) in a separate text file, or be able to group them by president. I'm trying to run the following function over these links:

def processURL(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
    item_str = item_div.text.lower()
    item_str_processed = punctuation.sub('',item_str)
    item_str_processed_final = item_str_processed.replace('—',' ')

for l in every_link:
    processURL(l)

So, I want to save the words from each processed speech to a unique text file. This might look like the following, with obama_44xx representing individual text files:

obama_4427 = "blah blah blah"
obama_4425 = "blah blah blah"
obama_4424 = "blah blah blah"
...

I'm trying the following:

for l in every_link:
    processURL(l)
    obama.write(processURL(l))

But that's not working... Is there another way I should go about this?

blacksite
  • It's nothing, yet. I am trying to figure out how best to store 911 different iterations of `processURL`. So, perhaps for speeches by Obama, I could save one of his speeches to one file (e.g. `obama_4427`) and another to another file (e.g. `obama_4428`), etc. – blacksite Sep 23 '15 at 19:11
  • Are all the links of the same format? i.e. `http://www.millercenter.org/president//speeches/speech-`? – wpercy Sep 23 '15 at 19:14
  • Yes. The speech numbers seem more random (i.e. the numbers they used to store Obama's speeches likely differ from those they used to store Clinton's speeches). – blacksite Sep 23 '15 at 19:17

2 Answers

1

Okay, so you have a couple of issues. First of all, your processURL function doesn't actually return anything, so when you try to write the return value of the function, it's going to be None. Maybe try something like this:

def processURL(link):
    # assumes urllib2, BeautifulSoup, and your compiled `punctuation` regex
    # are already imported/defined, as in your question
    open_url = urllib2.urlopen(link).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
    item_str = item_div.text.lower()
    item_str_processed = punctuation.sub('', item_str)
    item_str_processed_final = item_str_processed.replace('—', ' ')

    # e.g. 'http://www.millercenter.org/president/obama/speeches/speech-4427'
    # splits into ['http:', '', 'www.millercenter.org', 'president', 'obama', 'speeches', 'speech-4427']
    splitlink = link.split("/")
    president = splitlink[4]                  # 'obama'
    speech_num = splitlink[-1].split("-")[1]  # '4427'
    filename = "{0}_{1}".format(president, speech_num)

    return filename, item_str_processed_final # returning a tuple

for link in every_link:
    filename, content = processURL(link) # yay tuple unpacking
    with open(filename, 'w') as f:
        f.write(content)

This will write each file to a filename that looks like president_number. So for example, it will write Obama's speech with id number 4427 to a file called obama_4427. Lemme know if that works!
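And if you'd also like the option of grouping speeches by president, as mentioned in the question, one way (just a sketch building on the processURL above, which returns an 'obama_4427'-style filename plus the text) is to drop each file into a per-president directory:

import os

for link in every_link:
    filename, content = processURL(link)   # e.g. ('obama_4427', '...speech text...')
    president = filename.split("_")[0]      # 'obama'
    if not os.path.exists(president):       # one folder per president
        os.makedirs(president)
    with open(os.path.join(president, filename + ".txt"), 'w') as f:
        f.write(content)

That keeps the president_number naming while letting you work on one president's speeches at a time.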

wpercy
    That's an elegant solution - thank you! It wrote the scraped text to text files and stored them in my U:/ drive. That's exactly what I was looking for... Thanks, again. – blacksite Sep 23 '15 at 19:37
1

You have to have the processURL function return the text you want written, and then add the code that writes to disk inside the loop. Something like this:

def processURL(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
    item_str = item_div.text.lower()
    #item_str_processed = punctuation.sub('',item_str)
    #item_str_processed_final = item_str_processed.replace('—',' ')
    return item_str

for l in every_link:
    speech_text = processURL(l).encode('utf-8').decode('ascii', 'ignore')
    speech_num = l.split("-")[1]  # works because these URLs have a single hyphen: 'speech-4427' -> '4427'
    with open("obama_"+speech_num+".txt", 'w') as f:  # note: hard-codes the 'obama_' prefix for every speech
        f.write(speech_text)

The .encode('utf-8').decode('ascii', 'ignore') is purely for dealing with non-ascii characters in the text. Ideally you would handle them in a different way, but that depends on your needs (see Python: Convert Unicode to ASCII without errors).
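For example, one alternative (just a sketch, assuming Python 2 and that you'd rather keep the non-ASCII characters than drop them) is to write the unicode text out as UTF-8 with io.open:

import io

for l in every_link:
    speech_text = processURL(l)                 # leave the unicode text as-is
    speech_num = l.split("-")[1]
    with io.open("obama_" + speech_num + ".txt", 'w', encoding='utf-8') as f:
        f.write(speech_text)                    # io.open encodes to UTF-8 on write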

Btw, the 2nd link in your list is 404. You should make sure your script can handle that.
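One way to handle that (a minimal sketch; urllib2.urlopen raises HTTPError on a 404, so processURL will propagate it) is to wrap the call in a try/except and skip dead links:

import urllib2

for l in every_link:
    try:
        speech_text = processURL(l)
    except urllib2.HTTPError as e:   # raised for 404s and other HTTP errors
        print "Skipping {0} (HTTP {1})".format(l, e.code)
        continue
    # ... then clean up and write speech_text to disk as above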

Jason Q. Ng
    When I initially posted the links, there were about 1600 duplicates that I hadn't noticed before. So, I just came up with some random 44xx directory for purposes of demonstration. And, I dealt with the ASCII issue earlier in the script. Thanks for your help! – blacksite Sep 23 '15 at 19:41