
I've got a working text scraper for one URL. The problem is that I need to scrape 25 more URLs. These URLs are almost the same; the only difference is the last letter. Here's the code, to be more clear:

import urllib2
from bs4 import BeautifulSoup

f = open('(path to file)/names', 'a')
links = ['http://guardsmanbob.com/media/playlist.php?char='+ chr(i) for i in range(97,123)]

response = urllib2.urlopen(links[0]).read()
soup = BeautifulSoup(response)

for tr in soup.findAll('tr'):
    if not tr.find('td'): continue
    for td in tr.find('td').findAll('a'):
        f.write(td.contents[0] + '\n')

I can't get this script to run all the URLs from the list in one go. All I've managed to get is the first song name from each URL. Sorry for my English; I hope you understand me.

Jon Clements
ignassz

1 Answer

I can't get this script to run all the URLs from the list in one go.

Save your code in a function with one parameter, *args (or whatever name you want, just don't forget the *). The * packs any number of positional arguments into a tuple inside the function; to spread your list out into separate arguments, call the function with start_download(*links). There is no official name for *, but some people (including me) are fond of calling it the splat operator.
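As a quick standalone illustration of this packing/unpacking behavior (a toy `collect` function, not part of the answer's code):

```python
def collect(*args):
    # args arrives as a tuple of all positional arguments
    return args

nums = [1, 2, 3]
print(collect(nums))    # one argument, the whole list -> ([1, 2, 3],)
print(collect(*nums))   # * at the call site spreads the list -> (1, 2, 3)
```

Passing the bare list (`collect(nums)`) means the function sees a single list argument; that is why a call like `start_download(links)` would hand the entire list to `urllib2.urlopen` and fail.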

def start_download(*args):
    for value in args:
        # for debugging purposes
        # print value

        response = urllib2.urlopen(value).read()
        # put the rest of your code here

if __name__ == '__main__':
    links = ['http://guardsmanbob.com/media/playlist.php?char=' +
             chr(i) for i in range(97, 123)]

    # the * spreads the list into separate arguments
    start_download(*links)

Edit: Or you could just loop directly over your list of links and download each one.

    links = ['http://guardsmanbob.com/media/playlist.php?char=' +
             chr(i) for i in range(97, 123)]

    for link in links:
        response = urllib2.urlopen(link).read()
        # put the rest of your code here

Edit 2:

For getting all the link texts (the song titles) and saving them in a file, here's the entire code with specific comments:

import urllib2
from bs4 import BeautifulSoup, SoupStrainer

links = ['http://guardsmanbob.com/media/playlist.php?char='+ 
          chr(i) for i in range(97,123)]

for link in links:
    response = urllib2.urlopen(link).read()
    # gets all <a> tags
    soup = BeautifulSoup(response, parse_only=SoupStrainer('a'))
    # unnecessary link texts to be removed
    not_included = ['News', 'FAQ', 'Stream', 'Chat', 'Media',
                    'League of Legends', 'Forum', 'Latest', 'Wallpapers',
                    'Links', 'Playlist', 'Sessions', 'BobRadio', 'All',
                    'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
                    'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T',
                    'U', 'V', 'W', 'X', 'Y', 'Z', 'Misc', 'Play',
                    'Learn more about me', 'Chat info', 'Boblights',
                    'Music Playlist', 'Official Facebook',
                    'Latest Music Played', 'Muppets - Closing Theme',
                    'Billy Joel - The River Of Dreams',
                    'Manic Street Preachers - If You Tolerate This '
                    'Your Children Will Be Next',
                    'The Bravery - An Honest Mistake',
                    'The Black Keys - Strange Times',
                    'View whole playlist', 'View latest sessions',
                    'Referral Link', 'Donate to BoB',
                    'Guardsman Bob', 'Website template',
                    'Arcsin']

    # create a file named "test.txt"
    # write to file and close afterwards
    with open("test.txt", 'w') as output:
        for hyperlink in soup:
            if hyperlink.text:
                if hyperlink.text not in not_included:
                    # print hyperlink.text
                    output.write("%s\n" % hyperlink.text.encode('utf-8'))

Here's the output saved in test.txt:

(screenshot of test.txt showing the list of scraped song titles)

I suggest you change test.txt to a different filename (e.g. S Song Titles) every time you loop over your list of links, because opening the file in 'w' mode overwrites the previous contents.
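One way to do that automatically, sketched here with a hypothetical `filename_for` helper (not part of the original answer): derive the playlist letter from each URL's query string and build a per-letter filename, so each iteration writes its own file instead of clobbering test.txt.

```python
def filename_for(link):
    # take the value after the last '=' in the query string, e.g. 'a'
    char = link.rsplit('=', 1)[-1]
    return '%s Song Titles.txt' % char

links = ['http://guardsmanbob.com/media/playlist.php?char=' + chr(i)
         for i in range(97, 123)]

for link in links:
    out_name = filename_for(link)
    # then use open(out_name, 'w') in place of open("test.txt", 'w')
```

This produces "a Song Titles.txt" through "z Song Titles.txt", one file per playlist page.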

Annie Lagang
  • While using the first method, I get this: AttributeError: 'list' object has no attribute 'timeout'. And while using the second method, I only get the first song name from each URL. How can I fix that? – ignassz Jan 25 '13 at 00:30
  • Ok, so you were able to iterate over your list of links using the 2nd method. Then you wanted to get all the song names for each URL, am I correct? – Annie Lagang Jan 25 '13 at 00:36
  • Then I assume you wanted to save the links in a file? OK, I will edit my answer. – Annie Lagang Jan 25 '13 at 00:41
  • I don't need the links. All I need is to get the song names and put them into the file. With the second method I can do this, but I only get the first song, not all of them. – ignassz Jan 25 '13 at 00:47
  • Oops, I mean you wanted to save the link text (song title) in a file. :) – Annie Lagang Jan 25 '13 at 01:27