import requests
from bs4 import BeautifulSoup

source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.text)
tables = soup.find_all("table")

I'm trying to get a list of song names from the "List of singles" table on Taylor Swift's discography page.

The table has no unique class or id. The only unique thing I can think of is the caption tag around "List of singles..."

List of singles as main artist, with selected chart positions, sales figures and certifications

I tried:

table = soup.find_all("caption")

but it returns nothing. I'm assuming that caption is not a recognized tag in bs4?

alecxe
UdaraW

2 Answers


This actually has nothing to do with findAll() versus find_all(). findAll() was the BeautifulSoup 3 name and was kept in BeautifulSoup 4 as an alias for compatibility reasons. Quoting the bs4 source code:

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
    generator = self.descendants
    if not recursive:
        generator = self.children
    return self._find_all(name, attrs, text, limit, generator, **kwargs)

findAll = find_all       # BS3
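To see the aliasing in practice without hitting the network, here is a minimal sketch on a small inline HTML snippet. The markup is made up for illustration and is not the real Wikipedia page; it also shows that caption is found like any other tag. Newer bs4 releases may emit a deprecation warning for findAll().

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the Wikipedia table.
html = """
<table>
  <caption>List of singles</caption>
  <tr><th scope="row">"Our Song"</th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Both spellings run the same search, so the results compare equal.
captions_new = soup.find_all("caption")
captions_old = soup.findAll("caption")
same_results = captions_new == captions_old

# <caption> is matched like any other tag name.
caption_texts = [c.get_text(strip=True) for c in captions_new]
print(same_results, caption_texts)
```

If both calls return the caption here but return nothing on the live page, the cause is more likely the downloaded markup or the parser than the method name.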

There is also a nicer way to get the list of singles: rely on the span element with id="Singles", which marks the start of the Singles section. Use find_next_sibling() to get the first table after the span tag's parent, then collect all th elements with scope="row":

from bs4 import BeautifulSoup
import requests


source_code = requests.get('http://en.wikipedia.org/wiki/Taylor_Swift_discography')
soup = BeautifulSoup(source_code.content, 'html.parser')

table = soup.find('span', id='Singles').parent.find_next_sibling('table')
for single in table.find_all('th', scope='row'):
    print(single.text)

Prints:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
"Mine"
"Back to December"
"Mean"
"The Story of Us"
"Sparks Fly"
"Ours"
"Safe & Sound"
(featuring The Civil Wars)
"Long Live"
(featuring Paula Fernandes)
"Eyes Open"
"We Are Never Ever Getting Back Together"
"Ronan"
"Begin Again"
"I Knew You Were Trouble"
"22"
"Highway Don't Care"
(with Tim McGraw)
"Red"
"Everything Has Changed"
(featuring Ed Sheeran)
"Sweeter Than Fiction"
"The Last Time"
(featuring Gary Lightbody)
"Shake It Off"
"Blank Space"
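The same navigation can be tried offline. This is a sketch on made-up markup that only imitates the heading/table layout described above; the real page structure may differ or change over time. It also guards against a missing anchor and joins multi-line cells such as the "(featuring ...)" entries with a space.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the layout: heading span, then the table.
html = """
<h2><span id="Albums">Albums</span></h2>
<table><tr><th scope="row">"Fearless"</th></tr></table>
<h2><span id="Singles">Singles</span></h2>
<p>Intro text.</p>
<table>
  <caption>List of singles</caption>
  <tr><th scope="col">Title</th></tr>
  <tr><th scope="row">"Tim McGraw"</th></tr>
  <tr><th scope="row">"Safe &amp; Sound"<br/><span>(featuring The Civil Wars)</span></th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the section anchor, then the first <table> sibling after its parent heading.
anchor = soup.find("span", id="Singles")
table = anchor.parent.find_next_sibling("table") if anchor else None

# Keep only row headers; join the pieces of each cell with a single space.
titles = []
if table is not None:
    titles = [th.get_text(" ", strip=True) for th in table.find_all("th", scope="row")]

print(titles)
```

Note that find_next_sibling("table") skips the intervening paragraph, and the scope="row" filter leaves out the column header.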
alecxe

Here is a complete example that solves the "Taylor Swift problem". First, look for the caption that contains the text "List of singles" and move to its parent, the table. Next, iterate over the row-header cells, which hold the text you are looking for:

# `soup` is the BeautifulSoup object built in the question
for caption in soup.findAll("caption"):
    if "List of singles" in caption.text:
        break

table = caption.parent
for item in table.findAll("th", {"scope": "row"}):
    print(item.text)

This gives:

"Tim McGraw"
"Teardrops on My Guitar"
"Our Song"
"Picture to Burn"
"Should've Said No"
"Change"
"Love Story"
"White Horse"
"You Belong with Me"
"Fifteen"
"Fearless"
"Today Was a Fairytale"
...
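One caveat with the loop above: if no caption matches, caption is simply left pointing at the last one found. A possible variation, sketched here on made-up markup rather than the real page, uses next() with a default so that a missing caption is handled explicitly:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two captioned tables.
html = """
<table>
  <caption>List of studio albums</caption>
  <tr><th scope="row">"Fearless"</th></tr>
</table>
<table>
  <caption>List of singles as main artist</caption>
  <tr><th scope="row">"Tim McGraw"</th></tr>
  <tr><th scope="row">"Our Song"</th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# First caption containing the phrase, or None when nothing matches.
caption = next(
    (c for c in soup.find_all("caption") if "List of singles" in c.get_text()),
    None,
)

# Walk up to the enclosing table only if a caption was found.
table = caption.find_parent("table") if caption is not None else None
titles = []
if table is not None:
    titles = [th.get_text(strip=True) for th in table.find_all("th", {"scope": "row"})]

print(titles)
```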
Hooked
  • Thank you so much! I thought that findAll and find_all did the same thing because of this. http://stackoverflow.com/questions/12339323/beautifulsoup-findall-find-all – UdaraW Nov 06 '14 at 21:04
  • @user2985522 I guess it depends on the version you are using: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#porting-code-to-bs4. If you are using the older BeautifulSoup it is `find_all`, but if you are using bs4 (and you should be) it is `findAll`. – Hooked Nov 06 '14 at 21:11
  • @Hooked: isn't that backwards? I was pretty sure that `findAll` was the older one. – DSM Nov 06 '14 at 21:23
  • @DSM, oops, that is correct. I'll update my answer accordingly. – Hooked Nov 06 '14 at 21:30