Scrape just the text, within an html element that has a class, using beautiful soup

Question

I'm trying to scrape a page using BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup

url='http://www.xpn.org/playlists/xpn-playlist'
page = urllib2.urlopen(url)

soup = BeautifulSoup(page.read())

for link in soup.find_all("li", class_="song"):
    print link

The problem is that the text I would like to return is not enclosed in its own HTML tag:

<li class="song"> <a href="/default.htm" onclick="return clickreturnvalue()
" onmouseout="delayhidemenu()" onmouseover="dropdownmenu(this, event, menu1, 
'100px','Death Vessel','Mandan Dink','Stay Close')">Buy</a>  
Chuck Ragan - Rotterdam - Folkadelphia Session</li>

What I want to return is: Chuck Ragan - Rotterdam - Folkadelphia Session

Bonus Points: The data returned is of the format Artist/Song/Album. What would be the proper data structure to use to store and manipulate this info?

Remi Guan · Accepted Answer · 2015-10-22T04:22:23.257

Try something like:

for link in soup.find_all("li", class_="song"):
    print link.text

Output:

Buy  Chuck Ragan - Rotterdam - Folkadelphia Session

If you want to remove Buy, you can use a slice like this:

for link in soup.find_all("li", class_="song"):
    print link.text.strip()[5:]

The output is:

Chuck Ragan - Rotterdam - Folkadelphia Session
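
Slicing off a fixed number of characters depends on the exact whitespace around the Buy link. A possible alternative, sketched here under the assumption that the wanted text is always the last child node of the li (as in the sample markup from the question), is to read that node directly. The `song_text` helper is a made-up name, and the sample markup is shortened:

```python
from bs4 import BeautifulSoup

# Sample markup based on the question (event-handler attributes omitted).
html = '''<li class="song"> <a href="/default.htm">Buy</a>
Chuck Ragan - Rotterdam - Folkadelphia Session</li>'''

soup = BeautifulSoup(html, "html.parser")

def song_text(li):
    """Return the stripped text of the last child node of an <li>."""
    # li.contents lists the child nodes in order: text, <a>, text.
    # Assumption: the wanted string is the final node.
    return li.contents[-1].strip()

titles = [song_text(li) for li in soup.find_all("li", class_="song")]
print(titles[0])  # Chuck Ragan - Rotterdam - Folkadelphia Session
```

This avoids hard-coding an offset such as 5, but it would need adjusting if the page puts another element after the song text.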

If you'd like to save these strings in a list:

[i.strip() for i in link.text.strip()[5:].split('-')]

Output:

['Chuck Ragan', 'Rotterdam', 'Folkadelphia Session']
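
Note that splitting on a bare hyphen also cuts hyphenated names apart. A small sketch, assuming the fields are always separated by " - " (space, hyphen, space) as in the sample line; `split_fields` is a hypothetical helper and the second input is made up for illustration:

```python
def split_fields(line):
    """Split 'Artist - Song - Album' into at most three stripped fields."""
    # A maxsplit of 2 keeps any later ' - ' inside the album field.
    return [part.strip() for part in line.split(" - ", 2)]

sample = split_fields("Chuck Ragan - Rotterdam - Folkadelphia Session")
hyphenated = split_fields("A-Band - Some Song - An Album")  # made-up input

print(sample)      # ['Chuck Ragan', 'Rotterdam', 'Folkadelphia Session']
print(hyphenated)  # ['A-Band', 'Some Song', 'An Album']
```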

For more info, you can check the BeautifulSoup documentation.

edited Oct 22 '15 at 04:22

answered Oct 22 '15 at 04:02


That removes the first 5 characters. See [this question](http://stackoverflow.com/questions/509211/explain-pythons-slice-notation). – Remi Guan Oct 22 '15 at 04:13
And about *What would be the proper data structure to use to store and manipulate this info?*, maybe use a database? – Remi Guan Oct 22 '15 at 04:15
Thinking more along the lines of dictionaries, sets, maps and the like. I'm just not sure how to store something with three values. In other words I would use a dict if it was just a key : value pair and I'm wondering if there is an analogue for three values. – Michael Queue Oct 22 '15 at 04:17
Beautiful! Thanks for the really complete answer. – Michael Queue Oct 22 '15 at 04:24
@MichaelJames Hmm... I think using a database here is a good idea; it would be simpler. `sqlite3` is a good choice. – Remi Guan Oct 22 '15 at 04:24
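
As a rough sketch of the `sqlite3` suggestion, assuming a single table with three text columns (the table and column names are made up for illustration, and the code is written for Python 3):

```python
import sqlite3

# In-memory database for illustration; pass a file path to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (artist TEXT, name TEXT, album TEXT)")

# Rows are (artist, name, album) tuples; this one is the sample from the question.
rows = [("Chuck Ragan", "Rotterdam", "Folkadelphia Session")]
conn.executemany("INSERT INTO songs VALUES (?, ?, ?)", rows)
conn.commit()

# Any of the three columns can be used as a lookup key.
found = conn.execute(
    "SELECT name, album FROM songs WHERE artist = ?", ("Chuck Ragan",)
).fetchall()
print(found)
```

Unlike a dict, which maps one key to one value, a table lets each of the three fields act as the search key.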

score 1 · Answer 2 · answered Oct 22 '15 at 04:33

Here is another way (assuming the li has 3 children; if not, change [2] to [1]):

>>> html = '''<li class="song"> <a href="/default.htm" onclick="return clickreturnvalue()
... " onmouseout="delayhidemenu()" onmouseover="dropdownmenu(this, event, menu1,
... '100px','Death Vessel','Mandan Dink','Stay Close')">Buy</a>
... Chuck Ragan - Rotterdam - Folkadelphia Session</li>'''

>>> from bs4 import BeautifulSoup as bs
>>> soup = bs(html, 'html.parser')  # parse the sample markup defined above
>>> all_li = soup.findAll('li', class_='song')
>>> for li in all_li:
...     text = list(li.children)[2]
...     artist, song, album = text.split('-')
...     print artist, song, album
Chuck Ragan   Rotterdam   Folkadelphia Session

Aziz Alto


Tried this solution but got an error `----> 9 all_li = soup.findall('li', class_='song')` `TypeError: 'NoneType' object is not callable` – Michael Queue Oct 22 '15 at 20:03
Notice that `soup.findAll()` and `soup.findall()` are different functions! The one we are looking for is `soup.findAll()` not the one you tried :-) – Aziz Alto Oct 22 '15 at 21:44
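
For context on that error: BeautifulSoup has no `findall` method. An unknown attribute on a soup object is treated as a tag-name lookup, so `soup.findall` searches for a `<findall>` tag and evaluates to None when there is none. A short sketch with a made-up one-line document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<li class="song">x</li>', "html.parser")

# Attribute access falls back to a tag search: soup.li is the first <li>.
print(soup.li is not None)

# There is no <findall> tag, so this lookup yields None ...
print(soup.findall is None)

# ... and calling None raises the TypeError quoted in the comment.
try:
    soup.findall("li", class_="song")
    error = None
except TypeError as exc:
    error = exc

# find_all (or its older alias findAll) is the actual method.
matches = soup.find_all("li", class_="song")
print(len(matches))
```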

score 0 · Answer 3 · answered Oct 22 '15 at 04:11

You could use something like this.

for l in soup.find_all("li", class_="song"):
    # l.contents[-1] is the text node after the "Buy" link; split it once
    parts = l.contents[-1].split("-")
    album = parts[2].strip()
    song = parts[1].strip()
    artist = parts[0].strip()

JRodDynamite


score 0 · Answer 4 · answered Nov 24 '15 at 08:29

Ended up using a named tuple for storage:

from bs4 import BeautifulSoup
import urllib2
from collections import namedtuple

url='http://www.xpn.org/playlists/xpn-playlist'
page = urllib2.urlopen(url)


soup = BeautifulSoup(page.read())

songs=[]
Song = namedtuple("Song", "artist name album")
for link in soup.find_all("li", class_="song"):
    song = Song._make(link.text.strip()[12:].split(" - "))
    songs.append(song)

for song in songs:
    # Python 2 print statement (matches urllib2); print(a, b, c) would print a tuple
    print song.artist, song.name, song.album
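
A stdlib-only sketch of working with such records, assuming each line splits into exactly three fields. `parse_song` and the malformed sample line are illustrative and not part of the original code:

```python
from collections import namedtuple

Song = namedtuple("Song", "artist name album")

def parse_song(line):
    """Return a Song, or None if the line does not have three fields."""
    parts = [p.strip() for p in line.split(" - ")]
    return Song._make(parts) if len(parts) == 3 else None

lines = [
    "Chuck Ragan - Rotterdam - Folkadelphia Session",
    "not a song line",  # made-up malformed input
]
songs = [s for s in map(parse_song, lines) if s is not None]

# Fields are reachable by name or by position, and convert to a dict.
first = songs[0]
print(first.artist)  # Chuck Ragan
print(first[1])      # Rotterdam
record = dict(first._asdict())
print(record["album"])  # Folkadelphia Session
```

Checking the field count before calling `Song._make` avoids a TypeError on lines that do not match the expected format.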
