How To Scrape Similar Classes With One Different Attribute

Question

Searched around on SO, but couldn't find anything for this.

I'm scraping using beautifulsoup... This is the code I'm using which I found on SO:

for section in soup.findAll('div',attrs={'id':'dmusic_tracklist_track_title_B00KHQOKGW'}):
    nextNode = section
    while True:
        nextNode = nextNode.nextSibling
        try:
            tag_name = nextNode.name
        except AttributeError:
            tag_name = ""
        if tag_name == "a":
            print(nextNode.text)
        else:
            print("*****")
            break

If I went to this 50 Cent album (Animal Ambition: An Untamed Desire To Win) and wanted to scrape each song, how would I do so? The problem is that each song has a different ID associated with it, based on its product code. For example, here are the XPaths of the first two songs' titles: //*[@id="dmusic_tracklist_track_title_B00KHQOKGW"]/div/a/text() and //*[@id="dmusic_tracklist_track_title_B00KHQOLWK"]/div/a/text().

You'll notice the end of the first id is B00KHQOKGW, while the second is B00KHQOLWK. Is there a way I can add a "wild card" to the end of the id to grab each of the songs no matter what product ID is at the end? For example, something like id="dmusic_tracklist_track_title_*", where I replaced the product ID with a *.

Or can I target the title through the div directly above it, like this? I feel this would be the best option, since it uses the div's class and there isn't any product ID in it:

for section in soup.findAll('div',attrs={'class':'a-section a-spacing-none overflow_ellipsis'}):
    nextNode = section
    while True:
        nextNode = nextNode.nextSibling
        try:
            tag_name = nextNode.name
        except AttributeError:
            tag_name = ""
        if tag_name == "a":
            print(nextNode.text)
        else:
            print("*****")
            break

You should be able to use the xpath starts-with function: `[starts-with(@id, "dmusic_tracklist_track_title_B00K")]` — tdelaney, Sep 23 '14 at 02:23
@tdelaney the question is clearly `BeautifulSoup` specific which doesn't support xpath. Not sure why the OP tagged the question as xpath. — alecxe, Sep 23 '14 at 03:12
@alecxe - the beautifulsoup parser in lxml builds a tree that does support xpath. I was just commenting from the given xpath and didn't think too much about how it got there! — tdelaney, Sep 23 '14 at 03:34
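
For reference, BeautifulSoup's `select()` accepts CSS selectors, and the attribute-prefix selector `[id^="..."]` plays a role similar to XPath's `starts-with`. A minimal sketch, assuming bs4 is installed; the HTML snippet is a made-up stand-in modeled on the ids quoted in the question, not the real page markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the id prefix comes from the question.
SAMPLE_HTML = """
<div id="dmusic_tracklist_track_title_B00KHQOKGW"><div><a href="#">Hold On</a></div></div>
<div id="dmusic_tracklist_track_title_B00KHQOLWK"><div><a href="#">Pilot</a></div></div>
<div id="something_else"><div><a href="#">Not a track</a></div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# [id^="..."] matches elements whose id attribute starts with the given prefix.
titles = [a.get_text(strip=True)
          for a in soup.select('div[id^="dmusic_tracklist_track_title_"] a')]
print(titles)
```

This keeps everything inside BeautifulSoup, so no XPath-capable parser is needed.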

alecxe · Answer 1 · 2014-09-23T23:04:12.143

1

You can pass a function as an id attribute value and check if it starts with dmusic_tracklist_track_title_:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')
for song in soup.find_all(id=lambda x: x and x.startswith('dmusic_tracklist_track_title_')):
    print(song.text.strip())

Prints:

Hold On [Explicit]
Don't Worry 'Bout It [feat. Yo Gotti] [Explicit]
Animal Ambition [Explicit]
Pilot [Explicit]
Smoke [feat. Trey Songz] [Explicit]
Everytime I Come Around [feat. Kidd Kidd] [Explicit]
Irregular Heartbeat [feat. Jadakiss] [Explicit]
Hustler [Explicit]
Twisted [feat. Mr. Probz] [Explicit]
Winners Circle [feat. Guordan Banks] [Explicit]
Chase The Paper [feat. Kidd Kidd] [Explicit]
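
The lambda is only an inline function. BeautifulSoup calls it with each tag's id value, and tags without an id are passed as None, which is why the `x and ...` guard is there. A sketch of the same filter written as a named function (the name `is_track_title_id` is made up for this example), exercised on plain values so it runs without a network request:

```python
PREFIX = 'dmusic_tracklist_track_title_'

def is_track_title_id(value):
    """Return True when an id attribute value starts with the track-title prefix."""
    # Tags with no id attribute are passed as None, so guard before calling startswith.
    return value is not None and value.startswith(PREFIX)

sample_ids = ['dmusic_tracklist_track_title_B00KHQOKGW', 'navbar', None]
results = [is_track_title_id(v) for v in sample_ids]
print(results)
```

The function would then be passed in place of the lambda, as `soup.find_all(id=is_track_title_id)`.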

Alternatively, you can pass a regular expression pattern as an attribute value:

import re
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')
for song in soup.find_all(id=re.compile(r'^dmusic_tracklist_track_title_\w+$')):
    print(song.text.strip())

^dmusic_tracklist_track_title_\w+$ would match dmusic_tracklist_track_title_ followed by 1 or more "alphanumeric" (0-9a-zA-Z and _) characters.
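
To see what the pattern accepts and rejects, it can be tried on a few strings without fetching the page. A small sketch; the second and third candidate values are invented for illustration:

```python
import re

# Same pattern as in the answer, written as a raw string.
TRACK_ID = re.compile(r'^dmusic_tracklist_track_title_\w+$')

candidates = [
    'dmusic_tracklist_track_title_B00KHQOKGW',    # prefix + product code
    'dmusic_tracklist_track_title_',              # prefix only: \w+ needs at least one character
    'x_dmusic_tracklist_track_title_B00KHQOKGW',  # ^ requires the prefix at the very start
]
matches = [c for c in candidates if TRACK_ID.search(c)]
print(matches)
```

Only the first candidate matches, which is the behavior wanted for the track-title ids.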

edited Sep 23 '14 at 23:04

answered Sep 23 '14 at 03:10

alecxe


Awesome, thanks so much. Can you explain the `id=lambda x: x and` part of `soup.find_all(id=lambda x: x and x.startswith('dmusic_tracklist_track_title_')` to me? I'm having trouble finding documentation that explains it, especially the `lambda x` part. – Slightly Not Average Sep 23 '14 at 05:38
@SlightlyNotAverage `lambda` is basically a short inline way to write an anonymous function (see [docs](https://docs.python.org/2/reference/expressions.html#lambda)). – alecxe Sep 23 '14 at 11:27
Say I don't want to use lambda for simplicity's sake... Would there be a problem with just using this? `def f (x): x and x` then `for song in soup.find_all(x):` and the next line `print song.text.strip()` – Slightly Not Average Sep 23 '14 at 23:01
@SlightlyNotAverage ok, added another option (regular expression based approach). – alecxe Sep 23 '14 at 23:05
sorry to be pesty, but I've read that regex isn't good for scraping... that's why I want to know how to use the `startswith()` function... It's just that the `lambda` is confusing. I think I understand it now, but I want to know the way to do it without `lambda` while still using `startswith()`. Thanks. – Slightly Not Average Sep 23 '14 at 23:08
@SlightlyNotAverage about regexes: you are confusing parsing HTML with regexes, which yes, should be avoided, with the use case we have here, where we are parsing an attribute value, which is just a string. It is perfectly ok to use a regex for this. – alecxe Sep 23 '14 at 23:10
OK. As my original question stated, I want to scrape the entire song. I want to scrape the title and time of the first song, then load that to my database for retrieval. Then, I want to do the next song on the list. After I looked at this further, this just gets all of the songs at once. I want to loop through each song in the list and get the attributes for it that I need. Some songs ([The Jimi Hendrix Experience](http://www.amazon.com/Experience-Hendrix-Best-Jimi/dp/B00307OTDE/ref=sr_sp-atf_image_1_1?s=dmusic&ie=UTF8&qid=1411536279&sr=1-1)) have an artist column. I want to scrape that. – Slightly Not Average Sep 24 '14 at 05:26
Sorry for the double comment, but I was running out of space. This is similar to what I meant: http://stackoverflow.com/questions/11647348/find-next-siblings-until-a-certain-one-using-beautifulsoup I just need it to iterate through each one so I can add each individual song to my database. – Slightly Not Average Sep 24 '14 at 05:43
did you get my comment? – Slightly Not Average Sep 25 '14 at 19:02
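
The follow-up about handling one song at a time could be approached by treating each matched title as the entry point to its row and reading the other columns from there. A rough sketch only: the table layout, the `duration` and `artist` class names, the helper `cell_text`, and all sample values are assumptions made up for illustration, so the real page would need its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table row per song. Only the id prefix comes from
# the question; the class names and values are invented for this sketch.
SAMPLE_HTML = """
<table>
  <tr>
    <td><div id="dmusic_tracklist_track_title_AAA111"><a href="#">Example Song One</a></div></td>
    <td class="duration">1:00</td>
  </tr>
  <tr>
    <td><div id="dmusic_tracklist_track_title_BBB222"><a href="#">Example Song Two</a></div></td>
    <td class="artist">Example Artist</td>
    <td class="duration">2:00</td>
  </tr>
</table>
"""

def cell_text(row, css_class):
    """Return the stripped text of the first cell with this class, or None."""
    cell = row.find('td', class_=css_class)
    return cell.get_text(strip=True) if cell else None

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
songs = []
for title_div in soup.find_all(id=lambda x: x and x.startswith('dmusic_tracklist_track_title_')):
    row = title_div.find_parent('tr')  # the row holding this song's other columns
    songs.append({
        'title': title_div.get_text(strip=True),
        'duration': cell_text(row, 'duration'),
        'artist': cell_text(row, 'artist'),  # None when there is no artist column
    })

for song in songs:
    print(song)
```

Each dictionary represents one song, so a database insert could go inside the loop instead of the `append`.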
