Searched around on SO, but couldn't find anything for this.

I'm scraping using beautifulsoup... This is the code I'm using which I found on SO:

for section in soup.findAll('div', attrs={'id': 'dmusic_tracklist_track_title_B00KHQOKGW'}):
    nextNode = section
    while True:
        nextNode = nextNode.nextSibling
        try:
            tag_name = nextNode.name
        except AttributeError:
            # None or a plain string has no .name
            tag_name = ""
        if tag_name == "a":
            print(nextNode.text)  # .text is a property, not a method
        else:
            print("*****")
            break

If I went to this 50 Cent album (Animal Ambition: An Untamed Desire To Win) and wanted to scrape each song, how would I do so? The problem is that each song has a different ID based on its product code. For example, here are the XPaths of the first two songs' titles: //*[@id="dmusic_tracklist_track_title_B00KHQOKGW"]/div/a/text() and //*[@id="dmusic_tracklist_track_title_B00KHQOLWK"]/div/a/text().

You'll notice the end of the first id is B00KHQOKGW, while the second is B00KHQOLWK. Is there a way I can add a "wild card" to the end of the id to grab each of the songs no matter what product ID is at the end? For example, something like id="dmusic_tracklist_track_title_*", where I replaced the product ID with a *.

Or can I use a div to target the title I want, like this? I feel like this would be the best option: it uses the class of the div right above the title, and there isn't any product ID in it.

for section in soup.findAll('div', attrs={'class': 'a-section a-spacing-none overflow_ellipsis'}):
    nextNode = section
    while True:
        nextNode = nextNode.nextSibling
        try:
            tag_name = nextNode.name
        except AttributeError:
            # None or a plain string has no .name
            tag_name = ""
        if tag_name == "a":
            print(nextNode.text)  # .text is a property, not a method
        else:
            print("*****")
            break

  • You should be able to use the xpath starts-with function: `[starts-with(@id, "dmusic_tracklist_track_title_B00K")]` – tdelaney Sep 23 '14 at 02:23
  • @tdelaney the question is clearly `BeautifulSoup` specific which doesn't support xpath. Not sure why the OP tagged the question as xpath. – alecxe Sep 23 '14 at 03:12
  • @alecxe - the beautifulsoup parser in lxml builds a tree that does support xpath. I was just commenting from the given xpath and didn't think too much about how it got there! – tdelaney Sep 23 '14 at 03:34
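
For readers who want the "starts with" behaviour of the XPath suggestion while staying inside BeautifulSoup, a CSS attribute-prefix selector is one option. This is a different technique from the XPath in the comment above, not a translation of it. The sketch below assumes a bs4 version with `select()` support for `[attr^=...]`, and the HTML snippet is hypothetical markup modeled on the ids quoted in the question, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the ids quoted in the question.
html = """
<div id="dmusic_tracklist_track_title_B00KHQOKGW"><div><a href="#">Hold On [Explicit]</a></div></div>
<div id="dmusic_tracklist_track_title_B00KHQOLWK"><div><a href="#">Don't Worry 'Bout It [Explicit]</a></div></div>
<div id="unrelated"><a href="#">Not a track</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# [id^="..."] matches elements whose id starts with the given prefix;
# the trailing "a" selects links nested anywhere inside those divs.
selector = 'div[id^="dmusic_tracklist_track_title_"] a'
titles = [a.get_text(strip=True) for a in soup.select(selector)]
print(titles)
```

The third div is skipped because its id does not begin with the prefix.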

1 Answer

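You can pass a function as the id argument. BeautifulSoup calls it with each tag's id value (None when the tag has no id) and keeps the tags for which it returns a true value, so the function can check whether the id starts with dmusic_tracklist_track_title_: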

from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent with the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)

# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(response.content, 'html.parser')
# "x and ..." skips tags that have no id (x is None for those)
for song in soup.find_all(id=lambda x: x and x.startswith('dmusic_tracklist_track_title_')):
    print(song.text.strip())

Prints:

Hold On [Explicit]
Don't Worry 'Bout It [feat. Yo Gotti] [Explicit]
Animal Ambition [Explicit]
Pilot [Explicit]
Smoke [feat. Trey Songz] [Explicit]
Everytime I Come Around [feat. Kidd Kidd] [Explicit]
Irregular Heartbeat [feat. Jadakiss] [Explicit]
Hustler [Explicit]
Twisted [feat. Mr. Probz] [Explicit]
Winners Circle [feat. Guordan Banks] [Explicit]
Chase The Paper [feat. Kidd Kidd] [Explicit]

Alternatively, you can pass a regular expression pattern as an attribute value:

import re
from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent with the request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.122 Safari/537.36'}
response = requests.get('http://www.amazon.com/dp/B00KHQOI8C/?tag=stackoverfl08-20', headers=headers)

# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(response.content, 'html.parser')
# Raw string keeps \w from being treated as a string escape
for song in soup.find_all(id=re.compile(r'^dmusic_tracklist_track_title_\w+$')):
    print(song.text.strip())

^dmusic_tracklist_track_title_\w+$ would match dmusic_tracklist_track_title_ followed by 1 or more "alphanumeric" (0-9a-zA-Z and _) characters.
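
If the lambda is the confusing part, the same startswith() check can be written as an ordinary named function and passed as the id argument. The sketch below is a minimal illustration: the function name `is_track_title_id`, the `PREFIX` constant, and the HTML snippet are all made up for the example, with the ids taken from the question.

```python
from bs4 import BeautifulSoup

PREFIX = "dmusic_tracklist_track_title_"

def is_track_title_id(value):
    # BeautifulSoup passes each tag's id here, or None when the tag has no id,
    # so the None check must come before calling startswith().
    return value is not None and value.startswith(PREFIX)

# Hypothetical markup modeled on the ids quoted in the question.
html = """
<div id="dmusic_tracklist_track_title_B00KHQOKGW"><a>Hold On [Explicit]</a></div>
<div><a>No id on this div</a></div>
<div id="dmusic_tracklist_track_title_B00KHQOLWK"><a>Pilot [Explicit]</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all(id=is_track_title_id)
ids = [tag["id"] for tag in matches]
print(ids)
```

This is equivalent to `id=lambda x: x and x.startswith(...)`; the lambda is just the same function written inline without a name.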

alecxe
  • Awesome, thanks so much. Can you explain the `id=lambda x: x and` part of `soup.find_all(id=lambda x: x and x.startswith('dmusic_tracklist_track_title_'))` to me? I'm having trouble finding documentation that explains it, especially the `lambda x` part. – Slightly Not Average Sep 23 '14 at 05:38
  • @SlightlyNotAverage `lambda` is basically a short inline way to write an anonymous function (see [docs](https://docs.python.org/2/reference/expressions.html#lambda)). – alecxe Sep 23 '14 at 11:27
  • Say I don't want to use lambda for simplicity's sake... Would there be a problem with just using this? `def f (x): x and x` then `for song in soup.find_all(x):` and the next line `print song.text.strip()` – Slightly Not Average Sep 23 '14 at 23:01
  • @SlightlyNotAverage ok, added another option (regular expression based approach). – alecxe Sep 23 '14 at 23:05
  • sorry to be pesty, but I've read that regex isn't good for scraping... that's why I want to know how to use the `startswith()` function... It's just that the `lambda` is confusing. I think I understand it now, but I want to know the way to do it without `lambda` while still using `startswith()`. Thanks. – Slightly Not Average Sep 23 '14 at 23:08
  • @SlightlyNotAverage about regexes: you are confusing two things. Parsing HTML with regexes should indeed be avoided, but the use case here is different - we are matching an attribute value, which is a plain string. It is perfectly ok to use a regex for this. – alecxe Sep 23 '14 at 23:10
  • OK. As my original question stated, I want to scrape the entire song. I want to scrape the title and time of the first song, then load that to my database for retrieval. Then, I want to do the next song on the list. After I looked at this further, this just gets all of the songs at once. I want to loop through each song in the list and get the attributes for it that I need. Some songs ([The Jimi Hendrix Experience](http://www.amazon.com/Experience-Hendrix-Best-Jimi/dp/B00307OTDE/ref=sr_sp-atf_image_1_1?s=dmusic&ie=UTF8&qid=1411536279&sr=1-1)) have an artist column. I want to scrape that. – Slightly Not Average Sep 24 '14 at 05:26
  • Sorry for the double comment, but I was running out of space. This is similar to what I meant: http://stackoverflow.com/questions/11647348/find-next-siblings-until-a-certain-one-using-beautifulsoup I just need it to iterate through each one so I can add each individual song to my database. – Slightly Not Average Sep 24 '14 at 05:43
  • did you get my comment? – Slightly Not Average Sep 25 '14 at 19:02
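
On handling one song at a time: find_all() returns a list, so each matched element can be processed individually inside the loop, for example by building one record per song before writing it to a database. The sketch below is only an illustration under assumptions. The HTML is hypothetical markup modeled on the ids in the question, the record keys (`product_id`, `title`) are invented names, and the time and artist columns are left out because their markup is not shown anywhere in this thread.

```python
from bs4 import BeautifulSoup

PREFIX = "dmusic_tracklist_track_title_"

# Hypothetical markup modeled on the ids quoted in the question.
html = """
<div id="dmusic_tracklist_track_title_B00KHQOKGW"><div><a>Hold On [Explicit]</a></div></div>
<div id="dmusic_tracklist_track_title_B00KHQOLWK"><div><a>Pilot [Explicit]</a></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

songs = []
for node in soup.find_all(id=lambda v: v and v.startswith(PREFIX)):
    # The product code is whatever follows the shared prefix in the id.
    record = {
        "product_id": node["id"][len(PREFIX):],
        "title": node.get_text(strip=True),
    }
    # A database insert for this one song could go here.
    songs.append(record)

for record in songs:
    print(record["product_id"], record["title"])
```

Other per-song fields would need selectors based on the actual page structure, looked up relative to `node` inside the same loop.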