
I've been trying to scrape Bandcamp fan pages to get a list of the albums they have purchased, and I'm having trouble doing it efficiently. I wrote something with Selenium, but it's fairly slow, so I'd like to learn a solution that sends a POST request to the site and parses the JSON response instead.

Here's a sample collection page: https://bandcamp.com/nhoward

Here's the Selenium code:

import time

from bs4 import BeautifulSoup, SoupStrainer

# getBrowser(), scroll(), and threadLocal are defined elsewhere in my script

def scrapeFanCollection(url):
    browser = getBrowser()
    setattr(threadLocal, 'browser', browser)
    # Go to url
    browser.get(url)

    try:
        # Click the "show more" button
        browser.find_element_by_class_name('show-more').click()

        # Wait two seconds
        time.sleep(2)
        # Scroll to the bottom, loading the full collection
        scroll(browser, 2)
    except Exception:
        pass

    # Parse only the collection item links
    soup_a = BeautifulSoup(
        browser.page_source,
        'lxml',
        parse_only=SoupStrainer('a', {"class": "item-link"})
    )

    urls = []

    # Loop through all the a elements in the page source
    for item in soup_a.find_all('a', {"class": "item-link"}):
        url = item.get('href')
        if url is not None:
            urls.append(url)

    return urls
  • just edited with the code – Trevor Fox Oct 18 '20 at 21:41
  • Thanks. What's the output you want? – ggorlen Oct 18 '20 at 21:42
  • `curl -X POST -H "Content-Type: Application/JSON" -d '{"fan_id":7352,"older_than_token":"1586531374:1498564527:a::","count":10000}' https://bandcamp.com/api/fancollection/1/collection_items` works as shown [here](https://stackoverflow.com/questions/56518368/how-to-scrape-data-after-clicking-button) using the `requests` package. – ggorlen Oct 18 '20 at 22:06
  • oh so just set the count to higher! thanks! – Trevor Fox Oct 18 '20 at 22:22
  • actually that doesn't completely work. or rather is there a way to get the fan_id without running the request? it's in the datablob object it returns with the first request – Trevor Fox Oct 18 '20 at 22:42
  • Do you mean you don't know the fan id up front? Worst case scenario, make a request to get the fan id from the HTML, then use it to dynamically build the request. Since the linked example doesn't show this, I'm happy to provide an answer if you're still not sure. It'd be really helpful to know exactly what data you'd like to extract so I can be sure my answer actually does what you need and I don't have to guess or revise, but if you're happy with the JSON that works too. – ggorlen Oct 18 '20 at 22:47
  • Yeah there's nothing on the collection page source where you could get the fan id from. I'm also unsure of how to dynamically get the older_than_token either. Basically the data I'm looking for is at a minimum a link to each item in a fan's collection – Trevor Fox Oct 18 '20 at 23:14
  • As you mentioned earlier, the `id="pagedata"` element has a `data-blob` which is JSON and contains `["fan_data"]["fan_id"]` needed to make the POST request to the `collection_items` endpoint. I wrote an answer since it adds a decent amount to the linked solution above. – ggorlen Oct 19 '20 at 00:02

3 Answers


The API can be accessed as follows:

$ curl -X POST -H "Content-Type: Application/JSON" -d \
'{"fan_id":82985,"older_than_token":"1586531374:1498564527:a::","count":10000}' \
https://bandcamp.com/api/fancollection/1/collection_items

I didn't encounter a scenario where an "older_than_token" was stale, so the problem boils down to getting the "fan_id" for a given URL.

This information is located in the data-blob attribute of the id="pagedata" element.

>>> import json
>>> import requests
>>> from bs4 import BeautifulSoup
>>> res = requests.get("https://www.bandcamp.com/ggorlen")
>>> soup = BeautifulSoup(res.text, "lxml")
>>> user = json.loads(soup.find(id="pagedata")["data-blob"])
>>> user["fan_data"]["fan_id"]
82985

Putting it all together (building upon this answer):

import json
import requests
from bs4 import BeautifulSoup

fan_page_url = "https://www.bandcamp.com/ggorlen"
collection_items_url = "https://bandcamp.com/api/fancollection/1/collection_items"

# Scrape the fan page for the data-blob JSON, which contains the fan_id
res = requests.get(fan_page_url)
soup = BeautifulSoup(res.text, "lxml")
user = json.loads(soup.find(id="pagedata")["data-blob"])

# POST to the collection_items endpoint with a count large enough
# to cover the whole collection in one request
data = {
    "fan_id": user["fan_data"]["fan_id"],
    "older_than_token": user["wishlist_data"]["last_token"],
    "count": 10000,
}
res = requests.post(collection_items_url, json=data)
collection = res.json()

# Print the first 10 items as a sanity check
for item in collection["items"][:10]:
    print(item["album_title"], item["item_url"])

I'm using user["wishlist_data"]["last_token"], which has the same format as the "older_than_token", just in case this matters.
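If a collection were ever larger than the count you pass, you should be able to page through it by feeding the token from each response back in as the next "older_than_token". Here's a sketch of that loop; note that the response field names "more_available" and "last_token" are assumptions on my part, not something confirmed above, so check them against a real response first:

import json
import requests
from bs4 import BeautifulSoup

def fetch_all_items(fan_page_url, batch_size=500):
    # Grab fan_id and the starting token from the fan page's data-blob
    res = requests.get(fan_page_url)
    soup = BeautifulSoup(res.text, "lxml")
    user = json.loads(soup.find(id="pagedata")["data-blob"])
    token = user["wishlist_data"]["last_token"]
    items = []

    while True:
        res = requests.post(
            "https://bandcamp.com/api/fancollection/1/collection_items",
            json={
                "fan_id": user["fan_data"]["fan_id"],
                "older_than_token": token,
                "count": batch_size,
            },
        )
        page = res.json()
        items.extend(page["items"])

        # "more_available" and "last_token" are assumed response fields;
        # stop when the API reports nothing older than the current token
        if not page.get("more_available"):
            return items
        token = page["last_token"]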

– ggorlen

In order to get the entire collection, I changed the previous code from "older_than_token": user["wishlist_data"]["last_token"] to user["collection_data"]["last_token"], which contained the right token.
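For reference, assuming the same user blob as the script above, the modified request data would look like this:

# Seed the request with the collection token instead of the wishlist token
data = {
    "fan_id": user["fan_data"]["fan_id"],
    "older_than_token": user["collection_data"]["last_token"],
    "count": 10000,
}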

  • Interesting--if you're referring to [my script](https://stackoverflow.com/a/64419449/6243352) in this thread, it still works fine for me. When I use your token, it seems to skip the first page of items. – ggorlen May 14 '23 at 20:36

Unfortunately for you, this particular Bandcamp site doesn't seem to make any HTTP API call to fetch the list of albums. You can check this in your browser's developer tools: open the Network tab and click the XHR filter. The only call being made seems to be the one fetching your collection details.

– Maks Babarowski
  • Sure it does. See the comment above. It delivers the initial batch statically in HTML then there are additional requests made to get the rest of the collection and images when you click to view the whole collection and scroll. – ggorlen Oct 18 '20 at 21:47