
I want to build a small tool to help a family member download podcasts off a site.

In order to get the links to the files, I first need to filter them out (with bs4 + Python 3). The files are on this website (Estonian): Download Page ("Laadi alla" = "Download")

So far my code is as follows (most of it is from examples on Stack Overflow):

from bs4 import BeautifulSoup

import urllib.request
import re

url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")
content = url.read()
soup = BeautifulSoup(content, "lxml")

links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
print("Links:", links)

Unfortunately I always get only two results. Output:

Links: ['http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3']

These are not the ones I want. My best guess is that the page has somewhat broken HTML and bs4 / the parser is not able to find anything else. I've tried different parsers, with no change in the result. Maybe I'm doing something else wrong too.

My goal is to have the individual links in a list for example. I'll filter out any duplicates / unwanted entries later myself.

Just a quick note, just in case: this is a public radio station and all the content is legally hosted.

My new code is:

for link in soup.find_all('d2p1:DownloadUrl'):
    print(link.text)

I am very unsure if the tag is selected correctly.

None of the examples listed in this question are actually working. See the answer below for working code.

Wi_Zeus
  • The page is rendered with JavaScript see my answer to https://stackoverflow.com/questions/45259232/scraping-google-finance-beautifulsoup/45259523#45259523 for details of how to scrape web pages rendered with JavaScript – Dan-Dev Jul 28 '17 at 12:52

1 Answer


Please be aware that the listings on the page are served through an API. So instead of requesting the HTML page, I suggest requesting the API link, which returns 200 .mp3 links.

Please follow the steps below:

  1. Request the API link, not the HTML page link.
  2. Check the response: it's JSON, so extract the fields you need.
  3. Help your family, any time :)

Solution

import requests

# Request the API endpoint directly instead of scraping the HTML page
myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
r = requests.get(myurl)
data = r.json()

# Map each episode's download URL to its title
all_mp3 = {}
for listing in data['ListItems']:
    for podcast in listing['Podcasts']:
        all_mp3[podcast['DownloadUrl']] = listing['Header']

print(all_mp3)

all_mp3 is what you need: a dictionary with download URLs as keys and mp3 names as values.
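Since the end goal is downloading the podcasts, a loop over that dictionary could look like the sketch below. This is not part of the answer's code: the `save_podcasts` function name and `dest_dir` parameter are my own, and it assumes each download URL ends in the mp3 filename (as the example URLs here do).

```python
import os
import requests

def save_podcasts(all_mp3, dest_dir='.'):
    """Download each entry of a {download_url: episode_title} dict to dest_dir.

    Hypothetical helper, not from the answer above; assumes the URL path
    ends in the mp3 filename.
    """
    for url, title in all_mp3.items():
        # Derive a local filename from the last path segment of the URL
        filename = os.path.join(dest_dir, os.path.basename(url))
        r = requests.get(url, stream=True)
        r.raise_for_status()
        with open(filename, 'wb') as f:
            # Stream in chunks so large files don't load fully into memory
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
        print('Saved', title, '->', filename)

# usage (after building all_mp3 as in the solution above):
# save_podcasts(all_mp3, dest_dir='podcasts')
```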

ExtractTable.com
  • I have tried my best but i just cannot figure out how to do it. My new code is: for link in soup.find_all('d2p1:DownloadUrl'): print(link.text) I am very unsure if the tag is selected correctly. – Wi_Zeus Jul 28 '17 at 13:42
  • @Manuauto: the response is JSON (key-value pairs), which means you have to extract the value you need using its key. I encourage you to try working on it. I believe you have given it a try, so I'm posting the **Solution** you need. Please check the updated response above – ExtractTable.com Jul 28 '17 at 15:32
  • Thank you. This code works very well. I would not have been able to program it myself. Now I'll add further functionality to it, knowing how to expand it. Most importantly, I now know how to get the data in the first place. – Wi_Zeus Jul 28 '17 at 16:38