
I am trying to scrape the following web page with BeautifulSoup (BS): https://www.racingpost.com. For example, I want to extract all of the course names. Course names are under this tag:

<span class="rh-cardsMatrix__courseName">Wincanton</span>

My code is here:

from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.racingpost.com"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
pages = soup.find_all('span', {'class': 'rh-cardsMatrix__courseName'})
for page in pages:
    print(page.text)

And I don't get any output. I think it has some issue with parsing, and I have tried all of the parsers available for BS. Could someone advise here? Is this even possible to do with BS?

  • Does `soup.find_all('span')` return all the `span` elements? If so, your filter needs work. You can pass a function as the filter if you want (see the sketch after these comments). – JDunken Mar 05 '20 at 17:58
  • `soup.find_all('span')` returns some `span` elements, but they are not the ones I see with the browser's "Inspect element" option. – Andrija_Grozdanovic Mar 05 '20 at 18:02
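For reference, a function filter as suggested in the first comment would look like the sketch below, using the soup object from the question; it still only finds something if the spans are actually present in the downloaded HTML, which (as the answers explain) they are not here.

def is_course_name(tag):
    # Match any <span> whose class list contains the course-name class
    return tag.name == "span" and "rh-cardsMatrix__courseName" in (tag.get("class") or [])

for span in soup.find_all(is_course_name):
    print(span.text)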

3 Answers


The data you are looking for seems to be hidden in a script block at the end of the raw HTML.

You can try something like this:

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
from pandas import json_normalize

url = 'https://www.racingpost.com'
res = requests.get(url).text

# The JSON with the course data sits between these two marker strings in the raw HTML;
# look them up manually in the page source (see the caveat below).
raw = res.split('cardsMatrix":{"courses":')[1].split(',"date":"2020-03-06","heading":"Tomorrow\'s races"')[0]
data = json.loads(raw)
df = json_normalize(data)

Output:

id  abandoned   allWeather  surfaceType     colour  name    countryCode     meetingUrl  hashName    meetingTypeCode     races
0   1083    False   True    Polytrack   3   Chelmsford  GB  /racecards/1083/chelmsford-aw/2020-03-06    chelmsford-aw   Flat    [{'id': 753047, 'abandoned': False, 'result': ...
1   1212    False   False       4   Ffos Las    GB  /racecards/1212/ffos-las/2020-03-06     ffos-las    Jumps   [{'id': 750498, 'abandoned': False, 'result': ...
2   1138    False   True    Polytrack   11  Dundalk     IRE     /racecards/1138/dundalk-aw/2020-03-06   dundalk-aw  Flat    [{'id': 753023, 'abandoned': False, 'result': ...
3   513     False   True    Tapeta  5   Wolverhampton   GB  /racecards/513/wolverhampton-aw/2020-03-06  wolverhampton-aw    Flat    [{'id': 750658, 'abandoned': False, 'result': ...
4   565     False   False       0   Jebel Ali   UAE     /racecards/565/jebel-ali/2020-03-06     jebel-ali   Flat    [{'id': 753155, 'abandoned': False, 'result': ...
5   206     False   False       0   Deauville   FR  /racecards/206/deauville/2020-03-06     deauville   Flat    [{'id': 753186, 'abandoned': False, 'result': ...
6   54  True    False       1   Sandown     GB  /racecards/54/sandown/2020-03-06    sandown     Jumps   [{'id': 750510, 'abandoned': True, 'result': F...
7   30  True    False       2   Leicester   GB  /racecards/30/leicester/2020-03-06  leicester   Jumps   [{'id': 750501, 'abandoned': True, 'result': F...

Caveat: be aware that you have to look up the marker strings used to split res manually in the page source; the date in the second marker changes from day to day.

Edit: More robust solution.

To get the whole script block and parse from there, try this code:

url = 'https://www.racingpost.com'
res = requests.get(url).content
soup = BeautifulSoup(res, "html.parser")

# The salient data seems to be in the 20th script block
data = soup.find_all("script")[19].text
clean = data.split('window.__PRELOADED_STATE = ')[1].split(";\n")[0]
clean = json.loads(clean)
list(clean.keys())

Output:

['stories', 'bookmakers', 'panelTemplate', 'cardsMatrix', 'advertisement']

Then retrieve, for example, the data stored under the cardsMatrix key:

parsed = json_normalize(clean["cardsMatrix"]).courses.values[0]
pd.DataFrame(parsed)

The output is again the table above, this time obtained with the more robust approach:

id  abandoned   allWeather  surfaceType     colour  name    countryCode     meetingUrl  hashName    meetingTypeCode     races
0   1083    False   True    Polytrack   3   Chelmsford  GB  /racecards/1083/chelmsford-aw/2020-03-06    chelmsford-aw   Flat    [{'id': 753047, 'abandoned': False, 'result': ...
1   1212    False   False       4   Ffos Las    GB  /racecards/1212/ffos-las/2020-03-06     ffos-las    Jumps   [{'id': 750498, 'abandoned': False, 'result': ...
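
To answer the original question directly: the course names sit in the name column of that frame. A short follow-up, assuming the structure shown in the output above:

courses = pd.DataFrame(parsed)
print(courses["name"].tolist())
# e.g. ['Chelmsford', 'Ffos Las', 'Dundalk', 'Wolverhampton', ...]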
petezurich

Viewing the source code of https://www.racingpost.com, no elements have the class name rh-cardsMatrix__courseName. Querying for it on the rendered page shows that it does exist. This suggests that the elements with that class name are generated with JavaScript, which BeautifulSoup doesn't support (it doesn't run JavaScript).

You'll instead want to find the endpoints that return the data used to build those elements (e.g., look for XHR requests in the browser's Network tab) and call those directly to get the data you need.
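
A minimal sketch of that approach; the endpoint URL and the JSON keys below are placeholders, not the real ones, and have to be copied from the browser's Network tab while the page loads:

import requests

# Placeholder URL: substitute the real XHR endpoint found in DevTools > Network.
api_url = "https://www.racingpost.com/path/to/cards-endpoint"  # hypothetical

resp = requests.get(api_url, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()
payload = resp.json()

# The key names here are hypothetical too; inspect the real JSON response
# to see where the course names actually live before drilling into it.
for course in payload.get("courses", []):
    print(course.get("name"))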

mattbasta

Thanks mattbasta for your answer; it directed me to this question, which solved my problem: PyQt4 to PyQt5 -> mainFrame() deprecated, need fix to load web pages. Once the page is rendered that way, the original parsing code works as written: `soup = BeautifulSoup(data, "html.parser")` followed by `pages = soup.find_all('span', {'class': 'rh-cardsMatrix__courseName'})`.
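
For completeness, a minimal sketch of that approach, assuming PyQt5 and PyQtWebEngine are installed: render the page in a QWebEngineView, grab the rendered HTML once loading finishes, and feed it to the original BeautifulSoup code. The PageRenderer class name is just for illustration, and a slow page may still need an extra delay before the course spans appear.

import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView
from bs4 import BeautifulSoup

class PageRenderer(QWebEngineView):
    """Load a URL, let its JavaScript run, and keep the rendered HTML."""

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        super().__init__()
        self.html = None
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self, ok):
        # toHtml() is asynchronous: it hands the rendered DOM to a callback
        self.page().toHtml(self._store_html)

    def _store_html(self, html):
        self.html = html
        self.app.quit()

renderer = PageRenderer("https://www.racingpost.com")
soup = BeautifulSoup(renderer.html, "html.parser")
for span in soup.find_all('span', {'class': 'rh-cardsMatrix__courseName'}):
    print(span.text)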