
I'm trying to scrape this website: http://www.fivb.org/EN/BeachVolleyball/PlayersRanking_W.asp, but the page loads the contents of the table (probably through AJAX) only after the initial page load.

My attempt:

import requests
from bs4 import BeautifulSoup, Comment
uri = 'http://www.fivb.org/EN/BeachVolleyball/PlayersRanking_W.asp'

r = requests.get(uri)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup)

But the div with the id='BTechPlayM' remains empty, regardless of what I do. I've tried:

  • Setting a timeout on the request: requests.get(uri, timeout=10)
  • Passing headers
  • Using eventlet, to set a delay
  • And most recently, trying to use the Selenium library with PhantomJS (installed from npm), but that rabbit hole just kept getting deeper and deeper.

Is there a way to send a request to a URI, wait X seconds, and then return the contents?

... Or to send a request to a URI, keep checking whether a div contains an element, and only return the contents once it does?


2 Answers


Short answer: No. You cannot do that using requests.

But, as you said, the table data is generated dynamically using JavaScript. The data comes from http://www.fivb.org/Vis/Public/JS/Beach/TechPlayRank.aspx?Gender=1&id=BTechPlayW&Date=20180326 (the same URL used in the code below). The response is not JSON, though; it's JavaScript, with the rows embedded as list literals. You can extract those lists with a regex.

But, again, each match returned by the regex is a string, not an actual list. You can convert such a string to a list using ast.literal_eval(). For example, one match looks like this:

'["1", "Humana-Paredes", "CAN", "4", "1,720", ""]'

Complete code:

import re
import requests
import ast

r = requests.get('http://www.fivb.org/Vis/Public/JS/Beach/TechPlayRank.aspx?Gender=1&id=BTechPlayW&Date=20180326')
data = re.findall(r'(\[[^[\]]*])', r.text)  # every [...] list literal in the JS response
for player in data:
    details = ast.literal_eval(player)
    print(details)  # this var is a list (format shown below)

Partial output:

['1', 'Humana-Paredes', 'CAN', '4', '1,720', '']
['', 'Pavan', 'CAN', '4', '1,720', '']
['3', 'Talita', 'BRA', '4', '1,660', '']
['', 'Larissa', 'BRA', '4', '1,660', '']
['5', 'Hermannova', 'CZE', '4', '1,360', '']
['', 'Slukova', 'CZE', '4', '1,360', '']
['7', 'Laboureur', 'GER', '4', '1,340', '']
...

The basic format of this list (details) is:

[<Rank>, <Name>, <Country>, <Nb. part.>, <Points>, <Entry pts.>]

You can utilize this data however you want. For example, details[1] holds the player's name in each row, so collecting it across all rows gives you all the names (see the sketch below).
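
For instance, a minimal sketch (same URL and regex as above; the printed output is illustrative) that collects just the names:

import re
import ast
import requests

r = requests.get('http://www.fivb.org/Vis/Public/JS/Beach/TechPlayRank.aspx?Gender=1&id=BTechPlayW&Date=20180326')

# Index 1 of each parsed row is the player's name.
names = [ast.literal_eval(row)[1] for row in re.findall(r'(\[[^[\]]*])', r.text)]
print(names)  # ['Humana-Paredes', 'Pavan', 'Talita', ...]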

  • @Zeth, as an alternative, you can have a look at the [requests_html](https://pypi.python.org/pypi/requests-html/0.8.0) library, which is available for Python 3.6 and higher. – Keyur Potdar Mar 31 '18 at 13:33
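
Picking up the requests_html suggestion from the comment above, a minimal sketch (untested; r.html.render() downloads Chromium on first use, and the #BTechPlayM selector is taken from the question):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://www.fivb.org/EN/BeachVolleyball/PlayersRanking_W.asp')
r.html.render()  # executes the page's JavaScript before parsing
table = r.html.find('#BTechPlayM', first=True)
if table is not None:
    print(table.text)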

You can use Selenium, as requests doesn't give you an option to wait:

from selenium import webdriver
from bs4 import BeautifulSoup

uri = 'http://www.fivb.org/EN/BeachVolleyball/PlayersRanking_W.asp'

browser = webdriver.Chrome("./chromedriver")  # path to the chromedriver binary
browser.set_page_load_timeout(60)
browser.get(uri)  # open the page in the browser and let its JavaScript run
text = browser.page_source
browser.quit()

soup = BeautifulSoup(text, 'html.parser')
print(soup)

You will have to download chromedriver first.
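
If you also want the "keep checking until the div contains an element" behaviour from the question, Selenium's explicit waits poll for a condition instead of sleeping a fixed time. A sketch, assuming the rows end up inside the #BTechPlayM div mentioned in the question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome("./chromedriver")
browser.get('http://www.fivb.org/EN/BeachVolleyball/PlayersRanking_W.asp')

# Poll for up to 30 seconds until a row appears inside the div.
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#BTechPlayM tr'))
)
text = browser.page_source
browser.quit()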

  • Unless we're talking Facebook-like sites, it's usually fairly easy to reverse engineer sites that load data with AJAX. The data you seek can be retrieved with requests from this URL: http://www.fivb.org/Vis/Public/JS/Beach/TechPlayRank.aspx?Gender=1&id=BTechPlayW&Date=20180326 – jlaur Mar 31 '18 at 11:45
  • Btw, since you're on Python 3 you should use r.text, not r.content. The first returns a str - which is what bs4 expects - the latter bytes... – jlaur Mar 31 '18 at 11:47