Scrape a web page's contents using Python/selenium

Question

I'm trying to scrape the contents of a table. I believe the table is rendered by JavaScript, so I'm using the selenium package with Python 3. I've seen others locate a table by its XPath in order to scrape its contents, but I'm not sure how to identify the correct XPath.

How can I extract the table's contents? If I use an XPath, how do I identify the correct XPath(s) for the table or its contents by inspecting the web page's source?

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome('path/to/chromedriver.exe')
url = 'https://ultrasignup.com/results_event.aspx?did=6727'
driver.get(url)

# Now I need to get the table's contents. I might do something like this:
table = driver.find_element(By.XPATH, 'my_xpath')  # single element, not a list
table_html = table.get_attribute('outerHTML')  # outerHTML keeps the <table> tag, which read_html needs
df = pd.read_html(StringIO(table_html))[0]
print(df)
driver.quit()

The page-under-test has many page elements with `id` attributes. Locating via `id` will be less fragile; YMMV. — orde, Jun 23 '19 at 18:33
I believe there is no need to scrape, because they have an API. If you visit this link you will see nicely formatted data from the table you provided: https://ultrasignup.com/service/events.svc/results/6727/json?rows=1500 — andreilozhkin, Jun 23 '19 at 18:28
@andreilozhkin you began to post some code that looked helpful, but then removed it. I could accept your answer if you put it back up! — twb10, Jun 23 '19 at 19:27

score 1 · Accepted Answer · answered Jun 23 '19 at 19:02

I believe there is no need to scrape, because they have an API.

If you visit this link you will see nicely formatted data from the table you provided: https://ultrasignup.com/service/events.svc/results/6727/json

Some code:

import json, requests

url = 'https://ultrasignup.com/service/events.svc/results/6727/json'

response = requests.get(url)

# Get all people from the table
people = list(response.json())

# Print first person's information
print(people[0])

Hope it helps!
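To turn the JSON into something table-shaped, a minimal sketch along the following lines may help. It uses only the standard library. The helper names `fetch_results` and `records_to_csv` are made up for illustration, and the sample records use hypothetical field names, since the real keys depend on what the endpoint returns. It also assumes the endpoint returns a list of JSON objects.

```python
import csv
import io
import json
from urllib.request import urlopen

RESULTS_URL = "https://ultrasignup.com/service/events.svc/results/6727/json"


def fetch_results(url=RESULTS_URL):
    """Download the endpoint and decode its JSON body (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


def records_to_csv(records):
    """Render a list of dicts as CSV text, using the union of their keys as columns."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    # restval fills in fields that a given record does not have
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample records; the real field names come from the endpoint.
sample = [
    {"place": 1, "name": "Runner A"},
    {"place": 2, "name": "Runner B", "age": 40},
]
csv_text = records_to_csv(sample)
print(csv_text)
```

With network access, `records_to_csv(fetch_results())` would do the same for the live data. If pandas is installed, `pandas.DataFrame(records)` builds a DataFrame directly from a list of dicts, which is closer to what the question was after.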

score 0 · Answer 2 · answered Jun 23 '19 at 18:59

You can identify the correct XPath by inspecting the table's elements in the page source. Once you see which tags contain the table's content, build the XPath step by step.

For example:


<div class="test">
<p class="test2">
<table class="test3"> 
<!--May have more attributes-->
contents...
</table>
</p>
</div>

Then you begin your XPath with //div[@class="test"]. Now you are inside the div.

Next step: //div[@class="test"]//p[@class="test2"]. Now you are inside the paragraph tag.

Final step:

from selenium.webdriver.common.by import By

xpath = "//div[@class='test']//p[@class='test2']//table[@class='test3']"

# Pass the variable, not the string 'xpath'; find_element returns a single element
table = driver.find_element(By.XPATH, xpath)

Now you can access the table and read its attributes or its contents.
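The step-by-step construction can be tried offline before pointing Selenium at a real page. The sketch below is an illustration rather than part of the answer: it uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath (Selenium accepts full XPath), and it wraps the sample markup in a `<body>` element and adds a `<tr><td>` cell so there is something to read back.

```python
import xml.etree.ElementTree as ET

# The sample markup from above, wrapped in a root element so the
# descendant search (.//) can reach the outer <div>. The <tr><td>
# cell is added here for illustration.
SAMPLE = """
<body>
  <div class="test">
    <p class="test2">
      <table class="test3">
        <tr><td>contents...</td></tr>
      </table>
    </p>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# Build the path one step at a time and check that each step matches.
steps = [
    ".//div[@class='test']",
    ".//div[@class='test']//p[@class='test2']",
    ".//div[@class='test']//p[@class='test2']//table[@class='test3']",
]
matches = [root.findall(step) for step in steps]
for step, found in zip(steps, matches):
    print(step, "->", [el.tag for el in found])

# The last step yields the table; read the text of its cells.
table = matches[-1][0]
cell_text = [td.text for td in table.iter("td")]
print(cell_text)
```

If a step returns an empty list, the attribute or tag in that step is the one to recheck against the page source.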

Thanks YOGOVO, this begins to help me better understand the structure of the HTML source code. Would you be able to identify example XPaths based on the web page I provided? I am still struggling to identify the correct tags from the source code. — twb10, Jun 23 '19 at 19:09
