So here's my problem. I'm trying to use lxml to scrape a website and get some information, but the elements that hold the information aren't found when I call page.xpath(). The page itself loads, but the XPath query returns nothing.

import requests
from lxml import html

def main():
    result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')

    # the root of the tracker website
    page = html.fromstring(result.content)
    print('its getting the element from here', page)

    threesRank = page.xpath('//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tbody/tr[*]/td[3]/div/div[2]/div[1]/div')
    print('the 3s rank is: ', threesRank)

if __name__ == "__main__":
    main()

OUTPUT:
"D:\Python projects\venv\Scripts\python.exe" "D:/Python projects/main.py"

its getting the element from here <Element html at 0x20eb01006d0>
the 3s rank is:  []

Process finished with exit code 0

The output next to "the 3s rank is:" should look something like this

[<Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>]


2 Answers

Because the XPath expression does not match anything, page.xpath(..) returns an empty list. It's difficult to say exactly what you are looking for, but given the name "threesRank" I assume you want all the table values, i.e. the rankings and so on.

You can get a more accurate and self-explanatory XPath using the Chrome add-on "XPath Helper". Usage: open the site and activate the extension, then hold down the Shift key and hover over the element you are interested in.

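Since the HTML used by rocketleague.tracker.network is built dynamically using JavaScript with BootstrapVue (and Moment/Typeahead/jQuery), there is a big risk that the rendered markup differs from time to time, which makes a long positional XPath fragile. Note also that requests only downloads the page source and does not run any JavaScript, so elements that the browser creates at render time are not in the document lxml parses.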

Instead of scraping the rendered HTML, I suggest you use the structured data that the rendering is based on, which in this case is stored as JSON in a JavaScript variable called __INITIAL_STATE__:

import requests
import re
import json
from contextlib import suppress

# get page
result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')

# Extract everything needed to render the current page. The data is stored as JSON in the
# JavaScript variable: window.__INITIAL_STATE__={"route":{"path":"\u0 ... }};
# (the dot is escaped so it only matches a literal ".")
json_string = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});", result.text).group(1)

# convert text string to structured json data
rocketleague = json.loads(json_string)

# Save structured json data to a text file that helps you orient yourself and pick
# the parts you are interested in.
with open('rocketleague_json_data.txt', 'w') as outfile:
    outfile.write(json.dumps(rocketleague, indent=4, sort_keys=True))

# Access members using names
print(rocketleague['titles']['currentTitle']['platforms'][0]['name'])

# To avoid a KeyError when a key is missing, or an IndexError when a list index is out of
# range, use "with suppress" as in the example below: since there is no platform no. 99,
# the variable "platform99" is left unassigned and no exception is raised.
# (An out-of-range list index raises IndexError, not KeyError, so both are suppressed.)
with suppress(KeyError, IndexError):
    platform1 = rocketleague['titles']['currentTitle']['platforms'][0]['name']
    platform99 = rocketleague['titles']['currentTitle']['platforms'][99]['name']

# print platforms used by currentTitle
for platform in rocketleague['titles']['currentTitle']['platforms']:
    print(platform['name'])

# print all titles with corresponding platforms
for title in rocketleague['titles']['titles']:
    print(f"\nTitle: {title['name']}")
    for platform in title['platforms']:
        print(f"\tPlatform: {platform['name']}")
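
As a possible alternative to wrapping each lookup in "with suppress", here is a minimal sketch of two small helpers. The names dig and find_key_paths are made up for this illustration and are not part of any library, and the sample dict is hypothetical: it only imitates the shape of the fragment used above and is not real data from the site. dig follows a list of keys and indexes and returns a default when any step is missing; find_key_paths lists the path to every key whose name contains a search word, which may help when looking for the fields you care about in the saved JSON file.

```python
def dig(data, path, default=None):
    """Follow dict keys / list indexes in order; return default if a step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            # KeyError: missing dict key, IndexError: list index out of range,
            # TypeError: tried to index into something that is not a dict or list
            return default
    return current


def find_key_paths(data, needle, path=()):
    """Yield the path to every dict key whose name contains needle (case-insensitive)."""
    if isinstance(data, dict):
        for key, value in data.items():
            child_path = path + (key,)
            if needle.lower() in str(key).lower():
                yield child_path
            yield from find_key_paths(value, needle, child_path)
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from find_key_paths(value, needle, path + (index,))


# Hypothetical sample, shaped like the fragment used in the answer (not real site data).
sample = {"titles": {"currentTitle": {"platforms": [{"name": "Xbox"}, {"name": "Steam"}]}}}

print(dig(sample, ["titles", "currentTitle", "platforms", 0, "name"]))
print(dig(sample, ["titles", "currentTitle", "platforms", 99, "name"], default="n/a"))
for found in find_key_paths(sample, "name"):
    print(found)
```

With the real data you would pass the rocketleague dict instead of sample and search for a word such as "rank"; whether any key actually contains that word depends on the JSON the site returns.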
IODEV
  • Thanks, using that helps a lot. I actually was getting some variation in results from the site, which is why I was trying to use lxml. I'm actually trying to get the different ranks from the site. How would I path it to get to the actual rank values? I'm not too familiar with pathing. – Andrew Prince Apr 16 '21 at 21:21
  • Okay, so I got everything to work, and it worked for a little less than a week, but just now I'm getting an error where I can't see the ranks anymore. The field seems to be empty; in fact, none of the ranks show up anywhere in the JSON text file. What do I do? – Andrew Prince Apr 21 '21 at 21:49
  • The website is currently offline: `This page (https://rocketleague.tracker.network/) is currently offline. However, because the site uses Cloudflare's Always Online™ technology you can continue to surf a snapshot of the site. We will keep checking in the background and, as soon as the site comes back, you will automatically be served the live version.` – IODEV Apr 22 '21 at 05:45
  • I recommend you check the status when calling `result = requests.get(...)`, for example: `if result.status_code != 200: print("site is down") ...` – IODEV Apr 22 '21 at 06:00
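
The status check suggested in the last comment could be sketched roughly as below. This is only an illustration under some assumptions: load_initial_state is a hypothetical helper name, it accepts any object with status_code and text attributes (such as the result of requests.get), and the three SimpleNamespace objects are stand-ins so the sketch runs without network access. It also swaps the non-greedy regex capture for json.JSONDecoder().raw_decode, which parses one JSON value starting at a given position and ignores whatever follows it; that assumes the variable holds plain JSON.

```python
import json
import re
from types import SimpleNamespace

# Locates the assignment; the JSON value is expected to start right after it.
MARKER = re.compile(r"window\.__INITIAL_STATE__\s*=\s*")


def load_initial_state(response):
    """Return the parsed __INITIAL_STATE__ data, or None if the page is unusable."""
    if response.status_code != 200:
        print(f"request failed with status {response.status_code}")
        return None
    match = MARKER.search(response.text)
    if match is None:
        print("no __INITIAL_STATE__ in the page (snapshot or error page?)")
        return None
    try:
        # raw_decode returns (parsed value, index where the value ended)
        state, _end = json.JSONDecoder().raw_decode(response.text, match.end())
    except json.JSONDecodeError as error:
        print(f"could not parse the embedded JSON: {error}")
        return None
    return state


# Stand-in responses for illustration only (not real site output).
ok = SimpleNamespace(
    status_code=200,
    text='<script>window.__INITIAL_STATE__={"route":{"path":"/x"}};</script>',
)
down = SimpleNamespace(status_code=503, text="offline")
snapshot = SimpleNamespace(status_code=200, text="<html>cached snapshot</html>")

for response in (ok, down, snapshot):
    print(load_initial_state(response))
```

Checking for the marker as well as the status code is a deliberate choice here, since a cached or error page might still come back with status 200 but without the data.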

lxml doesn't support "tbody". Change your XPath to:

'//*[@id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tr[*]/td[3]/div/div[2]/div[1]/div'
Cloud11665
  • lxml can handle any kind of tag or attribute, including tbody. There may be several different problems involved here: 1) the site **rocketleague.tracker.network** produces bad HTML, such as a massive number of duplicate attribute names (check [validator.w3.org](https://validator.w3.org/nu/?doc=https%3A%2F%2Frocketleague.tracker.network%2Frocket-league%2Fprofile%2Fxbl%2FReedyOrange%2Foverview)); 2) the HTML is dynamically generated using BootstrapVue; 3) browsers sometimes insert a tbody element into a table: https://stackoverflow.com/questions/938083 – IODEV Apr 16 '21 at 14:02
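
To illustrate point 3 of the comment above, here is a small sketch of what happens when a path copied from the browser contains tbody but the source being parsed does not. It uses the standard library's xml.etree.ElementTree as a dependency-free stand-in for lxml, and the markup and rank names are invented for the example (they are not taken from the site). The same three paths can be tried with lxml's xpath() on a real document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page source: note that the table has no <tbody>.
SOURCE = """
<html><body><div id="app">
  <table>
    <tr><td>Ranked Duel 1v1</td><td>Gold I</td></tr>
    <tr><td>Ranked Standard 3v3</td><td>Platinum II</td></tr>
  </table>
</div></body></html>
"""

root = ET.fromstring(SOURCE)

# Path as a browser's dev tools might report it (the browser DOM has a tbody).
with_tbody = root.findall(".//table/tbody/tr")
# Same path with the tbody step removed.
without_tbody = root.findall(".//table/tr")
# Descendant step: matches rows whether or not a tbody sits in between.
any_depth = root.findall(".//table//tr")

print(len(with_tbody), len(without_tbody), len(any_depth))
```

The descendant form (//tr under the table) is usually the safer choice because it does not depend on whether a tbody is present. None of this helps, of course, if the table is only created by JavaScript and is missing from the downloaded source altogether.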