
I am trying to grab values from a web page. My python code currently looks like this...

from lxml import html
import requests

if __name__ == "__main__":
    page = requests.get('https://www.example.com/example')
    tree = html.fromstring(page.content)
    print(tree.xpath('//div[@class="previous-crashes"]/text()'))

Here is an example of the HTML I am trying to parse. In theory, I want a list containing 12.54x, 5x, 1.06x, 12.54x, 1.93x. With the current code it always prints an empty list.

  • You may want to post the original URL so we can test. I've posted my answer below without testing because I don't have the actual URL. If it helped you, please consider accepting it as the correct answer, thanks! – Pedro Lobito Apr 25 '20 at 22:29
  • Thank you for the post and comment. I have come across more knowledge I did not previously know. The code you provided did work for grabbing some contents; however, the page I am scraping is dynamic and uses JavaScript, so I will have to use Selenium to actually retrieve the data I am looking for. Thanks for the help! – Tony Alexander Apr 25 '20 at 22:58

3 Answers


I am not entirely sure, but the website probably has some anti-scraping measures, and that is why you get back an empty result.

  • If there are anti-scraping measures at all and they are causing the problem, in most cases they can be bypassed just by changing the User-Agent. https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python – Demian Wolf Apr 25 '20 at 23:10
  • Though, it looks like this particular question's problem is not caused by anti-scraping measures. – Demian Wolf Apr 25 '20 at 23:12
  • Anti-scraping measures that can be bypassed by changing the UA are very basic ones; there are much more advanced ones. Try scraping Google lol – Nothingless Apr 25 '20 at 23:46
  • Of course. But I said **in most cases**, not in *all* cases. – Demian Wolf Apr 25 '20 at 23:55
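As a sketch of the User-Agent trick mentioned in the comments (the header value and URL are placeholders, not from the original question):

```python
import requests

# A browser-like User-Agent; many basic blockers only check this header.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0 Safari/537.36"
}

# Building the request without sending it shows the header is attached;
# in practice you would simply call requests.get(url, headers=headers).
prepared = requests.Request(
    "GET", "https://www.example.com/example", headers=headers
).prepare()
print(prepared.headers["User-Agent"])
```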

You can try:

from bs4 import BeautifulSoup
import requests

req = requests.get("https://domain.tld")
soup = BeautifulSoup(req.text, 'html.parser')
pointers = soup.find_all("span", {"class": "pointer"})
for pointer in pointers:
    print(pointer.text)
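Untested against the real page (the URL wasn't posted), but against a static stand-in snippet shaped like the screenshot it would behave like this:

```python
from bs4 import BeautifulSoup

# A stand-in for the page's markup, based on the class names in the question.
html_doc = """
<div class="previous-crashes">
  <span class="pointer">12.54x</span>
  <span class="pointer">5x</span>
  <span class="pointer">1.06x</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
values = [span.text for span in soup.find_all("span", {"class": "pointer"})]
print(values)  # ['12.54x', '5x', '1.06x']
```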
Pedro Lobito
  • Yes, this works. This solves my issue of grabbing the html. Thank you. – Tony Alexander Apr 25 '20 at 22:59
  • You're welcome! If my answer helped you, please consider accepting it as the correct answer, thanks! I'm monitoring the selenium tag and I'll try to help if you open a new question with that tag. – Pedro Lobito Apr 25 '20 at 23:02
from lxml import html
import requests

page = requests.get('https://www.example.com/')
doc = html.fromstring(page.content)

elements = doc.find_class('previous-crashes')  # all elements with class="previous-crashes"
for el in elements:
    pointers = el.find_class('pointer')  # nested elements with class="pointer"
    for pointer in pointers:
        print(pointer.text_content())

This will give you the span text values from the HTML in the image you linked.
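For example, run against a stand-in snippet that mimics the question's markup (only the class names are taken from the question):

```python
from lxml import html

# Stand-in markup mirroring the class names from the question.
snippet = """
<div class="previous-crashes">
  <span class="pointer">12.54x</span>
  <span class="pointer">5x</span>
</div>
"""

doc = html.fromstring(snippet)
for el in doc.find_class('previous-crashes'):
    for pointer in el.find_class('pointer'):
        print(pointer.text_content())
# prints 12.54x then 5x
```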

margusholland