
I am trying to grab values from a web page. My python code currently looks like this...

from lxml import html
import requests

if __name__ == "__main__":
    page = requests.get('https://www.example.com/example')
    tree = html.fromstring(page.content)
    print(tree.xpath('//div[@class="previous-crashes"]/text()'))

Here is an example of the HTML I am trying to parse. In theory, I want a list containing 12.54x, 5x, 1.06x, 12.54x, 1.93x. With the current code it always prints an empty list.

  • You may want to post the original URL so we can test. I've posted my answer below without testing because I don't have the actual URL. If it helped you, please consider accepting it as the correct answer, thanks! – Pedro Lobito Apr 25 '20 at 22:29
  • Thank you for the post and comment. I have come across more knowledge I did not previously know. The code you provided did work for grabbing some contents; however, the page I am scraping is dynamic and uses JavaScript, so I will have to use Selenium to actually retrieve the data I am looking for. Thanks for the help! – Tony Alexander Apr 25 '20 at 22:58

3 Answers


I am not entirely sure, but the website probably has some anti-scraping measures, and that is why you get back an empty result.

  • If there are anti-scraping measures at all and they are causing the problem, in most cases they can be bypassed just by changing the User-Agent. https://stackoverflow.com/questions/10606133/sending-user-agent-using-requests-library-in-python – Demian Wolf Apr 25 '20 at 23:10
  • Though, it looks like this particular question's problem is not caused by anti-scraping measures. – Demian Wolf Apr 25 '20 at 23:12
  • Anti-scraping measures that can be bypassed by changing the UA are very basic ones; there are much more advanced ones. Try scraping Google lol – Nothingless Apr 25 '20 at 23:46
  • Of course. But I said **in most cases**, not in *all* cases. – Demian Wolf Apr 25 '20 at 23:55
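As a sketch of the User-Agent trick mentioned in the comments (the header value and URL are placeholders, not from the original question):

```python
import requests

# A browser-like User-Agent; many basic blockers only check this header.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0 Safari/537.36"
}

# Building the request without sending it shows the header is attached;
# in practice you would simply call requests.get(url, headers=headers).
prepared = requests.Request(
    "GET", "https://www.example.com/example", headers=headers
).prepare()
print(prepared.headers["User-Agent"])
```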

You can try:

from bs4 import BeautifulSoup
import requests

req = requests.get("https://domain.tld")
soup = BeautifulSoup(req.text, 'html.parser')
pointers = soup.find_all("span", {"class": "pointer"})
for pointer in pointers:
    print(pointer.text)
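Untested against the real page (the URL wasn't posted), but against a static stand-in snippet shaped like the screenshot it would behave like this:

```python
from bs4 import BeautifulSoup

# A stand-in for the page's markup, based on the class names in the question.
html_doc = """
<div class="previous-crashes">
  <span class="pointer">12.54x</span>
  <span class="pointer">5x</span>
  <span class="pointer">1.06x</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
values = [span.text for span in soup.find_all("span", {"class": "pointer"})]
print(values)  # ['12.54x', '5x', '1.06x']
```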
Pedro Lobito
  • Yes, this works. This solves my issue of grabbing the html. Thank you. – Tony Alexander Apr 25 '20 at 22:59
  • You're welcome! If my answer helped you, please consider accepting it as the correct answer, thanks! I'm monitoring the selenium tag and I'll try to help if you open a new question with that tag. – Pedro Lobito Apr 25 '20 at 23:02
from lxml import html
import requests

page = requests.get('https://www.example.com/')
doc = html.fromstring(page.content)

elements = doc.find_class('previous-crashes')  # all elements with class="previous-crashes"
for el in elements:
    pointers = el.find_class('pointer')  # nested elements with class="pointer"
    for pointer in pointers:
        print(pointer.text_content())

This will give you the span text values from the HTML in the image you linked.
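For example, run against a stand-in snippet that mimics the question's markup (only the class names are taken from the question):

```python
from lxml import html

# Stand-in markup mirroring the class names from the question.
snippet = """
<div class="previous-crashes">
  <span class="pointer">12.54x</span>
  <span class="pointer">5x</span>
</div>
"""

doc = html.fromstring(snippet)
for el in doc.find_class('previous-crashes'):
    for pointer in el.find_class('pointer'):
        print(pointer.text_content())
# prints 12.54x then 5x
```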

margusholland