I am trying to parse the website "https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price" and extract the most recent messages from its board. The site is protected against bots by Cloudflare. I am using Python and related libraries, and this is what I have so far:

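from bs4 import BeautifulSoup as soup  # parses the HTML
import cfscrape  # imported but not used yet
import requests

url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price'

r = requests.get(url)
html = soup(r.text, "html.parser")
# find() returns None when the element is missing, e.g. when the request is blocked
containers = html.find("div", {"id": "bbPosts"})
if containers is None:
    print("bbPosts not found; HTTP status:", r.status_code)
else:
    print(containers.text.strip())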

I am not able to use the HTML parser this way because the site detects my script and blocks it. My question is: how can I fetch and parse the web pages to pull the table data?

I should mention that this is for a security class I am taking. I am not using this for malicious purposes.

Vlad Bogza
  • You don't, or rather you shouldn't. If a site is putting effort into blocking scraping bots, it means they don't want people scraping their data. If they don't offer an API, don't scrape their site without their consent. As for your question, asking for a tutorial is off topic on this site. – MooingRawr Apr 09 '18 at 19:10
  • How do you know it blocks your script? How do you know it's not working? It's probably running some JavaScript. You'll need something like [Selenium](https://selenium-python.readthedocs.io/). – Peter Wood Apr 09 '18 at 19:20
  • I am not using this for any sort of project. I am using this for practice. – Vlad Bogza Apr 09 '18 at 20:22
  • I want to be familiar with the concepts of bot detection and prevention. @MooingRawr – Vlad Bogza Apr 09 '18 at 20:23
  • It seems like they are using Angular's data binding. I would suggest trying a different approach, like taking a snapshot of the website: [link](https://stackoverflow.com/questions/1197172/how-can-i-take-a-screenshot-image-of-a-website-using-python) –  Apr 27 '18 at 05:17

1 Answer

There are multiple ways of bypassing site protection, but first you have to work out exactly how they are blocking you.
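
As a rough starting point for working out how a request is being blocked, you can look at the status code, the response headers, and the size of the body. The sketch below is only a heuristic: the function name describe_response, the status codes it checks, and the 500-character threshold are my own assumptions, not anything the site documents. The Server and CF-RAY headers are ones Cloudflare commonly adds to responses.

```python
def describe_response(status_code, headers, body):
    """Return a list of hints about how a response may have been blocked.

    Hypothetical helper: the checks below are rough heuristics, not rules.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    hints = []

    # Cloudflare commonly identifies itself in these response headers.
    if "cloudflare" in lowered.get("server", "").lower() or "cf-ray" in lowered:
        hints.append("served through Cloudflare")

    # Assumed set of status codes that often accompany blocking or throttling.
    if status_code in (403, 429, 503):
        hints.append(
            f"status {status_code} suggests the request was refused or throttled"
        )

    # Assumed threshold: a successful but tiny page may be a JavaScript shell.
    if status_code == 200 and len(body.strip()) < 500:
        hints.append("very small body; content may be rendered by JavaScript")

    return hints
```

With the requests library you could call it as describe_response(r.status_code, r.headers, r.text) and read the hints before deciding which of the approaches below to try.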

One common way of blocking requests is to look at the User-Agent header. The client (in your case, the requests library) uses it to tell the server its identity.

Generally speaking, a browser will say "I am a browser" and a library will say "I am a library". The server can then decide to allow browsers, but not libraries, to access its content.

However, for this particular case, you can simply lie to the server by sending your own User Agent header.

You can see an example here. Try to use your browser's user agent.
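
A minimal sketch of sending your own User-Agent header, using only the standard library so it runs without extra packages. The user agent string is an assumed example value, and build_request and fetch are hypothetical helper names; copy the real string from your own browser. This only addresses user agent checks, so it may not be enough for this particular site.

```python
import urllib.request

# Assumed example value; replace it with the string your own browser sends
# (visible in the network tab of its developer tools).
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_USER_AGENT, timeout=10):
    """Download the page and return its body as text (performs network I/O)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With the requests library the same idea is requests.get(url, headers={"User-Agent": BROWSER_USER_AGENT}).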

Other blocking techniques include filtering by IP range. One way to bypass this is via a VPN. This is one of the easiest VPNs to set up. Just spin up a machine on Amazon and get this container running.

Another possibility is that you are trying to access a single-page application that is not rendered server side. In this case, what you receive from that GET request is a very small HTML file that essentially just references a JavaScript file. If so, what you need is an actual browser that you control programmatically. I would suggest you look at Google Chrome Headless, though there are others. You can also use Selenium.
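
To check whether you are in this situation, you can test whether the HTML you got back contains almost no visible text but does reference external scripts. This is a rough heuristic sketch using only the standard library. The name looks_like_js_shell and the 200-character threshold are assumptions of mine, and a page can fail or pass this check for other reasons.

```python
from html.parser import HTMLParser


class _ShellProbe(HTMLParser):
    """Counts visible text characters and external <script src=...> tags."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_srcs = 0
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script" and dict(attrs).get("src"):
                self.script_srcs += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style contents; count everything else.
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def looks_like_js_shell(html_text, max_text_chars=200):
    """Return True if the page has little visible text and loads external scripts."""
    probe = _ShellProbe()
    probe.feed(html_text)
    probe.close()
    return probe.script_srcs > 0 and probe.text_chars <= max_text_chars
```

If looks_like_js_shell(r.text) returns True for the page you fetched, parsing the raw response will not find the posts, and a browser-driven approach is the more likely route.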

Web crawling is a beautiful but very deep subject. I think these pointers should set you in the right direction.


Also, as a quick note, my advice is to avoid from bs4 import BeautifulSoup as soup; I would recommend html2text instead.

mayk93