0

I am having trouble scraping info from the URL http://csgo-stats.com/epsilon-/. Because of the way the website loads its content, BeautifulSoup only collects data from the root page, i.e. http://csgo-stats.com

Is there a redirect going on that's tripping up BS? I can see in the HTML that BS outputs that the page is trying to load my data, but BS captures it too early:

<main class="site-content" id="content">
        <div class="loading-spinner" data-request="epsilon-" id="load">
            Loading
        </div>

Here is the code I'm working with just in case it's needed:

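from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://csgo-stats.com/Epsilon-/"
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(urlopen(url), "html.parser")
# Prints the HTML as the server sent it; no JavaScript has run at this point
print(soup.prettify())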
smci
  • Try this one which emulates a browser and should execute the javascript perfectly: http://phantomjs.org/ – tim Jan 26 '17 at 19:17
  • Or you could just [use the Steam API directly](http://stackoverflow.com/q/27752856/344286) – Wayne Werner Jan 26 '17 at 19:22
  • Just so you know, there is no need to edit thanks into your question after you've received an answer. If you have discovered something substantive that is not covered by an existing answer, you are most welcome to create a new answer of your own. – halfer Jan 30 '17 at 20:05

2 Answers

1
smci
petr
  • I actually wasn't aware of the Steam API, to be honest. I'll bypass my entire problem by using it. Thanks for letting me know! I chose your answer as the solution as it's the easiest and exactly what I need without any extra hassle. Thanks! – Isaiah Feldt Jan 26 '17 at 20:00
0

Most HTTP and parsing libraries (Beautiful Soup, requests, ...) give you the page source as the server sent it, which is not what the page looks like once it renders in the browser. Much of a modern page is built later, when the JavaScript on the page does its work, and these libraries do not run JavaScript. This is exactly why you do not see the 'final' content.
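To check whether what you fetched is still the unrendered shell, you can look for the placeholder in the raw HTML. Below is a minimal sketch using only the standard library. The `loading-spinner` class and the `data-request` attribute come from the snippet in the question; `SpinnerFinder` and `pending_requests` are names made up for illustration, not part of any library.

```python
from html.parser import HTMLParser


class SpinnerFinder(HTMLParser):
    """Collects the data-request value of every loading-spinner element."""

    def __init__(self):
        super().__init__()
        self.pending = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "loading-spinner" in classes:
            self.pending.append(attrs.get("data-request"))


def pending_requests(html):
    """Return the requests the page still intends to load via JavaScript."""
    finder = SpinnerFinder()
    finder.feed(html)
    return finder.pending


# The placeholder markup quoted in the question
sample = """<main class="site-content" id="content">
        <div class="loading-spinner" data-request="epsilon-" id="load">
            Loading
        </div>"""

print(pending_requests(sample))  # ['epsilon-']
```

A non-empty result suggests the real data has not been loaded yet, so no amount of parsing will find it in that response.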

Now, if you wish to collect the content the way the browser renders it, after all the JavaScript music has played, you need another kind of (Python) library, and one such library is Selenium.

More on Selenium: http://www.seleniumhq.org/

Just to warn you: Selenium is a pretty large beast with a lot of hairy ends, but learning it is worthwhile (not only for scraping).
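As a rough sketch of the Selenium approach: load the page in a real browser, poll until the loading spinner is gone, then read the rendered source. This assumes the site removes the element with id `load` (seen in the question's snippet) once the data arrives; I have not verified that, so you may need to wait on a different element. `wait_for_rendered_html` and `scrape_with_firefox` are hypothetical helper names, and the Firefox part assumes Selenium and a working Firefox driver are installed.

```python
import time


def wait_for_rendered_html(driver, url, spinner_id="load", timeout=15.0, poll=0.5):
    """Load `url` and return the page source once the spinner element is gone.

    ASSUMPTION: the site removes the element with id `spinner_id` when the
    data has loaded. If it only hides or refills it, wait on something else.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # find_elements returns an empty list when nothing matches
        if not driver.find_elements("id", spinner_id):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"#{spinner_id} still present after {timeout}s")
        time.sleep(poll)


def scrape_with_firefox(url):
    """Hypothetical entry point; needs Selenium and a Firefox driver."""
    # Imported lazily so the helper above works without Selenium installed
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        return wait_for_rendered_html(driver, url)
    finally:
        driver.quit()  # always close the browser, even on timeout
```

The string returned by `scrape_with_firefox("http://csgo-stats.com/Epsilon-/")` can then be handed to `BeautifulSoup(html, "html.parser")` just like the output of `urlopen`.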

ljgww