urllib.request + BeautifulSoup cannot scrape certain page, instead scrape root page

Question

I am having trouble scraping data from http://csgo-stats.com/epsilon-/. Because of the way the website loads its content, BeautifulSoup only collects data from the root page, http://csgo-stats.com.

Is there a redirect going on that's tripping up BeautifulSoup? In the HTML that BeautifulSoup outputs I can see the page trying to load my data, but it seems to be captured before the data arrives:

<main class="site-content" id="content">
        <div class="loading-spinner" data-request="epsilon-" id="load">
            Loading
        </div>

Here is the code I'm working with just in case it's needed:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://csgo-stats.com/Epsilon-/"
# Pass the parser explicitly; without it bs4 picks one itself and emits a warning.
soup = BeautifulSoup(urlopen(url), "html.parser")
print(soup.prettify())

Try this one which emulates a browser and should execute the javascript perfectly: http://phantomjs.org/ — tim, Jan 26 '17 at 19:17
Or you could just [use the Steam API directly](http://stackoverflow.com/q/27752856/344286) — Wayne Werner, Jan 26 '17 at 19:22
Just so you know, there is no need to edit thanks into your question after you've received an answer. If you have discovered something substantive that is not covered by an existing answer, you are most welcome to create a new answer of your own. — halfer, Jan 30 '17 at 20:05

score 1 · Accepted Answer · edited Jan 15 '21 at 18:12

The problem is that urllib.request does not execute JavaScript, so you only get the initial HTML with the loading placeholder. Try visiting the page in a browser with JavaScript disabled to see what urllib.request receives. More on JavaScript-enabled scraping: Web-scraping JavaScript page with Python
It's always best to avoid scraping if an API is provided (Getting CS:GO player stats).
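
As a rough illustration of the point above, the sketch below checks whether a downloaded page is still the unrendered placeholder. It assumes the `loading-spinner` class and `data-request` attribute shown in the question's HTML; the names `SpinnerFinder` and `pending_requests` are made up for this example, and it uses only the standard library so it runs without extra installs.

```python
from html.parser import HTMLParser


class SpinnerFinder(HTMLParser):
    """Collect data-request values from elements with the loading-spinner class."""

    def __init__(self):
        super().__init__()
        self.pending = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # class may hold several space-separated names, or be missing entirely.
        classes = (attributes.get("class") or "").split()
        if "loading-spinner" in classes:
            self.pending.append(attributes.get("data-request"))


def pending_requests(html):
    """Return the data-request values of any spinner placeholders found in html."""
    finder = SpinnerFinder()
    finder.feed(html)
    finder.close()
    return finder.pending


# The fragment from the question: the data has not been loaded yet.
sample = """
<main class="site-content" id="content">
    <div class="loading-spinner" data-request="epsilon-" id="load">
        Loading
    </div>
</main>
"""
print(pending_requests(sample))
```

A non-empty result suggests the HTML was captured before any JavaScript ran, which is all that urllib.request can ever see.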

petr · answered Jan 26 '17 at 19:22 · edited by smci

I actually wasn't aware of the Steam API, to be honest. I'll bypass my entire problem by using it. Thanks for letting me know! I chose your answer as the solution as it's the easiest and exactly what I need without any extra hassle. Thanks! – Isaiah Feldt Jan 26 '17 at 20:00

ljgww · Answer 2 · 2017-01-26T19:42:48.327

While most HTTP and parsing libraries (Beautiful Soup, requests, ...) will get you the page source, that is not how the page looks once it renders in the browser. Much of a modern page is built later, when the JavaScript on the page does its work, and these libraries never run it. This is exactly why you do not see the 'final' content.

Now, if you wish to collect the content the way the browser renders it after all the JavaScript music has played, you need another kind of (Python) library, and that library is Selenium.

More on Selenium on: http://www.seleniumhq.org/

Just to warn you: Selenium is a pretty large beast with a lot of hairy ends, but learning it is worthwhile (and not only for scraping).
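
A minimal sketch of what that could look like, assuming Selenium 4 and a local Firefox install. The function name `fetch_rendered_html` is hypothetical, the spinner id `load` is taken from the HTML in the question, and the idea that the page hides or removes the spinner once the data arrives is an assumption about this site, not something confirmed here.

```python
def fetch_rendered_html(url, spinner_id="load", timeout=15):
    """Load url in a real browser and return the HTML once the spinner is gone."""
    # Imported here so the function can be defined even where Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Assumption: the page removes or hides the spinner when the data arrives.
        # Raises selenium.common.exceptions.TimeoutException if it never does.
        WebDriverWait(driver, timeout).until(
            EC.invisibility_of_element_located((By.ID, spinner_id))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned string could then be handed to BeautifulSoup as before, for example `BeautifulSoup(fetch_rendered_html("http://csgo-stats.com/epsilon-/"), "html.parser")`.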

I will definitely look into this. Thank you – Isaiah Feldt Jan 26 '17 at 20:03
