
I am learning to use Python Selenium and BeautifulSoup for web scraping. Currently, I am trying to scrape the hot searches from Google Trends: http://www.google.com/trends/hottrends#pn=p5

This is my current code. However, I realized the full HTML is not downloaded and I only have content from the most recent few dates. What can I do to rectify this problem?

from selenium import webdriver
from bs4 import BeautifulSoup

googleURL = "http://www.google.com/trends/hottrends#pn=p5"

browser = webdriver.Firefox()
browser.get(googleURL)
content = browser.page_source

soup = BeautifulSoup(content, "html.parser")  # name a parser explicitly
print(soup)
user2392965
  • Any specific reasons for using webdrivers via selenium for this? – Torxed May 17 '13 at 08:00
  • @Torxed -- I suspect it is because of dynamic content / javascript handling... – root May 17 '13 at 08:06
  • yes i tried urllib2 but it did not work due to the dynamic content/javascript – user2392965 May 17 '13 at 08:20
  • Then perhaps mention that you've tried that, and that there's JavaScript on the page you're running against, instead of giving me downvotes -.- You didn't mention anything besides the fact that your HTML wasn't downloading completely, and urllib2 is more reliable for determining why data isn't downloaded (sockets even better). But you obviously know what your problem is, so... yeah, the downvote seems unfair. – Torxed May 17 '13 at 08:28
  • 1
    @Torxed -- Well, OP did link the actual page. – root May 17 '13 at 08:32
  • 1
    @Torxed I'm quite new to StackOverflow and I did not give you the downvote – user2392965 May 17 '13 at 08:38
  • @root Yes, and Google's source code for all their webpages is crowded with a shitstorm of stuff, so I tend to skim through it and see what the user has tried so far. According to OP he hadn't tried urllib2, and at the time of writing my answer the JavaScript problem wasn't known; despite that, people tend to give downvotes on good answers before the OP describes the actual problem, giving you no reason to post on semi-decent questions. Anyway, sorry to give you a non-conclusive answer user2392965, gl with your endeavours. – Torxed May 17 '13 at 08:47
  • These days (and for the last ten years at least) I think it's safe to say that most websites have various protections against scraping without using live browser automation. Also this is a good time to remember that "downvotes aren't personal" – Darren Ringer Jul 28 '15 at 20:00

1 Answer


Users add more content to the page (from previous dates) by clicking the <div onclick="control.moreData()" id="moreLink">More...</div> element at the bottom of the page.

So to get your desired content, you could use Selenium to click the id="moreLink" element or execute some JavaScript to call control.moreData(); in a loop.

For example, if you want to get all content as far back as Friday, February 15, 2013 (it looks like a string of this format exists for every loaded date), your Python might look something like this:

content = browser.page_source
desired_content_is_loaded = False
while not desired_content_is_loaded:
    if "Friday, February 15, 2013" not in content:
        browser.execute_script("control.moreData();")
        content = browser.page_source
    else:
        desired_content_is_loaded = True
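The stop condition in that loop can be pulled out into a tiny helper, which also makes it easy to swap in a different target date. This is just a sketch; `needs_more` and `target_date` are illustrative names, not part of the original answer:

```python
def needs_more(page_source, target_date):
    """Return True while the loaded page does not yet reach back to target_date."""
    return target_date not in page_source

# Hypothetical use with a live WebDriver instance named `browser`:
#     while needs_more(browser.page_source, "Friday, February 15, 2013"):
#         browser.execute_script("control.moreData();")
```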

EDIT:

If you disable JavaScript in your browser and reload the page, you will see that there is no "trends" content at all. What that tells me is that those items are loaded dynamically; they are not part of the HTML document which is downloaded when you open the page. Selenium's .get() waits for the HTML document to load, but not for all JS to complete. There's no telling whether async JS will complete before or after any other event: it completes when it's ready, and that could be different every time. That would explain why you might sometimes get all, some, or none of that content when you call browser.page_source, because it depends on how fast the async JS happens to be running at that moment.

So, after opening the page, you might try waiting a few seconds before getting the source - giving the JS which loads the content time to complete.

import time

browser.get(googleURL)
time.sleep(3)
content = browser.page_source
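Rather than hard-coding a fixed sleep, the same idea can be expressed as a small poll-until-ready helper. This is a generic sketch (the `wait_for` name is illustrative); in Selenium itself the equivalent is selenium.webdriver.support.ui.WebDriverWait:

```python
import time

def wait_for(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError on expiry.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# Hypothetical usage with a live WebDriver instance named `browser`:
#     wait_for(lambda: "Friday, February 15, 2013" in browser.page_source)
```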
Dingredient
  • thanks for the answer. However, the problem is I'm not even getting all the results from the first page e.g. I only get three records even when there are six on the initial screen. Btw, is there a way to automate the scrolling down instead of hard-coding the date Friday, Feb 15, 2013? Thanks. – user2392965 May 17 '13 at 23:42
  • 4
    instead of time.sleep i would look into selenium.webdriver.support.ui.WebDriverWait http://stackoverflow.com/questions/9823272/python-selenium-waiting-for-frame-element-lookups – qwwqwwq May 20 '13 at 20:56
  • I edited my answer to explain why you might be getting only some of the results when there are more on the screen. – Dingredient Jun 07 '13 at 00:16
  • but how do you download (and save) this data? I am getting errors on u'\xae' when I try to write it as an ASCII file – user391339 Apr 03 '16 at 00:22
  • File IO is a whole other topic, but in Python it's pretty simple. This guy's answer is nice and concise, for a basic example: http://stackoverflow.com/a/30021479/2386700 – Dingredient Apr 03 '16 at 03:09