On the site, there are a few links at the top labeled 1, 2, 3, and next. Clicking a numbered link dynamically loads some data into a content div. Clicking next goes to a page with links labeled 4, 5, 6, and next, and shows the data for page 4.

I want to scrape the data from the content div for every numbered link. I don't know how many there are; the site shows only three at a time, plus next.

Please give an example of how to do it. For instance, consider the site www.cnet.com.

Please show me how to download the series of pages using Selenium; I will parse them with Beautiful Soup on my own.

Let Me Tink About It
Koushik

1 Answer

General layout (not tested):

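#!/usr/bin/env python
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import Firefox # pip install selenium
from selenium.webdriver.common.by import By

url = "http://example.com"

def find_link(browser, text):
    """Return the link with the given text, or None if there is no such link."""
    try:
        # Selenium 4 removed find_element_by_link_text(); use find_element(By.LINK_TEXT, ...)
        return browser.find_element(By.LINK_TEXT, text)
    except NoSuchElementException:
        return None

# use firefox to get page with javascript generated content
browser = Firefox()
try:
    n = 1
    while n < 10: # arbitrary cap for this sketch; raise it for longer listings
        browser.get(url) # load page
        link = find_link(browser, str(n))
        while link:
            browser.get(link.get_attribute("href")) # get individual 1,2,3,4 pages
            #### save(browser.page_source)
            browser.back() # return to page that has 1,2,3,next -like links
            n += 1
            link = find_link(browser, str(n))

        link = find_link(browser, "next")
        if not link: break
        url = link.get_attribute("href")
finally:
    browser.quit() # close the browser and stop the driver process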
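
The `save(...)` call in the layout above is left as a placeholder. One possible way to fill it in, as a rough sketch: write each page's source to a numbered file so it can be parsed later. The function name `save_page`, the `pages` directory, and the file naming scheme are assumptions made for illustration; they are not part of the original layout.

```python
import tempfile
from pathlib import Path


def save_page(html, n, out_dir="pages"):
    """Write one page's HTML to out_dir/page_<n>.html and return the path.

    The directory name and file naming scheme are hypothetical choices.
    """
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    path = directory / f"page_{n:03d}.html"
    path.write_text(html, encoding="utf-8")
    return path


# Small demonstration that writes into a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_page("<html><body>page 4</body></html>", 4, out_dir=tmp)
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")

print(saved_name)
```

Inside the scraping loop this would be called as `save_page(browser.page_source, n)`. Zero-padding the page number keeps the files in page order when they are listed alphabetically.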
jfs
  • The post was helpful, but I need to find the element by its class name. – Koushik Dec 28 '11 at 11:15
  • @user1118534: [update your question](http://stackoverflow.com/posts/8650999/edit) and specify what `links at the top labeled "1", "2", "3", and "next"` means in your case (if you're unsure then just post the html of the link: `...`). You could use `browser.find_element_by_class_name(classname)` to find an element by its class name. – jfs Dec 28 '11 at 12:47
  • I am learning to scrape web sites that use JavaScript. As an exercise, I would like to scrape the editor reviews and user reviews for all the HP laptops on www.cnet.com. To reach the desired page: go to www.cnet.com, click on Reviews, go to Laptops, then view all brands and select the HP check box. My goal is to scrape the editor and user reviews for each laptop on all the pages (1, 2, 3, 4, ... at the top). I would be very grateful if you could guide me in doing this. – Koushik Dec 31 '11 at 06:27
  • @koushik: 1. make sure that their TOS allows such use. 2. to go to 3rd page you could use: [`link = browser.find_element_by_link_text("3"); link.click()`](https://gist.github.com/f494384edecc1a6952e0). To get reviews save `browser.page_source` for each 1,2,3,4,5, etc pages and parse them for links later. 3. It might be simpler just to use RSS or API instead of scraping if available. – jfs Dec 31 '11 at 08:32
  • Thank you very much. I will try this out, and if I have anything else to ask I will get back to you. – Koushik Jan 02 '12 at 15:09
  • I couldn't extend this to parse the pages. Can you please help me parse the downloaded pages to get the desired information? – Koushik Jan 10 '12 at 01:50
  • @koushik: [update your question](http://stackoverflow.com/posts/8650999/edit) or [ask a new one](http://stackoverflow.com/questions/ask): explain what you have tried and what doesn't work; be specific. – jfs Jan 10 '12 at 10:08
  • The code you gave me worked: I was able to download the web pages I needed. As I explained in the comment above, my goal is to scrape the user reviews and editor reviews, so I tried to extend your code to parse the downloaded pages. I could not do that, because I could not find a resource that explains how. Can you please extend the previous code to scrape out the data? Using this sample project, I can continue to work on my main project. Thank you very much in advance. – Koushik Jan 10 '12 at 22:37
  • @koushik: start by doing the necessary steps by hand. *Write down the sequence in plain English.* Try to translate each step to code. Break down a step into a sequence of smaller steps if you don't know how to translate the whole step at once. Ask a question if you're stuck. [here's an essay on how to ask questions in general](http://www.catb.org/~esr/faqs/smart-questions.html). – jfs Jan 11 '12 at 02:43
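
The comments above suggest saving `browser.page_source` for each page and parsing it later. As a starting point only, here is a minimal sketch of that parsing step with Beautiful Soup. It assumes the data sits in a div whose id is `content`; that id, the function name `extract_content`, and the sample HTML are all made up for illustration, so the real page has to be inspected to find the right selector.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def extract_content(html, div_id="content"):
    """Return the text and link targets inside the div with the given id.

    The id "content" is a placeholder, not taken from any real site.
    Returns None when the page has no such div.
    """
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id=div_id)
    if div is None:
        return None
    text = div.get_text(" ", strip=True)  # visible text, whitespace collapsed per string
    links = [a["href"] for a in div.find_all("a", href=True)]
    return {"text": text, "links": links}


# Hypothetical page source standing in for a saved browser.page_source.
sample = """
<html><body>
  <div id="nav"><a href="/page/2">2</a></div>
  <div id="content">
    <h2>Sample review</h2>
    <p>Good battery life. <a href="/reviews/1">Read more</a></p>
  </div>
</body></html>
"""

result = extract_content(sample)
print(result["text"])
print(result["links"])
```

Only links inside the chosen div are collected, so navigation links such as the numbered page links are skipped. To select by class instead of id, `soup.find("div", class_="...")` works the same way.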