I'm trying to scrape a website with infinite scrolling

Question

I have tried this in R, but I could not get it to work with infinite scrolling.

This is the reference link I used to get an idea of how to handle infinite scrolling with the Selenium package in Python. I'm quite new to Python, but I still tried to adapt the code from the reference post.

Here is the code for scraping in R

library(rvest)
library(xml2)   # xml_find_first(), xml_find_all(), xml_text()
library(plyr)   # llply(), ldply()

# Each URL must be a single unbroken string; stringsAsFactors = FALSE keeps them as character
uuu_df2 <- data.frame(x = c('http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
                            'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs',
                            'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'),
                      stringsAsFactors = FALSE)

    urlList <- llply(uuu_df2[,1], function(url){     

      this_pg <- read_html(url)

      results_count <- this_pg %>% 
        xml_find_first(".//span[@id='resultCount']") %>% 
        xml_text() %>%
        as.integer()

      if(!is.na(results_count) & (results_count > 0)){

        cards <- this_pg %>% 
          xml_find_all('//div[@class="SRCard"]')

        df <- ldply(cards, .fun=function(x){
          y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                          excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                          locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                          society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
          return(y)
        })

      } else {
        df <- NULL
      }

      return(df)   
    }, .progress = 'text')
    names(urlList) <- uuu_df2[,1]

And this is the Python code for infinite scrolling, which I tried to adapt from the original post:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import sys

import unittest, time, re

class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs"
        self.verificationErrors = []
        self.accept_next_alert = True
    def test_sel(self):
        driver = self.driver
        delay = 3
        driver.get(self.base_url + "/search?q=stckoverflow&src=typd")
        driver.find_element_by_link_text("All").click()
        for i in range(1,html_text(html_node(read_html(self.base_url,'a.act')))): #dummy code line to get the number of pages till it should loop till
            self.driver.execute_script(".//span[@class=agentNameh;")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')


if __name__ == "__main__":
    unittest.main()

But it gives me error:

execfile(filename, namespace)
  File "C:\Users\user\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "D:/Deepesh/All files/test_forCSVData.py", line 27
    self.driver.execute_script(".//span[@class="agentNameh;")

Any suggestions on what I should change in my Python/R code so that it handles the infinite scrolling? Any help would be much appreciated.

`self.driver.execute_script(".//span[@class="agentNameh;")` Something's not right with the argument in this function. You have three `"` in there; I suspect you're missing a backslash to escape the center one (or pass a raw string). Or perhaps you're trying to concatenate? In that case you're missing a `+` (and another `"`) — Zinki, Jun 19 '17 at 12:14
What do you think this `".//span[@class=agentNameh;"` *"JavaScript"* should do? What do you expect? of course if you write it correctly like `".//span[@class='agentNameh'];"` or `".//span[@class=agentNameh]"` or whatever... — Andersson, Jun 19 '17 at 12:20
@Andersson should give me name of the agent, which is under the property photo. — Andre_k, Jun 19 '17 at 12:23
@deepesh, if you want to get text value of element with `JavaScript` you should use something like `'return document.querySelector("span.agentName").childNodes[0].textContent'` — Andersson, Jun 19 '17 at 12:36
Gave me error : exec(compile(f.read(), filename, 'exec'), namespace) File "D:/Deepesh/All files/test_forCSVData.py", line 27 return document.querySelector("span.agentName").childNodes[0].textC‌ontent ^ SyntaxError: invalid character in identifier — Andre_k, Jun 19 '17 at 12:38
try to re-write it manually instead of copy/paste- some hidden symbols could be added while copying from SO — Andersson, Jun 19 '17 at 18:47
It still gives me an error: "Traceback (most recent call last): File "C:\Users\user\Anaconda3\lib\site-packages\selenium\webdriver\common\service.py", line 74, in start stdout=self.log_file, stderr=self.log_file) File "C:\Users\user\Anaconda3\lib\subprocess.py", line 707, in __init__ restore_signals, start_new_session)" — Andre_k, Jun 20 '17 at 04:40
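
Pulling the comments together, here is a minimal sketch of how the scrolling part might look. It is not a confirmed fix for this site. It assumes that the page loads more cards whenever the window is scrolled to the bottom, and that the agent name sits in `span.agentNameh` (the class name is taken from the question and has not been verified against the live page). The helper names `scroll_until_stable`, `scrape_agent_names`, and `AGENT_NAMES_JS` are made up for illustration. Note that `execute_script` expects JavaScript, not XPath, and that the JavaScript string uses single quotes on the outside so the double quotes inside need no escaping.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is anything with an execute_script(script) method, such as a
    Selenium WebDriver. Returns the number of scrolls that loaded new content.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):
        # Jump to the bottom; an infinite-scroll page should then fetch more items.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for the new items to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # height unchanged: assume nothing more will load
        last_height = new_height
        loaded += 1
    return loaded


# Single quotes outside, double quotes inside: no backslash escaping needed.
# The selector "span.agentNameh" is an assumption taken from the question.
AGENT_NAMES_JS = (
    'return Array.from(document.querySelectorAll("span.agentNameh"))'
    '.map(function (el) { return el.textContent.trim(); });'
)


def scrape_agent_names(url):
    """Open `url` in Firefox, scroll until no more content loads, return agent names."""
    from selenium import webdriver  # imported here so the helper above has no dependency

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        scroll_until_stable(driver)
        return driver.execute_script(AGENT_NAMES_JS)
    finally:
        driver.quit()
```

Calling `scrape_agent_names(...)` with one of the URLs from the question would then return a Python list of strings, provided Firefox and its driver can be started from Python. Comparing the page height before and after each scroll avoids having to know the number of pages in advance, which is what the dummy `range(...)` line in the question was trying to compute.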
