
I have seen other posts which show how to extract data from multiple webpages.

But the problem with my website is this: when I scroll down to see how many pages the data is divided into, the page automatically loads the next batch of results, making it impossible to identify the number of pages. I don't have enough knowledge of HTML and JavaScript to identify the attribute that triggers this behaviour, so I have found another way to get the number of pages. When the website loads in a browser it shows the total number of records; dividing that by 30 (the number of records per page) gives the page count. For example, if there are 90 records, then 90/30 = 3 pages.

Here is the code to get the number of records found on the page:

    active_name_data1 <- html_nodes(webpage, '.active')
    active1 <- html_text(active_name_data1)
    # keep only the digits from the first word, e.g. "90 results found" -> 90
    # (word() comes from stringr)
    as.numeric(gsub("[^\\d]+", "", word(active1[1], start = 1, end = 1), perl = TRUE))
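
From that record count, the number of pages follows directly. A minimal self-contained sketch, assuming 30 records per page as described above:

    library(rvest)    # read_html(), html_nodes(), html_text()
    library(stringr)  # word()

    records_per_page <- 30

    url <- 'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'
    webpage <- read_html(url)

    active1 <- html_text(html_nodes(webpage, '.active'))
    records <- as.numeric(gsub("[^\\d]+", "", word(active1[1], 1, 1), perl = TRUE))

    # e.g. 90 records / 30 per page = 3 pages; ceiling() covers a partial last page
    n_pages <- ceiling(records / records_per_page)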

Another approach is to get the attribute holding the page numbers, i.e.:

    url <- 'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'
    webpage <- read_html(url)
    active_data_html <- html_nodes(webpage, 'a.act')
    active <- html_text(active_data_html)

Here `active` gives me the page numbers, i.e. "1", " 2", " 3", " 4". So I'm unable to work out how to get the data from the active page and then iterate through the remaining pages to collect the entire data set.
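
For what it's worth, continuing from the snippet above, I imagine building the page URLs would look something like this — the `page=` query parameter is my assumption; I have not confirmed the site actually paginates this way:

    page_numbers <- as.integer(trimws(active))  # "1" " 2" " 3" " 4" -> 1 2 3 4

    # ASSUMPTION: '&page=N' selects page N; this is a guess, not a confirmed parameter
    page_urls <- paste0(url, '&page=', page_numbers)

    pages <- lapply(page_urls, read_html)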

Here is what I have tried (`uuu_df2` is the data frame with the multiple links I want to crawl):

    library(rvest)   # read_html(), %>%
    library(xml2)    # xml_find_first(), xml_find_all(), xml_text()
    library(plyr)    # llply(), ldply()
    library(dplyr)   # bind_rows()

    uuu_df2 <- data.frame(x = c('http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
                                'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs',
                                'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'),
                          stringsAsFactors = FALSE)

    urlList <- llply(uuu_df2[, 1], function(url){

      this_pg <- read_html(url)

      # total number of matching records, shown in <span id="resultCount">
      results_count <- this_pg %>%
        xml_find_first(".//span[@id='resultCount']") %>%
        xml_text() %>%
        as.integer()

      if(!is.na(results_count) && (results_count > 0)){

        # one card per listing on the currently loaded page
        cards <- this_pg %>%
          xml_find_all('//div[@class="SRCard"]')

        df <- ldply(cards, .fun = function(x){
          data.frame(wine     = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                     excerpt  = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                     locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                     society  = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .),
                     stringsAsFactors = FALSE)
        })

      } else {
        df <- NULL
      }

      return(df)
    }, .progress = 'text')

    names(urlList) <- uuu_df2[, 1]

    a <- bind_rows(urlList)

But this code only gives me the data from the active (first) page of each link; it does not iterate through the other pages.
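
What I think I need is an inner loop over the page numbers of each link, something like the sketch below. The `page=` query parameter is again my guess at how the site paginates (not confirmed); the selectors are the ones from my code above.

    library(rvest)  # read_html(), %>%
    library(xml2)   # xml_find_first(), xml_find_all(), xml_text()
    library(dplyr)  # bind_rows()

    scrape_all_pages <- function(url, per_page = 30){

      first_pg <- read_html(url)

      records <- first_pg %>%
        xml_find_first(".//span[@id='resultCount']") %>%
        xml_text() %>%
        as.integer()

      # skip links with no records, as before
      if(is.na(records) || records == 0) return(NULL)

      n_pages <- ceiling(records / per_page)

      # ASSUMPTION: '&page=N' selects page N; this parameter is a guess
      page_urls <- paste0(url, '&page=', seq_len(n_pages))

      bind_rows(lapply(page_urls, function(u){
        cards <- read_html(u) %>% xml_find_all('//div[@class="SRCard"]')
        bind_rows(lapply(cards, function(x){
          data.frame(wine     = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                     excerpt  = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                     locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                     society  = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .),
                     stringsAsFactors = FALSE)
        }))
      }))
    }

    # a <- bind_rows(lapply(uuu_df2[, 1], scrape_all_pages))

If the site ignores the page parameter (because the scrolling is driven by JavaScript), this would silently return the first page repeatedly, which is why I'd like confirmation of the right approach.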

P.S.: If a link has no records, the code skips it and moves on to the next link in the list.

Any suggestions on what changes should be made to the code would be helpful. Thanks in advance.

Andre_k
  • This is also called 'infinite scrolling'. Even though it's a question for Python, you might find this question/answer useful: https://stackoverflow.com/questions/12519074/scrape-websites-with-infinite-scrolling – KenHBS Jun 17 '17 at 13:52
  • @KenS. the link you mentioned refers to scraping done in Python; is there a similar way to do this in R? I already have the scraping code ready in R, I just need to figure out how to get the data on the next webpage. – Andre_k Jun 18 '17 at 20:01
  • @deepesh finally did you manage to scrape in R an infinite scroll webpage? – Lazarus Thurston Jan 11 '20 at 11:16
  • No @LazarusThurston, instead I shifted from R to Python for the scraping exercise. – Andre_k Jan 17 '20 at 09:01
  • Use `remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);")` or refer to https://stackoverflow.com/questions/31901072/scrolling-page-in-rselenium to scroll to the bottom of the page – Nad Pat Nov 26 '21 at 17:20

0 Answers