I have seen other posts that show how to extract data from multiple webpages, but my problem is different: I cannot tell how many pages the results are split into. When I scroll down to check, the site automatically loads the next batch of data, so the page count is never shown. I don't know HTML and JavaScript well enough to identify which attribute or call triggers that loading.

So I have found a way to work out the number of pages instead. When the website is loaded in a browser it shows the number of records found. Dividing that number by 30 (the number of records per page) and rounding up gives the number of pages, e.g. if there are 90 records, then 90/30 = 3 pages.
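As a minimal sketch of that arithmetic, assuming 30 records per page as I observed on the site (the function name pages_from_records is just something I made up for illustration):

```r
# Number of result pages from the total record count.
# ASSUMPTION: 30 records per page, as observed on the site.
pages_from_records <- function(n_records, per_page = 30L) {
  stopifnot(per_page > 0)
  # Round up so that a partly filled last page is still counted.
  as.integer(ceiling(n_records / per_page))
}

pages_from_records(90)   # 3
pages_from_records(95)   # 4: the last page holds only 5 records
pages_from_records(0)    # 0
```

Rounding up matters whenever the record count is not an exact multiple of 30.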
Here is the code to get the number of records found on that page:
library(stringr)  # for word()
active_name_data1 <- html_nodes(webpage, '.active')
active1 <- html_text(active_name_data1)
# Keep only the digits of the first word of the first match
as.numeric(gsub("[^\\d]+", "", word(active1[1], start = 1, end = 1), perl = TRUE))
Another approach is to read the page links directly to get the number of pages, i.e.
url='http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'
webpage <- read_html(url)
active_data_html <- html_nodes(webpage,'a.act')
active <- html_text(active_data_html)
Here active gives me the page numbers, i.e. "1" " 2" " 3" " 4".
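Turning those labels into a page count could look roughly like this. It is only a sketch: it hard-codes the values shown above instead of scraping them, and it assumes the pager lists every page, which I have not confirmed.

```r
# Page labels as scraped from the pager; note the stray leading spaces.
active <- c("1", " 2", " 3", " 4")

# Trim whitespace and convert to integers; anything non-numeric
# (e.g. a "Next" link, if the pager has one) becomes NA and is dropped.
page_numbers <- suppressWarnings(as.integer(trimws(active)))
page_numbers <- page_numbers[!is.na(page_numbers)]

# The highest page number is taken as the number of pages.
n_pages <- if (length(page_numbers) > 0) max(page_numbers) else 0L
n_pages
```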
So I am unable to work out how to get the data from the active page and then iterate over the remaining pages so as to get the entire data set.
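Roughly what I am after is one URL per result page that I can then read in a loop. The sketch below only builds such URLs. The helper build_page_urls and the query parameter name "page" are hypothetical: I do not know whether the site accepts a page parameter at all, or what it is called, and the example.com URL is a placeholder.

```r
# Build one URL per result page by appending a page-number query parameter.
# ASSUMPTION: the parameter name "page" is hypothetical; the real name (if the
# site supports paging through the URL at all) would have to be found by
# inspecting the requests the browser makes while scrolling.
build_page_urls <- function(base_url, n_pages, param = "page") {
  if (n_pages < 1) return(character(0))
  # Use "&" if the URL already has a query string, "?" otherwise.
  sep <- if (grepl("?", base_url, fixed = TRUE)) "&" else "?"
  paste0(base_url, sep, param, "=", seq_len(n_pages))
}

# Placeholder URL, for illustration only.
build_page_urls("http://example.com/search?city=Thane", 3)
```

Each resulting URL could then be passed to read_html() in the same way as the single links in my attempt.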
Here is what I have tried (uuu_df2 is the data frame holding the links I want to crawl):
library(rvest)
library(xml2)   # xml_find_first(), xml_find_all(), xml_text()
library(plyr)   # llply(), ldply(); load before dplyr
library(dplyr)  # bind_rows()
# stringsAsFactors = FALSE keeps the URLs as character strings
uuu_df2 <- data.frame(x = c('http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'),
stringsAsFactors = FALSE)
urlList <- llply(uuu_df2[,1], function(url){
  this_pg <- read_html(url)
  # Total number of records reported by the page
  results_count <- this_pg %>%
    xml_find_first(".//span[@id='resultCount']") %>%
    xml_text() %>%
    as.integer()
  if(!is.na(results_count) && (results_count > 0)){
    # One "SRCard" div per listing
    cards <- this_pg %>%
      xml_find_all('//div[@class="SRCard"]')
    df <- ldply(cards, .fun=function(x){
      y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),  # agent name
                      excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                      locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                      society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
      return(y)
    })
  } else {
    # No records for this link: skip it
    df <- NULL
  }
  return(df)
}, .progress = 'text')
names(urlList) <- uuu_df2[,1]
a <- bind_rows(urlList)
But this code only gives me the data from the active page of each link and does not iterate through the other pages of that link.
P.S.: If a link doesn't have any records, the code skips it and moves on to the next link in the list.
Any suggestion on what changes should be made to the code will be helpful. Thanks in advance.