
I am trying to scrape every row of table no. 8 at the following URL: "https://www.screener.in/company/HCLTECH/consolidated/"

library(rvest)

webpage <- "https://www.screener.in/company/HCLTECH/consolidated/"
Webpage <- read_html(webpage)
# select the 8th <table> on the page and parse it into a data frame
CF <- Webpage %>%
  html_nodes("table") %>%
  .[8] %>%
  html_table(fill = TRUE)

I only get the following output, not the full set of table rows, some of which are collapsed on the webpage. How can I scrape the collapsed rows of the HTML table?

[image: Output table]

Catool
  • What do you mean by *entire table rows*? When I visit that website I see the same four rows in the cash-flow table as in your code. – Birger Sep 22 '18 at 18:14
  • Please click on the + sign on the respective rows to expand the columns. – Catool Sep 22 '18 at 18:30
  • When you click the `+` sign, the site makes an AJAX/XHR call to an endpoint whose HTTP path starts with `/api/`. The site's [`robots.txt`](https://www.screener.in/robots.txt) **expressly forbids** working with those URLs in an automated fashion in the first two lines of that file (which is a legal, technical control). Hitting that endpoint programmatically is, therefore, a violation of the site [Terms](https://www.screener.in/guides/terms/), and asking others to help you violate those terms could put them in harm's way. – hrbrmstr Sep 22 '18 at 18:34
  • I wasn't aware of the violation though. Just wanted to compile some stuff. Thanks for pointing it out. – Catool Sep 23 '18 at 02:56

1 Answer


I used RSelenium to press those plus signs to expand the table. Here is my try:

library(rvest)
library(RSelenium)

# initialize RSelenium (shell() is Windows-only; on other platforms use system() instead)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
Sys.sleep(5)
remDr$open()
Sys.sleep(5)

# define and navigate to url
url <-"https://www.screener.in/company/HCLTECH/consolidated/"
remDr$navigate(url)

# click the plus buttons
plus_buttons <- remDr$findElements(using = 'css selector', "#cash-flow button.show-schedules.button-link")
for (plus_button in plus_buttons) {
  plus_button$clickElement()
}

# the expanded rows are fetched asynchronously, so wait briefly before reading the page
Sys.sleep(2)

# print the table
remDr$getPageSource(header = TRUE)[[1]] %>%
  read_html() %>%
  html_node("#cash-flow .data-table") %>%
  html_table()

However, as @hrbrmstr has pointed out, check the site's terms and make sure you are respecting them. In my solution, I print the table instead of storing it, so I'm not 'copying' anything from their website.
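For checking a path against a site's `robots.txt` before automating anything, here is a minimal base-R sketch. Note the assumptions: `is_disallowed()` is a hypothetical helper written for this post, not part of any package; it only does a simplified prefix match for the `User-agent: *` group and is not a full implementation of the robots exclusion rules. The `sample_robots` lines and the `/api/example` path are illustrative stand-ins, not the site's actual file or endpoints.

```r
# Simplified robots.txt check (hypothetical helper): returns TRUE when `path`
# starts with a Disallow prefix listed under "User-agent: *".
is_disallowed <- function(robots_lines, path) {
  # drop comments and blank lines
  lines <- trimws(sub("#.*$", "", robots_lines))
  lines <- lines[nzchar(lines)]

  in_star_group <- FALSE
  for (line in lines) {
    # split "Field: value" at the first colon only
    parts <- regmatches(line, regexpr(":", line), invert = TRUE)[[1]]
    if (length(parts) < 2) next
    field <- tolower(trimws(parts[1]))
    value <- trimws(parts[2])

    if (field == "user-agent") {
      # simplification: only the most recent User-agent line is considered
      in_star_group <- identical(value, "*")
    } else if (field == "disallow" && in_star_group && nzchar(value)) {
      if (startsWith(path, value)) return(TRUE)
    }
  }
  FALSE
}

# Illustrative input only; for the real file you could try
# readLines("https://www.screener.in/robots.txt")
sample_robots <- c("User-agent: *", "Disallow: /api/")

api_blocked  <- is_disallowed(sample_robots, "/api/example")
page_blocked <- is_disallowed(sample_robots, "/company/HCLTECH/consolidated/")
```

With the sample lines above, `api_blocked` is `TRUE` and `page_blocked` is `FALSE`. A `robots.txt` check is only one part of the picture, so the site's Terms page still needs to be read separately.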

Hope it helps! If you have any questions, just let me know.

Unai Sanchez
  • Thank you for the code. However, I am still new to programming and haven't used Selenium. I am getting the error "Error in java_check() : PATH to JAVA not found. Please check JAVA is installed." – Catool Sep 25 '18 at 02:29
  • I think this should be asked in a separate question. Do you have Java installed on your computer? Run `Sys.which("java")`; if you don't get a path to Java, you should start by installing it. You may also try installing the `rJava` package. – Unai Sanchez Sep 25 '18 at 07:14
  • @Catool check these links for alternative ways of starting `RSelenium`: [tutorial](https://ropensci.org/tutorials/rselenium_tutorial/) or [StackOverflow question](https://stackoverflow.com/questions/42468831/how-to-set-up-rselenium-for-r) – Unai Sanchez Oct 02 '18 at 08:07
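
The Java check suggested in the comments above can be sketched as follows. This is only a quick diagnostic under the assumption that the `java_check()` error comes from Java not being on the `PATH`; the messages are illustrative.

```r
# Look up the java executable on the PATH; Sys.which() returns "" when it is missing
java_path <- Sys.which("java")

if (nzchar(java_path)) {
  message("Java found at: ", java_path)
} else {
  message("Java not found on PATH; install a Java runtime and restart R.")
}
```

If no path is reported, installing Java and restarting the R session is the usual first step before retrying `wdman::selenium()`.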