I'm trying to scrape a site with ten pages. I don't know how to write a loop to scrape all the pages, so I wrote a function that lets me just change the link.

See the function:

```r
link = "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=1&Documento=117&Modulo=8&AnoInicial=2022"

scraper <- function(link){
  page = read_html(link)
  titulo = page %>% html_nodes("h4 a") %>% html_text()
  tipo = page %>% html_nodes("h4+ .row .col-md-4") %>% html_text()
  data = page %>% html_nodes("p.col-md-6") %>% html_text()
  protocolo = page %>% html_nodes(".row:nth-child(3) .col-md-4") %>% html_text()
  situacao = page %>% html_nodes(".row~ .row+ .row p.col-md-4:nth-child(1)") %>% html_text()
  regime = page %>% html_nodes("p.col-md-4:nth-child(2)") %>% html_text()
  quorum = page %>% html_nodes(".col-md-4~ .col-md-4+ .col-md-4") %>% html_text()
  autoria = page %>% html_nodes(".row:nth-child(5) .col-md-12") %>% html_text()
  assunto = page %>% html_nodes(".row:nth-child(6) .col-md-12") %>% html_text()

  result <- data.frame(titulo, tipo, data, protocolo, situacao, regime, quorum, autoria, assunto)
}
```

But when I run the function nothing happens.

  • add `return(result)` after result. Right now the function does not return anything. – Jamie Dec 01 '22 at 17:10
  • Or better, just remove the `result <-` bit; return statements are not needed in R. – user438383 Dec 01 '22 at 17:13
  • @Jamie and user438383 I don't think that's the problem. Absent an explicit `return`, a function will return the last line. Define `foo <- function(x) {result <- x + 1}` and run `y <- foo(1)` and you will see that `y` is `2`, as expected, though with the `result <-` it is returned invisibly. I do agree that style-wise the function would be better without the `result <-`. – Gregor Thomas Dec 01 '22 at 18:23
  • Lucas, please show the code for how you are calling the function. Are you assigning it to an object? What does your attempt at a loop look like? – Gregor Thomas Dec 01 '22 at 18:24
  • @GregorThomas my function works now. I didn't write a loop because I don't know how to do it. I thought about a loop because otherwise I will have to copy and paste the function call for each URL that I want to scrape. – Lucas Esteves Dec 01 '22 at 18:41
  • If you have a vector of links, call it `my_links`, then `my_results <- lapply(scraper, my_links)` will give you a `list` of results calling the function on each link. – Gregor Thomas Dec 01 '22 at 18:45
  • If you need help generating all the links you need to scrape, then you should give us more info about that. – Gregor Thomas Dec 01 '22 at 18:48
  • @GregorThomas I did what you said about `my_results <- lapply(scraper, my_links)`, but I get this error: "Error in get(as.character(FUN), mode = "function", envir = envir) : object 'my_links' of mode 'function' was not found". – Lucas Esteves Dec 01 '22 at 19:22
  • @LucasEsteves you need to switch the inputs in lapply, i.e. `my_results <- lapply(my_links, scraper)` – Jamie Dec 01 '22 at 19:30
  • Oops, my bad on the order of arguments. Jamie got it! – Gregor Thomas Dec 01 '22 at 19:31
  • Thanks @GregorThomas and @Jamie, but there is another problem now. I ran `my_results <- sapply(my_links, scraper)` and the result is not a data frame but a list. – Lucas Esteves Dec 01 '22 at 19:43
  • Right. If they have the same structure you can combine the list of data frames into one big data frame with `dplyr::bind_rows()`. – Gregor Thomas Dec 01 '22 at 19:46
  • @GregorThomas, thanks for the feedback. I think I've provided information indicating how `<<-` is different from `return()`. However, I see how that could be a problem for new users. Hence, I will delete that comment. I would also recommend you be a bit more polite in your reply and not include judgmental language. – Ruam Pimentel Dec 01 '22 at 21:35
  • @RuamPimentel I apologize for using language that was judgmental about you and not just about `<<-`. I've deleted my comment as well. I do feel very strongly that using `<<-` for global assignment is bad practice and should **never** be recommended to new R users without a lengthy warning about its many risks and shortcomings. This case is especially inappropriate for `<<-` given that OP wants to use their function in a loop to produce multiple results. – Gregor Thomas Dec 02 '22 at 00:57
  • That `<<-` is bad in cases like this is a [widely](https://stackoverflow.com/a/5785757/903061) [shared](https://stackoverflow.com/a/17576073/903061) [belief](https://stat.ethz.ch/pipermail/r-help/2011-April/275905.html) [among](https://adv-r.hadley.nz/environments.html#super-assignment--) [expert](https://modernstatisticswithr.com/progchapter.html#fnref41) [R users](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf). – Gregor Thomas Dec 02 '22 at 00:57
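
The points raised in the comments can be put together in a minimal sketch. It is only an illustration: `fake_scraper()` and the `example.com` links are hypothetical stand-ins so that the code runs without touching the network, and `do.call(rbind, ...)` is used as a base R alternative to the `dplyr::bind_rows()` call suggested above.

```r
# Hypothetical stand-in for scraper(): builds a small data frame per link
# without any network access. Because the last expression is an assignment,
# the value is returned invisibly, as in the function from the question.
fake_scraper <- function(link) {
  result <- data.frame(link = link, n_chars = nchar(link),
                       stringsAsFactors = FALSE)
}

# Placeholder links (assumption: any character vector of URLs works here).
my_links <- c("https://example.com/?Pagina=1", "https://example.com/?Pagina=2")

fake_scraper(my_links[1])              # prints nothing: the value is invisible
one_page <- fake_scraper(my_links[1])  # but it is returned and can be assigned

# lapply() takes the data first and the function second.
my_results <- lapply(my_links, fake_scraper)

# Combine the list of data frames into one data frame.
all_pages <- do.call(rbind, my_results)
```

`lapply()` always returns a list with one element per link, so the binding step is what turns the results into a single data frame. This only works if every element has the same columns.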

1 Answer

Scraping the first 5 pages into a tibble

```r
library(tidyverse)
library(rvest)

# Scrape one results page and return its documents as a tibble.
get_content <- function(page) {
  # Build the URL for the requested page and keep only the result list.
  content <-
    str_c(
      "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74?Pesquisa=Simples&Pagina=",
      page,
      "&Documento=117&Modulo=8&AnoInicial=2022"
    ) %>%
    read_html() %>%
    html_elements(".data-list-hover")

  # One column per field; tibble() needs all columns to have the same length.
  tibble(
    titulo = content %>% html_elements("h4 a") %>% html_text2(),
    tipo = content %>% html_elements("h4+ .row .col-md-4") %>% html_text2(),
    data = content %>% html_elements("p.col-md-6") %>% html_text2(),
    protocolo = content %>% html_elements(".row:nth-child(3) .col-md-4") %>% html_text2(),
    situacao = content %>% html_elements(".row~ .row+ .row p.col-md-4:nth-child(1)") %>% html_text2(),
    regime = content %>% html_elements("p.col-md-4:nth-child(2)") %>% html_text2(),
    quorum = content %>% html_elements(".col-md-4~ .col-md-4+ .col-md-4") %>% html_text2(),
    autoria = content %>% html_elements(".row:nth-child(5) .col-md-12") %>% html_text2(),
    assunto = content %>% html_elements(".row:nth-child(6) .col-md-12") %>% html_text2()
  ) %>%
    # Drop carriage returns and collapse repeated whitespace in every column.
    mutate(across(everything(), ~ str_remove_all(.x, "\r") %>% str_squish()))
}

# Scrape pages 1 to 5 and row-bind the results into a single tibble.
map_dfr(1:5, get_content)
```
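
If you would rather keep the `scraper(link)` function from the question, the page links can be generated up front in the same way. The following is a rough sketch in base R: `build_link()` is a hypothetical helper name, the URL pieces are copied from the question, and the range `1:10` assumes the ten pages mentioned there.

```r
# Base URL copied from the question; only the Pagina value changes.
base_url <- "https://santabarbara.siscam.com.br/Documentos/Pesquisa/74"

# Hypothetical helper: build the search URL for a given page number.
build_link <- function(page) {
  paste0(
    base_url,
    "?Pesquisa=Simples&Pagina=", page,
    "&Documento=117&Modulo=8&AnoInicial=2022"
  )
}

# paste0() is vectorised, so this yields one URL per page.
my_links <- build_link(1:10)
```

The resulting `my_links` vector could then be passed to the question's function with `lapply(my_links, scraper)`. Nothing in this sketch contacts the site.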
Chamkrai