I want to extract data from this web page: http://old.emmsa.com.pe/emmsa_spv/rpEstadistica/rptVolPreciosDiarios.php. The page uses JavaScript, and so far I have not found a way to extract the daily volume and price data.

I have tried many of the alternatives suggested on this site, but none have worked for me because the table is obtained in two steps.

I have also tried to adapt the code from https://www.r-bloggers.com/2020/04/an-adventure-in-downloading-books/, but I couldn't download the data.

My version is:

library(Rcrawler)

install_browser() # One time only

br <- run_browser()

page<-LinkExtractor(url="http://old.emmsa.com.pe/emmsa_spv/rpEstadistica/rptVolPreciosDiarios.php",
                    Browser = br, ExternalLInks = TRUE)


el <- page$InternalLinks
sprlnks <- el[grep("emmsa", el, fixed = TRUE)]

for (sprlnk in sprlnks) {
  spr_page <- LinkExtractor(sprlnk)
  il <- spr_page$InternalLinks
  ttl <- spr_page$Info$Title
  ttl <- trimws(strsplit(ttl, "|", fixed = TRUE)[[1]][1])
  chapter_link <- il[grep("chapter", il, fixed = TRUE)][1]
  chp_splits <- strsplit(chapter_link, "/", fixed = TRUE)
  n <- length(chp_splits[[1]])
  suff <- chp_splits[[1]][n]
  suff <- gsub(".{2}$", "", suff)
  pref <- chp_splits[[1]][n-1]
  final_url <- paste0("http://old.emmsa.com.pe/emmsa_spv/rpEstadistica/rptVolPreciosDiarios.php", pref, "/",
                      suff, ".php")
  print(final_url)
  download.file(final_url, paste0(ttl, ".php"), mode = "wb")
  Sys.sleep(5)
}

stop_browser(br)

I get a file named "Empresa Municipal de Mercados S.A.php" that is downloaded over and over, and in which line 294 appears.

Ultimately, I would like help writing a script that downloads the daily price and volume data from the "emmsa" website.

Paul Samsotha

1 Answer

You could make a POST request, as the page does, and parse the table out of the response:

library(httr)
library(rvest)
library(janitor)
library(dplyr)

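# The page submits its form as URL-encoded data
headers <- c("Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8")

# vid_tipo=1 requests prices; vfecha is the date as dd/mm/yyyy
data <- "vid_tipo=1&vprod=&vvari=&vfecha=15/06/2022"

r <- httr::POST(
  url = "http://old.emmsa.com.pe/emmsa_spv/app/reportes/ajax/rpt07_gettable.php",
  httr::add_headers(.headers = headers),
  body = data
)

# Parse the price table, promote its first row to column names,
# drop rows without a product, and convert the price columns to numeric
t <- content(r) %>%
  html_element(".timecard") %>%
  html_table() %>%
  row_to_names(1) %>%
  clean_names() %>%
  dplyr::filter(producto != "") %>%
  mutate_at(vars(matches("precio")), as.numeric)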

For the volume option (the HTML is different):

library(httr)
library(rvest)
library(janitor)
library(dplyr)

headers <- c("Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8")

# vid_tipo=2 requests volume instead of prices
data <- "vid_tipo=2&vprod=&vvari=&vfecha=17/06/2022"

r <- httr::POST(
  url = "http://old.emmsa.com.pe/emmsa_spv/app/reportes/ajax/rpt07_gettable.php",
  httr::add_headers(.headers = headers),
  body = data
)

# The volume table has the id tbReport and already carries its own header row
t <- content(r) %>%
  html_element("#tbReport") %>%
  html_table() %>%
  clean_names()
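
Since the question asks for daily data, the two requests above could be wrapped in one helper and called once per date. This is only a rough sketch: `emmsa_body()`, `emmsa_fetch()`, and the `fecha_consulta` column are hypothetical names introduced for illustration, and the sketch assumes the endpoint keeps returning `.timecard` for prices and `#tbReport` for volume.

```r
# Sketch only: emmsa_body() and emmsa_fetch() are hypothetical helper names.
emmsa_url <- "http://old.emmsa.com.pe/emmsa_spv/app/reportes/ajax/rpt07_gettable.php"

# Build the form body; tipo = 1 is prices and tipo = 2 is volume, as above.
emmsa_body <- function(tipo, fecha) {
  paste0("vid_tipo=", tipo, "&vprod=&vvari=&vfecha=",
         format(as.Date(fecha), "%d/%m/%Y"))
}

# Request one day and return its table, or NULL if no table came back.
emmsa_fetch <- function(tipo, fecha) {
  r <- httr::POST(
    url = emmsa_url,
    httr::add_headers("Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8"),
    body = emmsa_body(tipo, fecha)
  )
  httr::stop_for_status(r)
  # Assumption: prices live in .timecard and volume in #tbReport
  selector <- if (tipo == 1) ".timecard" else "#tbReport"
  node <- rvest::html_element(httr::content(r), selector)
  # A missing node is what triggers the "xml_missing" error in html_table()
  if (inherits(node, "xml_missing")) return(NULL)
  tbl <- rvest::html_table(node)
  # Only the price table needs its first row promoted to column names
  if (tipo == 1) tbl <- janitor::row_to_names(tbl, 1)
  janitor::clean_names(tbl)
}

# Example (needs network access; pauses between requests):
# fechas <- seq(as.Date("2022-06-13"), as.Date("2022-06-17"), by = "day")
# precios <- lapply(fechas, function(f) { Sys.sleep(1); emmsa_fetch(1, f) })
# names(precios) <- format(fechas)
# precios <- dplyr::bind_rows(precios, .id = "fecha_consulta")
```

Returning `NULL` for a day with no table lets `bind_rows()` skip it. If column types differ between days, the tables may need converting to a common type before binding.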
QHarr
  • Thank you, this is fantastic! With your help I was able to find the specific answer I needed. – Carlos Garibotto Jun 16 '22 at 15:26
  • please see edit to answer – QHarr Jun 18 '22 at 04:54
  • I tried to adapt your code to get the "volumen" variable: I set `data <- "vid_tipo=2&vprod=&vvari=&vfecha=15/06/2022"` and replaced price with volume in the last line of the `t` chunk, `mutate_at(vars(matches("volumen")), as.numeric)`, but I get: Error in UseMethod("html_table") : no applicable method for 'html_table' applied to an object of class "xml_missing" – Carlos Garibotto Jun 21 '22 at 23:38
  • For volume you just run the code as I wrote it. The html and the headers are different. – QHarr Jun 21 '22 at 23:45
  • Thank you @QHarr, I really needed to find the answer. I don't know anything about web scraping. "Master", you are on another level! – Carlos Garibotto Jun 22 '22 at 17:08