
When web scraping, I'm getting `{{price}}`. The browser shows the actual price, e.g. S/1800.00, but looking at the page source you see the literal `{{price}}`.

This only happens for precio.tarjeta; I get all the other variables correctly.


Code:

library(rvest)
library(purrr)
library(tidyverse)

urls <- list("https://www.oechsle.pe/tecnologia/televisores/?&optionOrderBy=OrderByScoreDESC&optionOrderBy=OrderByScoreDESC&O=OrderByScoreDESC&optionOrderBy=OrderByScoreDESC&page=1",
             "https://www.oechsle.pe/tecnologia/televisores/?&optionOrderBy=OrderByScoreDESC&optionOrderBy=OrderByScoreDESC&O=OrderByScoreDESC&optionOrderBy=OrderByScoreDESC&page=2")

h <- urls %>% map(read_html) # scrape once, parse as necessary


df <- map_dfr(h %>%
                map(~ .x %>%
                      html_nodes("div.product")), ~
                data.frame(
                  periodo = lubridate::year(Sys.Date()),
                  fecha = Sys.Date(),
                  ecommerce = "oeschle",
                  marca = .x %>% html_node(".brand") %>% html_text(),
                  producto = .x %>% html_node(".prod-name") %>% html_text(),
                  precio.antes = .x %>% html_node('.ListPrice') %>% html_text(),
                  precio.actual = .x %>% html_node('.BestPrice') %>% html_text(),
                  precio.tarjeta = .x %>% html_node('.tOhPrice') %>% html_text()
                ))

Update 1:

I'm noticing that the products repeat themselves, i.e. there is duplication of products: both URLs return the same items, even though page 1 and page 2 show different products in the browser.

Why?

  • As you said, the source contains `{{price}}`, which implies it's dynamic - it gets replaced by a JavaScript routine. If you want the page after it's been replaced, you might need to use something like Selenium (https://stackoverflow.com/questions/22204382/scraping-javascript-website). But I've never worked with it myself, so I can't give you any pointers. – Hobo Sep 17 '21 at 04:18
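
If you do go the browser-automation route the comment mentions, a minimal sketch with the RSelenium package might look like this (the browser, port and wait time are assumptions, not something from the question):

library(RSelenium)
library(rvest)

# Start a real browser, load the page and let the site's scripts replace {{price}}
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client
remDr$navigate("https://www.oechsle.pe/tecnologia/televisores/?&optionOrderBy=OrderByScoreDESC&optionOrderBy=OrderByScoreDESC&O=OrderByScoreDESC&optionOrderBy=OrderByScoreDESC&page=1")
Sys.sleep(5) # crude wait for the client-side templating to finish

# Hand the rendered HTML to rvest; .tOhPrice should now hold a value, not {{price}}
rendered <- read_html(remDr$getPageSource()[[1]])
rendered %>% html_nodes(".tOhPrice") %>% html_text()

remDr$close()
rD$server$stop()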

1 Answer


How to webscrape a {{variable}} container

The answer here is to spend some time working out how the page updates itself dynamically, by studying the page source, the various JS scripts it calls, and the network tab. You could skip straight to searching the network tab and hope to find what you want there, but you would lose out on learning a bit about templating, content providers, and how dynamic pages update.

What you are seeing is JavaScript templating. The content provider VTEX supplies both the templating and the various scripts which drive the updating of these "placeholders" with actual values, e.g. for {{price}} and {{percent}}.
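
You can see the templating directly from R: scraping the static HTML with the selector from the question returns the literal token rather than a rendered number (a small check, using only the question's own URL and selector):

library(rvest)

url <- "https://www.oechsle.pe/tecnologia/televisores/?&optionOrderBy=OrderByScoreDESC&optionOrderBy=OrderByScoreDESC&O=OrderByScoreDESC&optionOrderBy=OrderByScoreDESC&page=1"

# The HTML served to rvest still contains the un-rendered token, because the
# value is only filled in client-side by the VTEX scripts
read_html(url) |>
  html_element(".tOhPrice") |>
  html_text()
#> expect the literal "{{price}}" here rather than e.g. "S/ 1,800.00"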

For the purposes of getting the values you want, the important point is that an API endpoint is called with the product ids from the page, and the returned JSON contains the content you are after. You can replicate this request by dynamically extracting the ids and sending the same GET request to that API.
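
If you want to inspect that JSON before wiring up the full pipeline, you can call the same catalogue endpoint with a single product id (a sketch; the id 12345 below is a placeholder for a data-id scraped from a listing):

library(jsonlite)

# Hypothetical single-product lookup against the same VTEX search endpoint;
# swap 12345 for a real id taken from a listing's data-id attribute
one_product <- read_json("https://www.oechsle.pe/api/catalog_system/pub/products/search?fq=productId:12345&_from=0&_to=0&sc=1")

# The prices and the Teasers (card discount) sit under the commercial offer
# of the first seller of the first item
str(one_product[[1]]$items[[1]]$sellers[[1]]$commertialOffer, max.level = 1)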

With the aid of a helper function you can extract the discount and then subtract it from the internet price. Whilst the various other prices (and a lot more info) are returned by the API call, I settled on pulling the other prices from the initial GET request.


Here is an example for one url:

library(rvest)
library(tidyverse)
library(jsonlite)

get_oh_price <- function(item) {
  # Teasers holds the card ("Oh") promotion for a product, when one exists
  var <- item$items[[1]]$sellers[[1]]$commertialOffer$Teasers
  # No teaser means no card discount; otherwise dig the discount value out of
  # the nested k__BackingField structure returned by the API
  oh_discount <- 100 * ifelse(length(var) == 0, 0, var[[1]]$`<Effects>k__BackingField`$`<Parameters>k__BackingField`[[2]]$`<Value>k__BackingField`) |> as.numeric()
  return(tibble(id = item$productId, oh_discount))
}

api_prefix <- "https://www.oechsle.pe/api/catalog_system/pub/products/search?fq="
api_suffix <- "&_from=0&_to=49&sc=1"

url <- "https://www.oechsle.pe/tecnologia/televisores/?&optionOrderBy=OrderByScoreDESC&optionOrderBy=OrderByScoreDESC&O=OrderByScoreDESC&optionOrderBy=OrderByScoreDESC&page=1"

page <- read_html(url)

listings <- page |> html_elements("[id^=ResultItems_] li[layout]")

df <- map_dfr(listings, ~
data.frame(
  id = .x |> html_element(".product") |> html_attr("data-id"),
  name = .x |> html_element(".product") |> html_attr("data-name"),
  brand = .x |> html_element(".product") |> html_attr("data-brand"),
  category = .x |> html_element(".product") |> html_attr("data-cat"),
  link = .x |> html_element(".product") |> html_attr("data-link"),
  instock = .x |> html_element(".product") |> html_attr("data-stock"),
  antes = .x |> html_element(".ListPrice") |> html_text() |> str_replace_all("S/.\\s+|,|\\.", "") |> as.numeric(),
  internet = .x |> html_element(".BestPrice") |> html_text() |> str_replace_all("S/.\\s+|,|\\.", "") |> as.numeric(),
  currency = "S/"
))

# Build a single API request that asks for every product id scraped from the page
api_request <- paste0(api_prefix, paste(sprintf("productId:%s,", df$id), sep = "", collapse = ""), api_suffix)

product_data <- jsonlite::read_json(api_request)

discount_df <- map_dfr(product_data, get_oh_price)

df <- df |> inner_join(discount_df, by = "id")

# Card price: internet price minus the card discount, NA when there is no discount
df$oh_price <- map2_dbl(df$internet, df$oh_discount, .f = ~ ifelse(.y == 0, NA_integer_, .x - .y))

  • I'm analyzing this answer; these API calls are something new to me. So far I can see that the whole data frame could be formed from the API response, as you mentioned. Just one question: why does your answer return the prices without decimals? Instead of 3499.00 I get 349900, and instead of 2299.00 I get 229900. – Omar Gonzales Sep 18 '21 at 04:08
  • I removed the "," and the "." in order to handle the values as numeric, particularly for the subtraction operation. You can choose to do it differently or reformat afterwards (a sketch for restoring the decimals follows after these comments). – QHarr Sep 18 '21 at 04:18
  • I've tested the code from the answer with the URLs from the original question (page=1 and page=2), and they both return the same items, even though in the browser we can see the two pages have different items. Why could this be? – Omar Gonzales Sep 26 '21 at 02:55
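
If you prefer the original decimal form, one option (a sketch, assuming every scraped price carried exactly two decimal places, as the comment above describes) is to divide the cleaned columns by 100 once the arithmetic is done:

# Restore two-decimal prices after the subtraction
# (assumes the stripped strings always had exactly two decimals)
df <- df |>
  mutate(across(c(antes, internet, oh_price), ~ .x / 100))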