
I have a div with two p tags.

I need to get the text of the second of these p elements.

<div class="fb-price-list">
      <p class="fb-price">S/  1,699 (Internet)</p>
      <p class="fb-price">S/  2,399 (Normal)</p>
</div>

expected result:

S/  2,399 (Normal)

I have this, but it is not working:

tvs_url <- read_html("https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1")

product_price_actual <- tvs_url %>% 
  html_nodes('div.pod-group pod-group__large-pod div.pod-body div.fb-price-list p.fb-price:nth-child(2)') %>%
  html_text()

html:

<div class="pod-item"><div class="fb-form__input--checkbox fb-pod__item__compare"><input id="fb-pod__item__input-16754140" class="fb-pod__item__compare__input" type="checkbox" name="fb-pod__item__input-16754140" value="16754140"><label for="fb-pod__item__input-16754140" class="fb-pod__item__compare__label">Comparar</label></div><div class="pod-head"><a class="pod-head__image" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="content__image"><img src="//falabella.scene7.com/is/image/FalabellaPE/16754140?wid=544&amp;hei=544&amp;qlt=70&amp;anchor=750,750&amp;crop=0,0,0,0" alt="img" class="image"></div></a><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140" class="pod-head__stickerslink"><div class="pod-head__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div></a></div><div class="pod-body"><a class="section__pod-top" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="section__pod-top-brand">SAMSUNG</div><div class="section__pod-top-title"><div class="LinesEllipsis  ">LED UHD 4K 55" Smart TV UN55RU7100GXPE SERIE RU7100<wbr></div></div></a><div class="section__pod-middle"><div class="section__pod-middle-content__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div><div class="section__information"><a class="section__information-link" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="fb-price-list"><p class="fb-price">S/  1,699 (Internet)</p><p class="fb-price">S/  2,399 (Normal)</p></div></a></div><div class="section__pod-middle-content__button"><button class="btn-add-to-basket">AGREGAR A TU BOLSA</button></div></div><div class="section__pod-bottom"><div 
class="fb-pod__rating" style="visibility: hidden;"><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140#comments"><div class="fb-rating-stars"><div class="fb-rating-stars__container"><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><p class="fb-rating-stars__count">0 <span class="fb-rating-stars__count__max"> / 5</span></p></div></div></a></div><a class="section__pod-bottom-descriptionlink" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><ul class="section__pod-bottom-description"><li>Modelo: UN55RU7100GXPE</li><li>Tamaño de la pantalla: 55"</li><li>Resolución: 4K Ultra HD</li><li>Tecnología: Led</li><li>Conexión bluetooth: Sí</li></ul></a></div></div></div>

UPDATE 1:

Based on the chosen answer, I've used ifelse to check the number of characters at a given position:

The position to check is the 4th. When there is no precio_antes (previous price), this position is occupied by another element, so we need to put NA in those cases:

ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6))

How I'm building the final df:

df <- data.frame(
    brand = sapply(splitted, "[", 2), #We don't need the "comparar" text so we start from 2
    product = sapply(splitted, "[", 3),
    precio_antes = ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6)),
    precio_actual = ifelse(nchar(sapply(splitted, "[", 4))<=3, sapply(splitted, "[", 5), sapply(splitted, "[", 4))
  )
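
A minimal sketch of the same check on mock data, assuming each element of splitted is a character vector whose 4th entry is a short discount label such as "29%" when a previous price exists. The mock vectors and the helper nth() are hypothetical and only illustrate the logic:

```r
# Mock data (assumption): the layout of each vector is illustrative only.
splitted <- list(
  c("Comparar", "SAMSUNG", "LED UHD 4K 55 Smart TV", "29%",
    "S/ 1,699 (Internet)", "S/ 2,399 (Normal)"),
  c("Comparar", "LG", "LED FHD 43 Smart TV",
    "S/ 2,199 (Internet)", "AGREGAR A TU BOLSA")
)

# Hypothetical helper: n-th entry of each vector, NA when it does not exist.
nth <- function(x, n) vapply(x, function(v) v[n], character(1))

# A 4th entry longer than 3 characters is taken to be a price, not a discount.
has_discount <- nchar(nth(splitted, 4)) <= 3

df <- data.frame(
  brand = nth(splitted, 2),
  product = nth(splitted, 3),
  precio_antes = ifelse(has_discount, nth(splitted, 6), NA),
  precio_actual = ifelse(has_discount, nth(splitted, 5), nth(splitted, 4)),
  stringsAsFactors = FALSE
)
print(df)
```

Computing the test once and reusing it keeps the two price columns consistent with each other.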
Omar Gonzales
  • Did any of the answer work for you on your actual URL because when I apply it to `"https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1" %>% read_html() %>% html_nodes(".fb-price-list p:nth-child(2)") %>% html_text()` I get `character(0)` – Ronak Shah Jun 28 '19 at 10:58
  • @RonakShah you're right. It does not work on the URL, only on the part of the HTML I've handled. Weird. Please let me know if you can help me out with this. – Omar Gonzales Jun 28 '19 at 12:15
  • Are you open to using RSelenium? – Tonio Liebrand Jun 29 '19 at 08:20
  • @BigDataScientist yes, I'd use RSelenium, – Omar Gonzales Jun 29 '19 at 15:51

4 Answers


Since you are also considering RSelenium, here is a solution using that package.

You can find the elements via XPath, for example. In your case the XPath would be: /html/body/div/main/div/div/div/section/div/div/div/div/div/a/div/p.

It is similar to the solution of @gersht, but using RSelenium only.

Reproducible example:

library(RSelenium)

rD <- rsDriver() 
remDr <- rD$client

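# URL taken from the question
url <- "https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1"
remDr$navigate(url)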
priceElems = remDr$findElements(
  using = "xpath", 
  value = "/html/body/div/main/div/div/div/section/div/div/div/div/div/a/div[@class = 'fb-price-list']"
)

rawPrices = sapply(
  X = priceElems, 
  FUN = function(elem) elem$getElementText()
)

splitted = sapply(
  X = rawPrices, 
  FUN = strsplit, 
  split = "\nS/"
)

prices = data.frame(
  internetPrices = sapply(splitted, "[", 1),
  normalPrices = sapply(splitted, "[", 2)
)

Result / output:

> head(prices, 8)
       internetPrices    normalPrices
1 S/ 1,099 (Internet)  1,599 (Normal)
2 S/ 2,299 (Internet)  3,999 (Normal)
3 S/ 1,699 (Internet)  2,399 (Normal)
4   S/ 999 (Internet)  1,149 (Normal)
5   S/ 999 (Internet)  1,399 (Normal)
6 S/ 1,399 (Internet)  1,699 (Normal)
7 S/ 2,199 (Internet)            <NA>
8 S/ 2,699 (Internet)  4,999 (Normal)

Setup:

If needed, see here for how to set up RSelenium: How to set up rselenium for R?.

Edit:

Following the remark in the comment about also capturing empty elements, I would get the parent element and then work on the text of the prices.

The parent element is /html/body/div/main/div/div/div/section/div/div/div/div/div/a/div[@class = 'fb-price-list'] and contains an empty string if one of the prices is not available.
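
As a rough illustration of working on the parent element's text, here is a sketch on mock strings. It assumes getElementText() returns both prices separated by a newline when both exist; the sample strings are made up:

```r
# Mock parent-element texts (assumption): one product has no "Normal" price.
raw_prices <- c(
  "S/ 1,699 (Internet)\nS/ 2,399 (Normal)",
  "S/ 2,199 (Internet)"
)

# Split each text at the newline that precedes the second "S/".
parts <- strsplit(raw_prices, split = "\nS/", fixed = TRUE)

# Indexing past the end of a vector yields NA, which marks the missing price.
prices <- data.frame(
  internet = vapply(parts, function(p) p[1], character(1)),
  normal = vapply(parts, function(p) trimws(p[2]), character(1)),
  stringsAsFactors = FALSE
)
print(prices)
```

Because the split consumes the second "S/", the normal price comes back without its currency prefix, as in the output above.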

Tonio Liebrand
  • **How are you specifying the xpath for the second `p` tag?** In the `xpath` provided, I see just 1 `p` tag. I need to get the text from both p tags inside the html I've provided in the answer and make a `data frame` with the elements (also considering that sometimes there may be no text inside one of the p tags; in that case I'd need an NA to fill that column's row). – Omar Gonzales Jun 29 '19 at 17:05
  • See my edit above. If you want to capture "missing prices", I would go for the parent element. – Tonio Liebrand Jun 29 '19 at 17:21
  • Thank you, based on your answer I've done what I needed. See my update for how I check whether there is a `precio_antes`. – Omar Gonzales Jun 30 '19 at 02:48

Here I use css to select nodes with class fb-price-list and then select the 2nd p child:

library(rvest)

"<div class=\"pod-item\"><div class=\"fb-form__input--checkbox fb-pod__item__compare\"><input id=\"fb-pod__item__input-16754140\" class=\"fb-pod__item__compare__input\" type=\"checkbox\" name=\"fb-pod__item__input-16754140\" value=\"16754140\"><label for=\"fb-pod__item__input-16754140\" class=\"fb-pod__item__compare__label\">Comparar</label></div><div class=\"pod-head\"><a class=\"pod-head__image\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"content__image\"><img src=\"//falabella.scene7.com/is/image/FalabellaPE/16754140?wid=544&amp;hei=544&amp;qlt=70&amp;anchor=750,750&amp;crop=0,0,0,0\" alt=\"img\" class=\"image\"></div></a><a href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\" class=\"pod-head__stickerslink\"><div class=\"pod-head__stickers\"><div class=\"fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff\" data-discount-content=\"\">29%</div></div></a></div><div class=\"pod-body\"><a class=\"section__pod-top\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"section__pod-top-brand\">SAMSUNG</div><div class=\"section__pod-top-title\"><div class=\"LinesEllipsis  \">LED UHD 4K 55\" Smart TV UN55RU7100GXPE SERIE RU7100<wbr></div></div></a><div class=\"section__pod-middle\"><div class=\"section__pod-middle-content__stickers\"><div class=\"fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff\" data-discount-content=\"\">29%</div></div><div class=\"section__information\"><a class=\"section__information-link\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"fb-price-list\"><p class=\"fb-price\">S/  1,699 (Internet)</p><p class=\"fb-price\">S/  2,399 (Normal)</p></div></a></div><div class=\"section__pod-middle-content__button\"><button 
class=\"btn-add-to-basket\">AGREGAR A TU BOLSA</button></div></div><div class=\"section__pod-bottom\"><div class=\"fb-pod__rating\" style=\"visibility: hidden;\"><a href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140#comments\"><div class=\"fb-rating-stars\"><div class=\"fb-rating-stars__container\"><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><p class=\"fb-rating-stars__count\">0 <span class=\"fb-rating-stars__count__max\"> / 5</span></p></div></div></a></div><a class=\"section__pod-bottom-descriptionlink\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><ul class=\"section__pod-bottom-description\"><li>Modelo: UN55RU7100GXPE</li><li>Tamaño de la pantalla: 55\"</li><li>Resolución: 4K Ultra HD</li><li>Tecnología: Led</li><li>Conexión bluetooth: Sí</li></ul></a></div></div></div>" %>% 
  read_html() %>% 
  html_nodes(".fb-price-list p:nth-child(2)") %>% 
  html_text()
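
The same selector on a smaller, made-up document may be easier to follow. This sketch assumes rvest is installed; the mock HTML, including a product without a second price, is hypothetical:

```r
library(rvest)

# Mock HTML (assumption): the second product has no "Normal" price.
page <- read_html(paste0(
  '<div class="fb-price-list">',
  '<p class="fb-price">S/  1,699 (Internet)</p>',
  '<p class="fb-price">S/  2,399 (Normal)</p>',
  '</div>',
  '<div class="fb-price-list">',
  '<p class="fb-price">S/  2,199 (Internet)</p>',
  '</div>'
))

# Selecting across the whole page drops products without a second p.
all_second <- page %>%
  html_nodes(".fb-price-list p:nth-child(2)") %>%
  html_text()

# Selecting inside each price list keeps one entry per product.
per_list <- page %>%
  html_nodes(".fb-price-list") %>%
  html_node("p:nth-child(2)") %>%
  html_text()

print(all_second)
print(per_list)
```

The per-list variant is the one to prefer when the result has to line up row by row with other columns, since it should return a missing value rather than skipping the product.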
the-mad-statter

tl;dr

The content is loaded dynamically, but it is available in the page source as a string. The source is a JavaScript dictionary, which can be extracted with a regex and then parsed with a JSON parser. This is the JSON currently extracted.

If you press F12 to open dev tools and inspect the page HTML, you will see the script tag housing the JavaScript dictionary. You could target that script tag, extract the text from the node, and take a substring, but I prefer a regex on a string, so I extract the body as a string. Regex is not normally recommended for HTML, but it is fine for strings.


Code output:

json$state$searchItemList$resultList$prices

gives you a list of length 32 made up of data frames. Within each data frame, originalPrice holds the info you want (the row where the label column == (Normal)).

Not every item has an original price. Following is a simple, not necessarily most efficient, way of writing out values:

l <- json$state$searchItemList$resultList$prices

for (i in l){
  if (length(i$originalPrice)>1){
    print(i$originalPrice[2])
  } else {
    print("No original price")
  }
}

R

library(rvest)
library(jsonlite)
library(stringr)

url = 'https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1'
r <- read_html(url) %>%
  html_node('body') %>%
  html_text() %>%
  toString()
x <- str_match_all(r,'fbra_browseProductListConfig = (.*);')
json <- jsonlite::fromJSON(x[[1]][,2])
print(json$state$searchItemList$resultList$prices)
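
To see the extraction and the missing-price handling without hitting the site, here is a sketch on a mock string. The JSON below is a tiny hypothetical stand-in that only mirrors the path used above; the real payload is much larger and may differ:

```r
library(jsonlite)
library(stringr)

# Mock page text (assumption): a small stand-in for the real script contents.
page_text <- paste0(
  'var fbra_browseProductListConfig = {"state":{"searchItemList":{"resultList":[',
  '{"brand":"SAMSUNG","prices":[',
  '{"label":"(Internet)","originalPrice":"1,699"},',
  '{"label":"(Normal)","originalPrice":"2,399"}]},',
  '{"brand":"LG","prices":[',
  '{"label":"(Internet)","originalPrice":"2,199"}]}',
  ']}}};'
)

# Capture everything between the assignment and the final semicolon.
json_text <- str_match(page_text, "fbra_browseProductListConfig = (.*);")[, 2]
json <- fromJSON(json_text)
price_tables <- json$state$searchItemList$resultList$prices

# One value per product: the second price if present, otherwise NA.
normal_price <- vapply(
  price_tables,
  function(p) if (length(p$originalPrice) > 1) p$originalPrice[2] else NA_character_,
  character(1)
)
print(normal_price)
```

Returning NA instead of printing a message gives a vector that can be placed directly into a data frame column.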

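Regex explanation:

The pattern fbra_browseProductListConfig = (.*); matches the literal variable name and assignment, then captures everything that follows, up to the last semicolon on that line, in group 1. That captured group is the JSON string passed to fromJSON.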

QHarr
  • Thank you. Very interesting finding and solution. Apparently they are using React for the front end and sending the data to be placed as JSON. I'll need to investigate this a little further, as I see I can also get the product name and brand with this. – Omar Gonzales Jun 29 '19 at 18:56
  • It's easy. The key is title for name and brand for brand – QHarr Jun 29 '19 at 20:49

The page appears to be dynamic, so the data comes from somewhere else. I looked for GET responses carrying the data as JSON, XML, etc., but didn't find anything. I would go with RSelenium at this point. The following should extract the correct nodes. You can use whatever method you prefer to extract the numbers from the resulting strings:

# install.packages("RSelenium")
library(RSelenium)
library(rvest)

driver <- rsDriver(port = 4444L, browser = "firefox")
fox_client <- driver$client

url <- "https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1"
fox_client$navigate(url = url)

html <- fox_client$getPageSource()[[1]]

read_html(html) %>% 
    html_nodes(".fb-price:nth-child(2)") %>% 
    html_text()

#### OUTPUT ####

 [1] "S/  1,599 (Normal)"  "S/  3,999 (Normal)"  "S/  2,399 (Normal)"  "S/  1,149 (Normal)" 
 [5] "S/  1,399 (Normal)"  "S/  1,699 (Normal)"  "S/  4,999 (Normal)"  "S/  7,999 (Normal)" 
 [9] "S/  3,499 (Normal)"  "S/  12,999 (Normal)" "S/  9,798 (Normal)"  "S/  1,999 (Normal)" 
[13] "S/  2,499 (Normal)"  "S/  1,299 (Normal)"  "S/  2,499 (Normal)"  "S/  3,599 (Normal)" 
[17] "S/  8,999 (Normal)"  "S/  2,499 (Normal)"  "S/  8,599 (Normal)"  "S/  1,499 (Normal)" 
[21] "S/  2,199 (Normal)"  "S/  1,199 (Normal)"  "S/  699 (Normal)"    "S/  999 (Normal)"   
[25] "S/  29,999 (Normal)" "S/  499 (Normal)"    "S/  699 (Normal)"    "S/  4,999 (Normal)" 
[29] "S/  17,999 (Normal)" "S/  1,399 (Normal)" 
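
One possible way to turn the resulting strings into numbers, sketched in base R on a few of the values above. It assumes every string holds exactly one price with comma thousands separators:

```r
# A few strings copied from the output above.
price_text <- c("S/  1,599 (Normal)", "S/  12,999 (Normal)", "S/  699 (Normal)")

# Drop everything except digits and the decimal point, then convert.
price_value <- as.numeric(gsub("[^0-9.]", "", price_text))
print(price_value)
```

Stripping the commas works here because they are thousands separators; a locale that uses a decimal comma would need different handling.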

You can also navigate through the pages using findElement and clickElement. For more on that see Issue scraping page with "Load more" button with rvest.

  • I had something similar `p.fb-price:nth-child(2)` and you use only `.fb-price:nth-child(2)`, Shouldn't I target the tag and the class? – Omar Gonzales Jun 29 '19 at 18:44