R: Rvest - getting 2 elements (nodes) at the same time

Question

I'm doing some web scraping.

I need to get the actual_price and put the old_price in another column.

The problem is that not all products have an old_price element, because some of them are new.

Since the two results don't have the same length, I cannot join them in a data.frame.

When a product has no old_price, I would like to have NA in that cell.

Is there a way to do this with rvest?

Expected result:

Product    PriceNew    PriceOld
A           2300.00          NA
B              9.90       49.00
C           1299.00     2499.00
D            829.00     1499.00

![enter image description here][1]

As you can see in this example, one product has both a current and an old price, and the other one has only the current price.

I've been doing this:

# read_html() replaces html(), which is deprecated since rvest 0.3.0
Celulares_Telefonia_Precio_actual <- read_html(page_source[[1]]) %>%
  html_nodes(".product-itm-price-new") %>%
  html_text()

Celulares_Telefonia_Precio_antiguo <- read_html(page_source[[1]]) %>%
  html_nodes(".product-itm-price-old") %>%
  html_text()

All products have a price, but not all have an old price. So for products with only a new price, I would like to have NA in the Old_Price column.

  length(Celulares_Telefonia_Precio_actual)  gives 120

  length(Celulares_Telefonia_Precio_antiguo)  gives 114

EDIT 1:

Code to reproduce the situation. It is for the Celulares section:

Please run this Gist to get my data:

library(devtools)
source_gist("https://gist.github.com/OmarGonD/b70b712327d7e479f2c7")

EDIT 2:

I've tried looking at the overall container (Product Brand, Product Name, New Price, Old Price). With SelectorGadget I see that the overall container is "#catalog-items" (correct me if I'm wrong).

So I use:

    # read_html() replaces html(), which is deprecated since rvest 0.3.0
    Celulares_Telefonia_Catalogo <- read_html(page_source[[1]]) %>%
      html_nodes("#catalog-items")

But I have no idea how to extract the new and old prices from it, as the question asks.

Any hint is welcome.

You need to iterate not over the prices but over the containers that hold the prices, and then extract both elements from each parent. If you supplied a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), it would be easier to help you. – MrFlick, May 02 '15 at 01:14
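The approach suggested in this comment can be sketched as follows. This is only an illustration on made-up markup: the container class `product-item` and the inline HTML are assumptions, and only the two price classes come from the question. The key point is that `html_node()` (singular), applied to a set of containers, returns one result per container, so a missing old price becomes `NA` rather than shortening the vector.

```r
library(rvest)

# Hypothetical markup for illustration. The "product-item" container class
# is an assumption; the real page may use a different selector.
page <- read_html('
<div id="catalog-items">
  <div class="product-item">
    <span class="product-itm-price-new">2300.00</span>
  </div>
  <div class="product-item">
    <span class="product-itm-price-new">9.90</span>
    <span class="product-itm-price-old">49.00</span>
  </div>
</div>')

# One node per product, not one node per price
items <- html_nodes(page, ".product-item")

# html_node() yields exactly one entry per container; html_text() turns
# a missing match into NA, so both columns have the same length
prices <- data.frame(
  PriceNew = items %>% html_node(".product-itm-price-new") %>% html_text(),
  PriceOld = items %>% html_node(".product-itm-price-old") %>% html_text(),
  stringsAsFactors = FALSE
)

print(prices)
```

On the real page, the selector passed to `html_nodes()` would have to be whichever element wraps a single product, which SelectorGadget or the browser inspector can help identify.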

Metrics · Answer 1 · 2015-05-04T01:27:46.487

0

# This may be one solution
library(rvest)

# read_html() replaces html(), which is deprecated since rvest 0.3.0
kk1 <- read_html("http://www.linio.com.co/tecnologia/celulares-telefonia-gps/") %>%
  html_nodes(".product-item-price") %>%
  html_text()
# Remove all whitespace
kk2 <- gsub("\\s+", "", kk1)
# Split each string on "$", "-" or "Nuevo"
kk3 <- strsplit(kk2, "\\$|\\-|Nuevo")
# Bind the pieces row-wise into a character matrix
kk4 <- do.call(rbind, kk3)
kk5 <- kk4[, 2:3] # column 2 gives you new and column 3 gives you old (blank for no old price)

head(kk5)
     [,1]        [,2]       
[1,] "750.000"   "549.900"  
[2,] "999.900"   "579.900"  
[3,] "2.019.900" "1.729.900"
[4,] "2.399.900" "2.299.900"
[5,] "1.899.000" "1.099.900"
[6,] "2.500.000" "1.799.900"

edited May 04 '15 at 01:27

answered May 04 '15 at 01:09


I see what you did, but I think column 2 gives you the old prices, not the new ones. For example, a TV that cost 100 last month now costs 50. In your example, column 2 has the higher prices. – Omar Gonzales May 21 '15 at 21:00
