
I'm doing some web scraping.

I need to get the actual_price and put the old_price in another column.

The problem is that not all products have an old_price element, because some of them are new.

Because the two resulting vectors have different lengths, I cannot join them in a data.frame.

If a product has no old_price, I would like to have NA in that cell.

Is there a way to do this with rvest?

Expected result:

    Product    PriceNew    PriceOld
    A           2300.00          NA
    B              9.90       49.00
    C           1299.00     2499.00
    D            829.00     1499.00
  

![enter image description here][1]

As the example shows, one product has both a current and an old price, while the other has only the current price.

I've been doing this:

    library(rvest)

    # read_html() replaces html(), which is no longer available in current rvest
    Celulares_Telefonia_Precio_actual <- read_html(page_source[[1]]) %>%
      html_nodes(".product-itm-price-new") %>%
      html_text()

    Celulares_Telefonia_Precio_antiguo <- read_html(page_source[[1]]) %>%
      html_nodes(".product-itm-price-old") %>%
      html_text()

All products have a price, but not all have an old price. For products with only a new price, I would like to have NA in the Old_Price column.

    length(Celulares_Telefonia_Precio_actual)   # gives 120
    length(Celulares_Telefonia_Precio_antiguo)  # gives 114

EDIT 1:

Code to reproduce the situation, for the Celulares section. Please run this Gist to get my data:

    library(devtools)
    source_gist("https://gist.github.com/OmarGonD/b70b712327d7e479f2c7")

EDIT 2:

I've tried looking at the overall container (Product Brand, Product Name, New Price, Old Price). With SelectorGadget I see that the overall container is "#catalog-items" (correct me if I'm wrong).

So I use:

    Celulares_Telefonia_Catalogo <- read_html(page_source[[1]]) %>%
      html_nodes("#catalog-items")

But I have no idea how to extract the new and old prices from it, as the question asks.

Any hint is welcome.

Omar Gonzales
  • You need to iterate not over the prices but over the container that holds the prices, and then extract both elements from that parent. If you supplied a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), it would be easier to help you. – MrFlick May 02 '15 at 01:14
  • @MrFlick, I've put my code in a Gist. Please check it. – Omar Gonzales May 03 '15 at 02:31
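
The approach described in the first comment can be sketched as follows. This is only an illustration on a toy page: the markup, the `.product-item` container class, and the `.product-name` class are assumptions made for the example, not the real structure of the site. Only the two price classes come from the question. It also assumes rvest 1.0 or later, where `html_element()` returns exactly one result per input node, so a missing old price becomes NA.

```r
library(rvest)

# Toy page. The ".product-item" and ".product-name" classes are hypothetical;
# only the two price classes are taken from the question.
page <- minimal_html('
  <div id="catalog-items">
    <div class="product-item">
      <span class="product-name">A</span>
      <span class="product-itm-price-new">2300.00</span>
    </div>
    <div class="product-item">
      <span class="product-name">B</span>
      <span class="product-itm-price-new">9.90</span>
      <span class="product-itm-price-old">49.00</span>
    </div>
  </div>')

# One node per product, not one node per price
items <- html_elements(page, ".product-item")

# html_element() (singular) keeps one slot per product, so all columns
# have the same length and a missing element yields NA
prices <- data.frame(
  Product  = html_text(html_element(items, ".product-name"), trim = TRUE),
  PriceNew = html_text(html_element(items, ".product-itm-price-new"), trim = TRUE),
  PriceOld = html_text(html_element(items, ".product-itm-price-old"), trim = TRUE),
  stringsAsFactors = FALSE
)

print(prices)
```

Selecting the per-product container first is what keeps the columns aligned. Selecting each price class across the whole page drops the products that lack it, which is why the two vectors in the question have 120 and 114 elements.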

1 Answer

    # This may be one solution
    library(rvest)

    # read_html() replaces html(), which is no longer available in current rvest
    kk1 <- read_html("http://www.linio.com.co/tecnologia/celulares-telefonia-gps/") %>%
      html_nodes(".product-item-price") %>%
      html_text()

    # Remove whitespace
    kk2 <- gsub("\\s+", "", kk1)
    # Split each string on "$", "-" or "Nuevo"
    kk3 <- strsplit(kk2, "\\$|\\-|Nuevo")
    # Bind the pieces row-wise into a character matrix
    kk4 <- do.call(rbind, kk3)
    # Column 2 gives you new and column 3 gives you old (blank for no old price)
    kk5 <- kk4[, 2:3]

    head(kk5)
         [,1]        [,2]
    [1,] "750.000"   "549.900"
    [2,] "999.900"   "579.900"
    [3,] "2.019.900" "1.729.900"
    [4,] "2.399.900" "2.299.900"
    [5,] "1.899.000" "1.099.900"
    [6,] "2.500.000" "1.799.900"
Metrics
  • I see what you did. But I think column 2 gives you the old prices, not the newest ones. For example, a TV that cost 100 last month now costs 50, so the old price is the higher one. In your example, column 2 has the higher prices. – Omar Gonzales May 21 '15 at 21:00
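
For the string-splitting route in the answer, two details may be worth handling explicitly. `rbind()` recycles shorter vectors rather than leaving blanks, and the prices are character strings that use "." as a thousands separator. The following base R sketch pads ragged split results with NA and converts the strings to numbers. The `parts` list is made-up input standing in for `kk3`, the `to_number` helper is hypothetical, and the old/new column order follows the comment above, which is itself an assumption about the page.

```r
# Hypothetical split results standing in for kk3:
# the second product has no second price
parts <- list(
  c("", "750.000", "549.900"),
  c("", "999.900")
)

# Pad every vector to the same length; `length<-` fills with NA
n <- max(lengths(parts))
padded <- lapply(parts, function(p) {
  length(p) <- n
  p
})
m <- do.call(rbind, padded)

# Hypothetical helper: drop the "." thousands separators, then convert
to_number <- function(x) as.numeric(gsub(".", "", x, fixed = TRUE))

prices <- data.frame(
  PriceOld = to_number(m[, 2]),  # assumed order: higher, older price first
  PriceNew = to_number(m[, 3])
)

print(prices)
```

Note that in this layout a product with a single price ends up with NA in the second column, so the column labels would need to be checked against the actual page before relying on them.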