I'm trying to convert data scraped from Book Depository's bestselling books page into numeric data so that I can graph it.

My code currently is:

selector <- ".rrp" 
library(rvest)
url <- "https://www.bookdepository.com/bestsellers"
doc <- read_html(url)
prices <- html_nodes(doc, selector)
html_text(prices)
library(readr)
Spiral <- read_csv("C:/Users/Ellis/Desktop/INFO204/Spiral.csv")
View(Spiral)

My attempt to clean the data:

text <- gsub('[$NZ]', '', Spiral) # removes NZ$ from data

But the data now looks like this:

[1] "c(\"16.53\", \"55.15\", \"36.39\", \"10.80\", \"27.57\", \"34.94\", 
\"27.57\", \"22.06\", \"22.00\", \"16.20\", \"22.06\", \"22.06\", 
\"19.84\", \"19.81\", \"27.63\", \"22.06\", \"10.80\", \"27.57\", 
\"22.06\", \"22.94\", \"16.53\", \"25.36\", \"27.57\", \"11.01\", 
\"14.40\", \"15.39\")" 

and when I try to run:

as.numeric(text)

I get:

Warning message:
NAs introduced by coercion

How do I clean the data so that NZ$ is removed from each price and I can plot the cleaned data?

Jaap
Ellis Tagg
  • Maybe your data is in factor format and not in character format. In that case see: [*How to convert a factor to an integer\numeric without a loss of information?*](https://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information) – Jaap Sep 03 '17 at 10:56

2 Answers

You have a single string that contains code, not numbers. You need to evaluate the code first.

as.numeric(eval(parse(text=text)))
 [1] 16.53 55.15 36.39 10.80 27.57 34.94 27.57 22.06 22.00 16.20 22.06 22.06 19.84
[14] 19.81 27.63 22.06 10.80 27.57 22.06 22.94 16.53 25.36 27.57 11.01 14.40 15.39
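
As a sketch of why the string ends up containing code: gsub() expects a character vector, so when it is handed a whole data frame it coerces it with as.character(), which deparses each column into one string. The data frame and the column name `price` below are hypothetical stand-ins for Spiral, since the real column names are not shown in the question. Applying gsub() to the column itself may avoid the need for eval(parse()) altogether.

```r
# Hypothetical stand-in for Spiral; the column name `price` is an assumption
spiral <- data.frame(price = c("NZ$16.53", "NZ$55.15", "NZ$36.39"),
                     stringsAsFactors = FALSE)

# Passing the whole data frame: it is coerced with as.character(),
# so the column is deparsed into a single string of R code
whole <- gsub("[$NZ]", "", spiral)
length(whole)  # one string, not one element per price

# Passing the column: gsub() works element-wise on the character vector,
# and the result converts cleanly to numeric
cleaned <- as.numeric(gsub("[$NZ]", "", spiral$price))
cleaned
```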
G5W

Several options to get the desired outcome:

# option 1
as.numeric(gsub('(\\d+\\.\\d+).*', '\\1', html_text(prices)))
# option 2
as.numeric(gsub('\\s.*$', '', html_text(prices)))
# option 3
library(readr)
parse_number(html_text(prices))

all result in:

 [1] 21.00  9.99 31.49 19.49  6.49 13.50 22.49 11.99 11.49  7.99 10.99  7.99 10.99  9.99  7.99  9.99 11.49  8.49 11.99  9.99 14.95  8.99 20.13 13.50  8.49  6.49

NOTES:

  • The result is a vector of prices in euros. Due to localisation, prices may differ when you scrape from another country.
  • When the decimal separator is a comma (,) in html_text(prices), the first two options can be changed to as.numeric(gsub('(\\d+),(\\d+).*', '\\1.\\2', html_text(prices))) to get the correct result. The third option should in that case be changed to: parse_number(html_text(prices), locale = locale(decimal_mark = ','))
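
To illustrate the comma case from the second note, here is a minimal sketch on made-up input. The vector `raw` is a hypothetical sample, not actual scraped output, and the readr call is guarded in case the package is not installed.

```r
# Hypothetical sample of scraped text with comma decimals (not real scraped data)
raw <- c("21,00 EUR", "9,99 EUR", "31,49 EUR")

# Base R: swap the comma for a dot and drop the trailing currency text
base_result <- as.numeric(gsub("(\\d+),(\\d+).*", "\\1.\\2", raw))
base_result

# readr: tell parse_number() that the decimal mark is a comma
if (requireNamespace("readr", quietly = TRUE)) {
  readr_result <- readr::parse_number(
    raw, locale = readr::locale(decimal_mark = ",")
  )
  print(readr_result)
}
```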
Jaap