I have a simple web scraper that seems to behave strangely:
- in the desktop version of RStudio (running R version 3.3.3 on Windows) it behaves as expected and produces a numeric vector
- in the server version of RStudio (running R version 3.4.1 on Linux) the gsub() (and hence the numeric conversion afterwards) fails, and the code produces a vector of NAs.

Do you have any idea what could cause the difference?

library(rvest)

url <- "http://benzin.impuls.cz/benzin.aspx?strana=3"
impuls <- read_html(url, encoding = "windows-1250")

asdf <- impuls %>%
  html_table()

Benzin <- asdf[[1]]$X7

chrBenzin <- gsub("\\sKč","",Benzin)  # something is wrong here...

numBenzin <- as.double(chrBenzin)
numBenzin
Jindra Lacko

1 Answer

The whitespace in the values is a hard (non-breaking) space, U+00A0. After running the code, I copy/pasted the output for Benzin into ideone.com to inspect it.

At that point I was already sure those were hard spaces, but I double-checked anyway.
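
One way to run this kind of check directly in R is to look at the code points of a value. This is only a sketch: the sample string is hypothetical and is written with `\u` escapes so that the hard space is visible in the source.

```r
# Hypothetical sample value shaped like the scraped data:
# number, hard space (U+00A0), then "Kč" (č is U+010D).
x <- "29.60\u00a0K\u010d"

# Code points of every character in the string.
cp <- utf8ToInt(x)
cp

# A hard space shows up as 160 (0xA0); an ordinary space would be 32.
any(cp == 160L)
any(cp == 32L)
```

If 160 appears where you expected 32, the separator is a hard space.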

When hard spaces are involved, there are two alternatives to try.

One is to use [[:space:]] with TRE (the default regex engine in base R functions). The other is to use a PCRE regex with the (*UCP) verb at the start, which tells the regex engine that we are dealing with Unicode.
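
For reference, the first alternative (TRE with `[[:space:]]`) could be written as in the sketch below. The sample value is hypothetical, and no output is shown on purpose: whether `[[:space:]]` matches U+00A0 may depend on the platform and locale.

```r
# Hypothetical sample value; \u00a0 is the hard space, \u010d is "č".
x <- "29.60\u00a0K\u010d"

# TRE (the default engine) with the POSIX space class.
# The result may differ between platforms, so check it on yours.
res_tre <- gsub("[[:space:]]+K\u010d", "", x)
res_tre
```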

In your case, on Linux, the PCRE version seems to work, so you should stick with it (it is simply more consistent than TRE):

gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)

A quick online test on Linux R:

Benzin <- "29.60 Kč"
gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)
## => [1] "29.60"
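
Applied to a whole column, the same replacement followed by the numeric conversion from the question might look like this. It is a minimal sketch: the two prices are made-up sample values, and the hard space and `č` are written as `\u` escapes so the code does not depend on the encoding of the source file.

```r
# Hypothetical values in the shape of the scraped column:
# number, hard space (U+00A0), "Kč".
Benzin <- c("29.60\u00a0K\u010d", "30.10\u00a0K\u010d")

# (*UCP) makes \s Unicode-aware, so it also matches the hard space.
chrBenzin <- gsub("(*UCP)\\s+K\u010d", "", Benzin, perl = TRUE)

# With the suffix removed, the conversion no longer yields NA.
numBenzin <- as.double(chrBenzin)
numBenzin
```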
Wiktor Stribiżew
  • @JindraLacko: Just checked and the *stringr* only works for me (on Windows) with `str_replace_all(Benzin, "\\s*K\\u010D", "")`. Have not tried it in Linux. Can't make it work with the literal `č` though :( – Wiktor Stribiżew Aug 28 '17 at 08:17
  • thanks Wiktor, I really appreciate your help. All is well now :) I will avoid using `stringr` unless necessary (and it is not in this case; I was using `str_split()` somewhere else in my original code and forgot to delete the library call). I will edit my question so it does not confuse anyone in the future. – Jindra Lacko Aug 28 '17 at 08:24
  • Good, I just wanted to dig into this a bit deeper. *stringr* is based on [ICU regex library](http://userguide.icu-project.org/strings/regexp), and it has its own Unicode idiosyncrasies. – Wiktor Stribiżew Aug 28 '17 at 08:32
  • Thanks! I was puzzled, as this was my first encounter with platform inconsistency in executing an R script. I am still new to the language... – Jindra Lacko Aug 28 '17 at 09:51