Strange behaviour of regex in R

Question

I have a simple web scraper that seems to behave strangely:
- in the desktop version of RStudio (running R version 3.3.3 on Windows) it behaves as expected and produces a numeric vector
- in the server version of RStudio (running R version 3.4.1 on Linux) the gsub() (and hence the numeric conversion afterwards) fails, and the code produces a vector of NAs.

Do you have any idea what could cause the difference?

library(rvest)

url <- "http://benzin.impuls.cz/benzin.aspx?strana=3"
impuls <- read_html(url, encoding = "windows-1250")

asdf <- impuls %>%
  html_table()

Benzin <- asdf[[1]]$X7

chrBenzin <- gsub("\\sKč","",Benzin)  # something is wrong here...

numBenzin <- as.double(chrBenzin)
numBenzin

Try variations of the `gsub` - 1) `gsub("[[:space:]]*Kč","",Benzin)`, 2) `gsub("(*UCP)\\s*Kč","",Benzin, perl=TRUE)`. — Wiktor Stribiżew, Aug 27 '17 at 21:33
The local character (fyi Kč is the currency symbol, like $, in Czech) is not the problem; space is. The perl version works (thanks @WiktorStribiżew!) but why - when regular `\\s` does not - beguiles me... — Jindra Lacko, Aug 27 '17 at 22:01

score 3 · Accepted Answer · answered Aug 28 '17 at 06:46

The whitespace in the values is a hard space, U+00A0. After I ran the code, I got this output for Benzin (copy/pasted at ideone.com):

At that point I was already fairly sure those were hard spaces, but I double-checked here.
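
The same check can be done from R itself. This is a minimal sketch that assumes a hypothetical cell value (not the real scraped data) with the hard space written as an explicit escape; it lists the code point of every character so a U+00A0 (decimal 160) can be told apart from a regular space (decimal 32):

```r
# Hypothetical cell value (an assumption, not real scraped output).
# "\u00a0" is the hard space and "\u010d" is the letter č; both are
# written as escapes so they survive copy/paste unchanged.
x <- "29.60\u00a0K\u010d"

# Decimal code point of every character in the string
code_points <- utf8ToInt(x)
print(code_points)

# TRUE if the string contains a hard space (U+00A0 = 160)
has_hard_space <- 160L %in% code_points
# TRUE if the string contains a regular space (U+0020 = 32)
has_regular_space <- 32L %in% code_points

print(has_hard_space)
print(has_regular_space)
```

On the real data, the same call would be applied to a single element, for example `utf8ToInt(Benzin[1])`.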

When dealing with hard spaces, there are two alternatives to try.

One is to use [[:space:]] with TRE (the default regex engine in base R functions). The other is to use a PCRE regex with the (*UCP) verb at the start, which tells the regex engine to apply Unicode properties to shorthand classes such as \s.

In your case, on Linux, the PCRE version appears to work, so you should stick with it (PCRE is also more consistent than TRE):

gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)

A quick online test on Linux R:

Benzin <- "29.60\u00A0Kč"  # \u00A0 is the hard (non-breaking) space
gsub("(*UCP)\\s+Kč","",Benzin, perl=TRUE)
## => [1] "29.60"
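
The two alternatives described above can also be compared side by side. This is only a sketch: the `prices` vector is hypothetical and stands in for the scraped column, and the result of the [[:space:]] variant is printed rather than asserted, because whether it matches U+00A0 may depend on the platform:

```r
# Hypothetical values mimicking the scraped column (an assumption).
# "\u00a0" is the hard space and "\u010d" is the letter č.
prices <- c("29.60\u00a0K\u010d", "30.10\u00a0K\u010d")

# Alternative 1: POSIX class with the default TRE engine.
# Whether this removes the hard space can differ between platforms.
tre_result <- gsub("[[:space:]]*K\u010d", "", prices)

# Alternative 2: PCRE with the (*UCP) verb, so that \s is Unicode-aware.
pcre_result <- gsub("(*UCP)\\s*K\u010d", "", prices, perl = TRUE)

print(tre_result)
print(pcre_result)

# Numeric conversion of the PCRE result
num_prices <- as.numeric(pcre_result)
print(num_prices)
```

If the TRE result still contains the hard space on a given machine, `as.numeric()` on it returns NA, which is the symptom described in the question.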

Wiktor Stribiżew


@JindraLacko: Just checked and the *stringr* only works for me (on Windows) with `str_replace_all(Benzin, "\\s*K\\u010D", "")`. Have not tried it in Linux. Can't make it work with the literal `č` though :( – Wiktor Stribiżew Aug 28 '17 at 08:17
thanks Wiktor, I really appreciate your help. All is well now :) I will avoid using `stringr` unless necessary (and it is not in this case; I was using `str_split()` somewhere else in my original code and forgot to delete the library call). I will edit my question so it does not confuse anyone in the future. – Jindra Lacko Aug 28 '17 at 08:24
Good, I just wanted to dig into this a bit deeper. *stringr* is based on [ICU regex library](http://userguide.icu-project.org/strings/regexp), and it has its own Unicode idiosyncrasies. – Wiktor Stribiżew Aug 28 '17 at 08:32
Thanks! I was puzzled, as this was my first encounter with platform inconsistency when executing an R script. I am still new to the language... – Jindra Lacko Aug 28 '17 at 09:51
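
For reference, the *stringr* variant from the comments above might look like the sketch below. It assumes *stringr* is installed (the code is skipped otherwise) and uses a hypothetical input value. The letter č is written as the R escape `\u010d` rather than typed literally, which should keep the pattern in UTF-8 regardless of the native encoding. In ICU, `\s` is documented to include Unicode space separators, so it is expected to match the hard space:

```r
# Hypothetical value (an assumption); "\u00a0" is the hard space
x <- "29.60\u00a0K\u010d"

# Only run if stringr is available on this machine
if (requireNamespace("stringr", quietly = TRUE)) {
  # ICU regex: \s followed by "Kč", with č written as an R escape
  cleaned <- stringr::str_replace_all(x, "\\s*K\u010d", "")
  print(cleaned)
  print(as.numeric(cleaned))
}
```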
