
This question seems to make it easy to remove space characters from a string in R. However, when I load the following table, I am not able to remove the space inside a number (e.g. 11 846.4):

require(XML)
require(RCurl)
require(data.table)

link2fetch = 'https://www.destatis.de/DE/Themen/Branchen-Unternehmen/Landwirtschaft-Forstwirtschaft-Fischerei/Feldfruechte-Gruenland/Tabellen/ackerland-hauptnutzungsarten-kulturarten.html'

theurl = getURL(link2fetch, .opts = list(ssl.verifypeer = FALSE) ) # important!
area_cult10 = readHTMLTable(theurl, stringsAsFactors = FALSE)
area_cult10 = rbindlist(area_cult10)
    
test = sub(',', '.', area_cult10$V5) # change , to . 
test = gsub('(.+)\\s([A-Z]{1})*', '\\1', test) # remove LETTERS
gsub('\\s', '', test[1]) # remove white space?

Why can't I remove the space in test[1]? Could it be something other than a space character? Maybe the answer is really easy and I'm overlooking something. Thanks for any advice!

andschar
  • ok, after knitting an HTML document I've discovered that it's not a space but a non-breaking space. It shows up as `&nbsp;` in HTML and can be matched with `\u00A0`. Tricky! – andschar May 02 '17 at 09:35
  • I have tried your code and got `[1] "11846.4"` - no whitespace there. – Wiktor Stribiżew May 02 '17 at 09:39
  • strange. after restarting R and running the code I still get this space `[1] "11 846.4"`. However I can remove it with the above mentioned `\u00A0`. Maybe differing package versions? – andschar May 02 '17 at 09:47
  • You know, it got removed when I just ran your code. When I started to check if I can improve the regex, it stopped removing the space. I confirm: creating the `test` as you showed, the whitespace disappears. If I use `test1 <- gsub("[\\sA-Za-z]+", "", area_cult10$V5)` to remove all whitespaces and letters, the whitespace remains. And `gsub("[[:space:]A-Za-z]+", "", area_cult10$V5)` works. – Wiktor Stribiżew May 02 '17 at 09:49
  • Try `sub(",", ".", gsub("[[:space:]A-Za-z]+|\\W+$", "", area_cult10$V5), fixed=TRUE)` – Wiktor Stribiżew May 02 '17 at 09:54
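
The non-breaking space found in the comments above can be confirmed by looking at the code points of the string. The following is a minimal sketch, assuming a hypothetical value `x` that mimics the scraped cell (it is not read from the actual table):

```r
# Hypothetical value mimicking the scraped cell: the gap is U+00A0, not U+0020
x <- "11\u00A0846.4"

# Code points of each character; 160 (hex A0) is the non-breaking space,
# whereas an ordinary space would show up as 32
cp <- utf8ToInt(x)
print(cp)

# Position(s) of non-breaking spaces in the string
nbsp_pos <- which(cp == 160L)

# Remove the non-breaking space by naming it explicitly
cleaned <- gsub("\u00A0", "", x, fixed = TRUE)
print(cleaned)
```

Whether a plain `\\s` also matches this character appears to depend on the platform and regex engine, which would explain the differing results reported in the comments.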

1 Answer

You may shorten the creation of test to just 2 steps that use a single PCRE regex (note the perl=TRUE argument):

test = sub(",", ".", gsub("(*UCP)[\\s\\p{L}]+|\\W+$", "", area_cult10$V5, perl=TRUE), fixed=TRUE)

Result:

 [1] "11846.4" "6529.2"  "3282.7"  "616.0"   "1621.8"  "125.7"   "14.2"   
 [8] "401.6"   "455.5"   "11.7"    "160.4"   "79.1"    "37.6"    "29.6"   
[15] ""        "13.9"    "554.1"   "236.7"   "312.8"   "4.6"     "136.9"  
[22] "1374.4"  "1332.3"  "1281.8"  "3.7"     "5.0"     "18.4"    "23.4"   
[29] "42.0"    "2746.2"  "106.6"   "2100.4"  "267.8"   "258.4"   "13.1"   
[36] "23.5"    "11.6"    "310.2"  

The gsub regex is worth special attention:

  • (*UCP) - the PCRE verb that makes the pattern Unicode-aware, so that \\s also matches Unicode whitespace such as the non-breaking space
  • [\\s\\p{L}]+ - matches 1+ whitespace or letter characters
  • | - or (an alternation operator)
  • \\W+$ - 1+ non-word chars at the end of the string.

Then, sub(",", ".", x, fixed=TRUE) replaces the first , with a . and treats both as literal strings. fixed=TRUE saves a little time since no regex has to be compiled.
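
To try the two steps without fetching the page, here is a small self-contained sketch. The vector `x` is a hypothetical stand-in for `area_cult10$V5`: it assumes cells that use a non-breaking space as thousands separator, a comma as decimal mark, and an optional trailing footnote letter.

```r
# Hypothetical stand-in for area_cult10$V5 (not the real table contents)
x <- c("11\u00A0846,4 A", "6\u00A0529,2 A", "125,7", "401,6 B")

# Step 1: strip Unicode whitespace and letters anywhere in the string,
# plus any trailing non-word characters
step1 <- gsub("(*UCP)[\\s\\p{L}]+|\\W+$", "", x, perl = TRUE)

# Step 2: replace the first comma with a dot, as a literal string
test <- sub(",", ".", step1, fixed = TRUE)
print(test)
```

The cleaned strings can then be passed to `as.numeric()` if numbers are needed.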

Wiktor Stribiżew
  • Thanks for the detailed explanations! However with `[[:space:]]` I still don't get rid of the non-breaking space. I have to use `test = sub(",", ".", gsub("\u00A0|[[:space:][:alpha:]]+|\\W+$", "", area_cult10$V5), fixed=TRUE)` to make it work. It's still puzzling why it works for you.. – andschar May 02 '17 at 10:50
  • @andrasz: Hm, I have 2 ideas how to solve it in another way, but no idea as to why it fails in different cases. Try also with `gsub` using `"(*UCP)[\\s\\p{L}]+|\\W+$"` pattern while passing `perl=TRUE` argument. Are you on Linux? – Wiktor Stribiżew May 02 '17 at 10:53
  • Yes, on Linux Mint 18 based on Ubuntu 14.04. Does that help? – andschar May 02 '17 at 10:55
  • YES - see [`x <- c("11 846.4 A", "6 529.2 A", "3 282.7 A") gsub("(*UCP)\\s+", "", x, perl=TRUE)`](https://ideone.com/EyJ9r6). – Wiktor Stribiżew May 02 '17 at 10:56
  • ok, thanks! However `\u00A0` is also fine imho. Will have a look into [**PCRE**](https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions) at one point. – andschar May 02 '17 at 11:06
  • Yes, you may enumerate all the Unicode whitespace code points, and use something like `[ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]` (note the escape sequences are compatible with JavaScript, this is taken from MDN site), but when you use `\s` with the `(*UCP)` verb, it will match all Unicode whitespace. No need to worry about it next time. – Wiktor Stribiżew May 02 '17 at 18:45
  • Fair enough! As Unicode white spaces are rather new to me, I'll stick with your `(*UCP)` suggestion. – andschar May 02 '17 at 19:41
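
The two approaches from the last comments, an enumerated character class versus `\s` under `(*UCP)`, can be compared side by side. This is only a sketch: the vector `x` is made up, and the explicit class below is a shortened, hypothetical subset of the full list quoted above, written with R string escapes.

```r
# Made-up inputs: non-breaking space, thin space (U+2009), ordinary space
x <- c("a\u00A0b", "a\u2009b", "a b")

# Unicode-aware \s: one short pattern covers every kind of whitespace here
ucp <- gsub("(*UCP)\\s+", "", x, perl = TRUE)

# Explicit enumeration (shortened subset): every code point must be listed
explicit <- gsub("[ \t\n\r\f\u00a0\u2000-\u200a\u3000]+", "", x, perl = TRUE)

print(ucp)
print(explicit)
```

The explicit class only removes what it lists, so a whitespace code point missing from it would survive, while the `(*UCP)` variant needs no maintenance.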