
I seem to have encountered an enigmatic character in R that breaks my code. I am using R, version 4.2.3.

Take the two strings a and b:

a
[1] "Actinomyces naeslundii"
b
[1] "Actinomyces naeslundii"

Despite appearances, a and b are not identical.

a==b
[1] FALSE

Consistently, a does not match b:

grepl(a,b)
[1] FALSE

Interestingly, not all characters are identical between a and b:

strsplit(a, "")[[1]]
[1] "A" "c" "t" "i" "n" "o" "m" "y" "c" "e" "s" " " "n" "a" "e" "s" "l" "u" "n" "d" "i" "i"
strsplit(b, "")[[1]]
[1] "A" "c" "t" "i" "n" "o" "m" "y" "c" "e" "s" " " "n" "a" "e" "s" "l" "u" "n" "d" "i" "i"
strsplit(a, "")[[1]] == strsplit(b, "")[[1]]
[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[21]  TRUE  TRUE

Character #12 is different. It looks like an innocent whitespace, only it isn't:

strsplit(a, "")[[1]][12]
[1] " "
strsplit(b, "")[[1]][12]
[1] " "
strsplit(a, "")[[1]][12] == strsplit(b, "")[[1]][12]
[1] FALSE
" " == strsplit(a, "")[[1]][12]
[1] TRUE
" " == strsplit(b, "")[[1]][12]
[1] FALSE
grepl("\\s", strsplit(a, "")[[1]][12])
[1] TRUE
grepl("\\s", strsplit(b, "")[[1]][12])
[1] FALSE

Using dput:

dput(a)
"Actinomyces naeslundii"
dput(b)
"Actinomyces naeslundii"
dput(a, file = "a.dput")
dput(b, file = "b.dput")

The generated files differ by one byte:

$ ls -lah *dput
-rw-r--r-- 1 johannes johannes 25 May 16 20:23 a.dput
-rw-r--r-- 1 johannes johannes 26 May 16 20:23 b.dput

Have you encountered this character? What could it be? How can I search for it in my data frames?

L Tyrone
Johannes
  • Please provide repro data: `dput(a)` and `dput(b)`. – zx8754 May 16 '23 at 20:19
  • What's the output of `charToRaw(strsplit(b, "")[[1]][12])`? – Brian May 16 '23 at 20:27
  • a and b are the same for me. – zx8754 May 16 '23 at 20:39
  • Based on your `charToRaw` output, [this seems relevant](https://stackoverflow.com/q/2774471/903061). I would suggest some regex replacement of general whitespace, `gsub(pattern = "\\s+", replacement = " ", b)` should replace any form of whitespace with a normal space. – Gregor Thomas May 16 '23 at 20:58
  • Note that the `\s` token does not match a no-break space in either of the regex flavors used by base R. Use `\h` to match all horizontal whitespace, or match it exactly with `\U00A0`. `\s` will work with `stringr` and `stringi` functions, which use a different regex engine (ICU). – Ritchie Sacramento May 17 '23 at 01:48
  • @GregorThomas thank you; the post is indeed relevant. Using `gsub(pattern = "\\s+", replacement = " ", b)`, however, didn't work. – Johannes May 17 '23 at 08:24
  • @RitchieSacramento thank you; matching pattern `\h` works with package stringr, but not with standard `gsub`. I can make a solution work with the approach you suggested. – Johannes May 17 '23 at 08:27

1 Answer

Thanks to the useful comments, I can now solve the mystery above: character 12 of b is a no-break space (U+00A0), not an ordinary space.

There are at least two modifications that render string b identical to a.

Replacing the Unicode character \U00A0 (no-break space) with a regular space (" "):

> b.mod <- gsub("\U00A0", " ", b)
> b.mod == a
[1] TRUE
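
To confirm which character is involved, one can inspect its code point and bytes, along the lines of the `charToRaw` suggestion in the comments. The sketch below rebuilds b with an explicit `\u00a0` escape, which is an assumption about the original data. A UTF-8 encoded no-break space occupies two bytes, which would be consistent with b.dput being one byte larger than a.dput.

```r
# Assumption: b contains a no-break space (U+00A0) at position 12.
a <- "Actinomyces naeslundii"
b <- "Actinomyces\u00a0naeslundii"

# Code point of the 12th character in each string.
cp_a <- utf8ToInt(substr(a, 12, 12))  # 32, an ordinary space
cp_b <- utf8ToInt(substr(b, 12, 12))  # 160, i.e. 0xA0

# Raw bytes of that character after forcing UTF-8: c2 a0 (two bytes).
bytes_b <- charToRaw(enc2utf8(substr(b, 12, 12)))

# Positions at which the two strings differ.
diff_pos <- which(utf8ToInt(a) != utf8ToInt(b))
```

Comparing code points this way locates the odd character even when both strings print identically.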

Replacing horizontal whitespace \h with space (" ") using packages stringr or stringi:

> b.mod1 <- stringi::stri_replace_all(b, " ", regex = "\\h")
> b.mod1 == a
[1] TRUE
> b.mod2 <- stringr::str_replace_all(b, "\\h", " ")
> b.mod2 == a
[1] TRUE

However, replacing \h or \s+ with a space (" ") does not work with base R's gsub and its default regex engine:

> b.mod3 <- gsub("\\h", " ", b)
> b.mod3 == a
[1] FALSE
> b.mod4 <- gsub("\\s+", " ", b)
> b.mod4 == a
[1] FALSE
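
A possible base R alternative, not verified in the original post: by default gsub uses the TRE engine, while `perl = TRUE` switches to PCRE, where `\h` denotes horizontal whitespace and, as far as I am aware, covers U+00A0 for UTF-8 input. The sketch again assumes that b contains `\u00a0`; the names b.mod5 and b.mod6 are introduced here only for illustration.

```r
a <- "Actinomyces naeslundii"
b <- "Actinomyces\u00a0naeslundii"  # assumption: U+00A0 at position 12

# perl = TRUE selects the PCRE engine instead of the default TRE engine.
b.mod5 <- gsub("\\h", " ", b, perl = TRUE)
b.mod5 == a

# Matching the character literally needs no regex at all.
b.mod6 <- gsub("\u00a0", " ", b, fixed = TRUE)
b.mod6 == a
```

The `fixed = TRUE` variant touches only U+00A0, whereas `\h` would also normalise tabs and other horizontal whitespace, so the choice depends on how much of the original spacing should be preserved.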

Again, thanks to all commenters!
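
As for searching data frames for this character, a minimal sketch could look like the following. The data frame `df` and its columns are hypothetical and serve only as an illustration; the idea is to test every character column for a literal U+00A0 and then replace it.

```r
# Hypothetical data frame; the second species name contains U+00A0.
df <- data.frame(
  species = c("Actinomyces naeslundii", "Actinomyces\u00a0naeslundii"),
  count   = c(3L, 5L),
  stringsAsFactors = FALSE
)

nbsp <- "\u00a0"

# Identify the character columns.
is_chr <- vapply(df, is.character, logical(1))

# For each character column, the row indices that contain a no-break space.
hits <- lapply(df[is_chr], function(col) which(grepl(nbsp, col, fixed = TRUE)))

# Replace the no-break space with an ordinary space in those columns.
df[is_chr] <- lapply(df[is_chr], function(col) gsub(nbsp, " ", col, fixed = TRUE))
```

Here `hits$species` lists the affected rows before the cleanup, which helps to check how widespread the problem is before overwriting anything.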

Johannes