I seem to have encountered an enigmatic character in R that breaks my code. I am using R, version 4.2.3.
Take the two strings a and b:
a
[1] "Actinomyces naeslundii"
b
[1] "Actinomyces naeslundii"
Despite appearances, a and b are not identical.
a==b
[1] FALSE
Consistently, a does not match b:
grepl(a,b)
[1] FALSE
Interestingly, not all characters are identical between a and b:
strsplit(a, "")[[1]]
[1] "A" "c" "t" "i" "n" "o" "m" "y" "c" "e" "s" " " "n" "a" "e" "s" "l" "u" "n" "d" "i" "i"
strsplit(b, "")[[1]]
[1] "A" "c" "t" "i" "n" "o" "m" "y" "c" "e" "s" " " "n" "a" "e" "s" "l" "u" "n" "d" "i" "i"
strsplit(a, "")[[1]] == strsplit(b, "")[[1]]
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[21] TRUE TRUE
Character #12 is different. It looks like an innocent whitespace, only it isn't:
strsplit(a, "")[[1]][12]
[1] " "
strsplit(b, "")[[1]][12]
[1] " "
strsplit(a, "")[[1]][12] == strsplit(b, "")[[1]][12]
[1] FALSE
" " == strsplit(a, "")[[1]][12]
[1] TRUE
" " == strsplit(b, "")[[1]][12]
[1] FALSE
grepl("\\s", strsplit(a, "")[[1]][12])
[1] TRUE
grepl("\\s", strsplit(b, "")[[1]][12])
[1] FALSE
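Because both characters print as a blank, comparing the printed glyphs is not enough; comparing Unicode code points is one way to see what position 12 actually holds. The sketch below is a minimal illustration, not a diagnosis of the real data: the stand-in string b uses U+00A0 (no-break space) purely as an assumption, since the actual character cannot be read off the output above.

```r
# Hypothetical stand-ins for the two strings. U+00A0 (no-break space) is an
# assumption, chosen only because it renders like an ordinary space.
a <- "Actinomyces naeslundii"
b <- "Actinomyces\u00a0naeslundii"

# utf8ToInt() returns one integer code point per character.
cp_a <- utf8ToInt(a)
cp_b <- utf8ToInt(b)

# Positions where the code points differ, and the differing code point in hex.
diff_pos <- which(cp_a != cp_b)
diff_hex <- sprintf("U+%04X", cp_b[diff_pos])

print(diff_pos)
print(diff_hex)
```

Running the same two utf8ToInt() calls on the real a and b should reveal the code point of the unknown character, which can then be looked up in a Unicode table.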
Using dput:
dput(a)
"Actinomyces naeslundii"
dput(b)
"Actinomyces naeslundii"
dput(a, file = "a.dput")
dput(b, file = "b.dput")
The generated files differ by one byte:
$ ls -lah *dput
-rw-r--r-- 1 johannes johannes 25 May 16 20:23 a.dput
-rw-r--r-- 1 johannes johannes 26 May 16 20:23 b.dput
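The one-byte difference can also be checked from within R. A size of 25 bytes for a.dput would be consistent with 22 single-byte characters plus two quotes and a trailing newline, which suggests that b contains one character encoded in two bytes. The following sketch compares character and byte counts; the stand-in string b again uses U+00A0 as an assumption and is not taken from the real data.

```r
# Hypothetical stand-ins; the U+00A0 in b is an assumed example of a
# space-like character that needs two bytes in UTF-8.
a <- "Actinomyces naeslundii"
b <- "Actinomyces\u00a0naeslundii"

# Same number of characters ...
chars_a <- nchar(a, type = "chars")
chars_b <- nchar(b, type = "chars")

# ... but a different number of bytes.
bytes_a <- nchar(a, type = "bytes")
bytes_b <- nchar(b, type = "bytes")

# Raw bytes of the 12th character in each string.
raw_a12 <- charToRaw(substr(a, 12, 12))
raw_b12 <- charToRaw(substr(b, 12, 12))

print(c(chars_a, chars_b, bytes_a, bytes_b))
print(raw_a12)
print(raw_b12)
```

Applied to the real strings, charToRaw() on the 12th character shows its exact byte sequence, independent of how the console renders it.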
Have you encountered this character? What could it be? How can I search for it in my data frames?
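For the data frame part, one possible approach is to flag every cell of every character column that contains a code point outside the ASCII range. This is only a sketch under stated assumptions: the data frame df, its columns, and the helper has_non_ascii() are hypothetical names made up for illustration, and the U+00A0 in the example row is an assumed stand-in for the unknown character.

```r
# Hypothetical example data; the second species name contains an assumed U+00A0.
df <- data.frame(
  species = c("Actinomyces naeslundii", "Actinomyces\u00a0naeslundii"),
  count = c(3L, 5L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each element holding any code point above 127.
has_non_ascii <- function(x) {
  vapply(
    enc2utf8(x),
    function(s) !is.na(s) && any(utf8ToInt(s) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Restrict the search to character columns, then collect the offending rows.
chr_cols <- names(df)[vapply(df, is.character, logical(1))]
hits <- lapply(df[chr_cols], function(col) which(has_non_ascii(col)))

print(hits)
```

Each element of hits lists the row indices in that column that contain at least one non-ASCII character. Legitimate accented letters would be flagged as well, so the result is a list of candidates to inspect rather than a list of errors.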