I have a database containing the names of many people, but the names are encoded in three different formats: UTF-8, Latin-1 and ASCII. I imported the data with data.table's fread function, using both the encoding = "UTF-8" and encoding = "Latin-1" options. However, the observations in ASCII are still displayed incorrectly. Here is an example of what the first few names look like:
NAMES
1: NICOL<c0>S
2: CAR<d1>PS
3: MU<d1>OZ
4: CATA<d1>O
I want to filter these observations. I tried something like:
data[grepl("\\<c0\\>", NAMES)]
However, this does not work, because grepl("\\<c0\\>", NAMES) returns FALSE for every row. The only way I managed to get the desired match was:
data[grepl("À", data$NAMES, useBytes = T)]
However, while I understand what is going on (for the most part), I don't understand why I have to put the literal character "À" inside grepl() instead of the text R displays, namely the <c0> in "NICOL<c0>S".
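A minimal sketch of what seems to be going on, assuming the names hold Latin-1 bytes that end up marked as UTF-8 on import. The vector x is a hypothetical stand-in for data$NAMES, built by hand from byte escapes; it is not the real data.

```r
# Hypothetical stand-in for data$NAMES: Latin-1 bytes wrongly marked as UTF-8.
x <- c("NICOL\xc0S", "CAR\xd1PS", "MU\xd1OZ", "CATA\xd1O")
Encoding(x) <- "UTF-8"

# The first name holds the single byte 0xc0, not the four characters "<c0>";
# the angle-bracket form is only how R prints a byte it cannot decode.
bytes <- charToRaw(x[1])
print(bytes)

# Searching for the displayed text therefore finds nothing, even byte-wise.
# (In a regex, "\\<" and "\\>" are additionally word-boundary anchors,
# so "\\<c0\\>" would look for the word "c0", not for angle brackets.)
literal_hit <- grepl("<c0>", x, fixed = TRUE, useBytes = TRUE)

# Searching for the byte itself, with byte-wise matching, does find it.
byte_hit <- grepl("\xc0", x, fixed = TRUE, useBytes = TRUE)
```

Writing the pattern as a byte escape such as "\xc0" or "\xd1" avoids depending on how the script or console encodes a typed "À" or "Ñ", which may be one reason a typed character matches in some cases and not in others.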
This is a problem because the approach doesn't always work, particularly with "Ñ". In the sample data above, rows 2 and 3 contain "Ñ" but are displayed as <d1>, and substituting Ñ into grepl() does not work. That is, the following code does not work:
data[grepl("Ñ", data$NAMES, useBytes = T)]
It would also be nice to know if there is a better/more efficient way to do this. Thanks!
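One possible direction, sketched under the assumption that the problem rows contain Latin-1 bytes mislabelled as UTF-8: flag the strings that are not valid UTF-8 and re-encode only those. The vector x is again a hypothetical stand-in for data$NAMES, and clean is an illustrative name.

```r
# Hypothetical stand-in for data$NAMES: Latin-1 bytes wrongly marked as UTF-8.
x <- c("NICOL\xc0S", "CAR\xd1PS", "MU\xd1OZ", "CATA\xd1O")
Encoding(x) <- "UTF-8"

# validUTF8() is TRUE for well-formed UTF-8 (including plain ASCII),
# so its negation flags the rows that print with <c0>, <d1>, ...
bad <- !validUTF8(x)

# Reinterpret only the flagged strings as Latin-1 and convert them to UTF-8.
clean <- x
clean[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")

# After conversion, ordinary matching works without useBytes.
# "\u00d1" is the Unicode escape for the capital letter N with tilde.
has_enye <- grepl("\u00d1", clean, fixed = TRUE)
```

On the real table the same idea would presumably be something like data[!validUTF8(NAMES)] to filter the affected rows, though I have not checked this against the full data set.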
Reproducibility
To obtain these results you can download a small sample here. Please be sure to import the data using fread("stack.csv", encoding = "UTF-8") to obtain the same results. This is necessary because the file only contains NAMES in ASCII format, so without encoding = "UTF-8" the data is imported differently (the grave accents are actually shown; for some reason, however, this is not replicable at scale with the whole data set). Sorry, I couldn't build the example through console commands; for some reason that doesn't yield a replicable example.
Edit 1: added Reproducibility section.