I have a database containing the names of many people, but the names are encoded in three different formats: mainly UTF-8, Latin-1 and ASCII. I imported the data with data.table's fread function, using both the encoding = "UTF-8" and encoding = "Latin-1" options. However, the observations in ASCII are still displayed incorrectly. Here is an example of what the first few names look like:

                        NAMES         
  1:               NICOL<c0>S
  2:                CAR<d1>PS
  3:                 MU<d1>OZ
  4:                CATA<d1>O

I want to filter these observations. I tried filtering the data using something like:

data[grepl("\\<c0\\>", NAMES)]

However, this does not work, as grepl("\\<c0\\>", NAMES) returns FALSE for every row. Indeed, the only way I managed to get the desired match was:

data[grepl("À", data$NAMES, useBytes = TRUE)]

However, while I understand what is going on (for the most part), I don't understand why I have to place the ASCII character "À" inside grepl() instead of the text displayed by R, namely the <c0> in "NICOL<c0>S".

This is a problem because doing this doesn't always work, particularly with "Ñ". In the sample data above, rows 2 and 3 contain "Ñ" but are displayed as <d1>, and substituting Ñ into grepl() does not work. That is to say, the following code does not work:

data[grepl("Ñ", data$NAMES, useBytes = TRUE)]
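
To make the byte-level picture concrete, here is a minimal sketch in base R. It assumes that `<c0>` is how a single invalid byte (0xC0) gets printed, rather than literal `<`, `c`, `0`, `>` characters stored in the string. The value `x` is a hypothetical stand-in for one entry of NAMES, not taken from the real data, and `fixed = TRUE` is used here only to keep the byte search as plain as possible.

```r
# Hypothetical stand-in for one NAMES value: "NICOL", the single byte 0xC0, then "S".
x <- "NICOL\xc0S"
Encoding(x) <- "UTF-8"  # mimic Latin-1 bytes that were declared as UTF-8 on import

# Inspect the underlying bytes: there is one byte c0 and no "<" or ">" bytes.
charToRaw(x)            # 4e 49 43 4f 4c c0 53

# "\\<c0\\>" searches for the literal text "c0" between word boundaries,
# which does not occur in the bytes above.
literal_match <- grepl("\\<c0\\>", x, useBytes = TRUE)

# Searching for the raw byte itself, written as a hex escape, finds the value.
byte_match <- grepl("\xc0", x, fixed = TRUE, useBytes = TRUE)

print(c(literal = literal_match, byte = byte_match))
```

If this assumption holds for the real data, the same hex-escape form with `"\xd1"` would be the byte-level counterpart for the values displayed as `<d1>`.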

It would also be nice to know if there is a better/more efficient way to do this. Thanks!


Reproducibility

To obtain these results you can download a small sample here. Please be sure to import the data using fread("stack.csv", encoding = "UTF-8") to obtain the same results. This is necessary because this file only contains NAMES in ASCII format, so not specifying encoding = "UTF-8" will import the data differently (actually showing the grave accents; for some reason, however, this is not replicable at scale with the whole data set). Sorry I couldn't provide this through console commands; for some reason that doesn't yield a replicable example.


Edit 1: added Reproducibility section.

cach dies
  • I don't understand how you can use both "UTF-8" and "LATIN-1" encodings at the same time. Can you please share your data in a [reproducible format](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? The output you are showing doesn't look like how R normally prints character values. For me `grepl("<c0>", "NICOL<c0>S")` matches just fine. But I can't tell if you have literal greater than/less than symbols in your string or if those are being escaped somewhere. What does `Encoding(data$NAMES)` return? What OS are you using? – MrFlick Apr 13 '20 at 05:11
  • @MrFlick, thanks for your help. I imported the data twice, once with UTF-8 encoding and once with Latin-1, and then combined their outputs. It seems I will have to upload a portion of the data for reproducibility. As to the greater than symbols, I can't tell either, and that is part of where the difficulty comes from. R does display the columns exactly as shown. Using `data[grepl("\\<", data$NAMES)]` will return those names which are displayed with a `<` symbol. However, using `data[grepl("\\ – cach dies Apr 13 '20 at 14:04
  • @MrFlick, I have added a Reproducibility section to the main post. – cach dies Apr 13 '20 at 16:08
  • The file you linked to is not ASCII or UTF-8 encoded. It uses Latin-1 encoding. When you read it in with the wrong encoding, bad things will happen. The `<>` part seems to be added by data.table as a way of representing invalid characters. Normally you would find those characters using the hex escape code. For example, to find what data.table prints as "<c0>" you would use `grepl("\xc0", dt$NAME, useBytes = TRUE)`. I'm not sure what your end goal is here, but I would be sure to always read in files with the same encoding that was used when they were created. – MrFlick Apr 13 '20 at 17:55
  • @MrFlick, yes, as I noted, since this sample *only* contains the observations with ASCII characters, it is necessary to use `encoding = "UTF-8"` to replicate the way the data is displayed. The original file has 3 separate encodings (possibly more): UTF-8, Latin-1, and ASCII. `fread()`, however, can only import UTF-8 and Latin-1 encodings. So while this is not the original file, it is meant only to serve as a vehicle for replicability. For this example one could use `encoding = "Latin-1"` and be done with it; however, that would not work for the original file or the problem at large. – cach dies Apr 13 '20 at 20:38
  • My end goal is to correctly load all the data into R and hopefully be able to correct observations like these by using something like `gsub("\xc0", "À", dt$NAME)`. Thank you for your observations on the `<>` characters introduced by `data.table`. I believe that to be a correct answer and I would mark it as such if you posted it. – cach dies Apr 13 '20 at 20:41
  • If you have a file with mixed encodings you are going to have a lot of problems. I'm really doubtful that `gsub` is going to be that helpful. But if it works for you, feel free to post your own answer below. I'm still very confused as to what exactly is going on, so I won't post an answer. – MrFlick Apr 13 '20 at 23:44
  • It is a lot of manual work with `gsub()` but I can't think of any other way to properly load all the data into R so it'll have to do. P.S. it sure causes a lot of problems! – cach dies Apr 14 '20 at 00:19
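
As a hedged sketch of the ideas in this thread (not something either commenter posted), the hex-escape search can be combined with a per-value repair that avoids one `gsub()` call per character. It assumes that every value which is not valid UTF-8 is actually Latin-1, which may not hold for the real file. `names_raw` is hypothetical sample data. `validUTF8()`, `iconv()` and `enc2utf8()` are base R functions.

```r
# Hypothetical sample: one valid UTF-8 name and two names holding raw Latin-1 bytes.
names_raw <- c(enc2utf8("MU\u00d1OZ"), "NICOL\xc0S", "CATA\xd1O")

# Locate a specific raw byte with a hex escape (0xD1 is "Ñ" in Latin-1).
# The UTF-8 "Ñ" in the first value is the byte pair c3 91, so it is not matched.
has_d1 <- grepl("\xd1", names_raw, fixed = TRUE, useBytes = TRUE)

# Assumption: any value that is not valid UTF-8 is Latin-1.
bad <- !validUTF8(names_raw)

# Convert only those values; values that are already valid UTF-8 stay untouched.
fixed_names <- names_raw
fixed_names[bad] <- iconv(names_raw[bad], from = "latin1", to = "UTF-8")

# Every value should now be valid UTF-8.
print(all(validUTF8(fixed_names)))
```

One caveat with this design: a Latin-1 byte sequence can occasionally happen to be valid UTF-8, so a check like `validUTF8()` is a heuristic rather than a guarantee, and the converted names would still need a spot check.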

0 Answers