Fix text encoding in R

Question

I am having an issue with text encoding that I cannot solve.

I have a string in an Excel file that I'm reading into R that looks like: Productâ„¢. With a bit of research, I learned that the â„¢ is a UTF-8-encoded ™ that has been decoded incorrectly as CP-1252.

The UTF-8 byte sequence for ™ is 0xE2 0x84 0xA2. Decoded as CP-1252, those three bytes become: â (E2) „ (84) ¢ (A2).
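
The byte-level claim can be checked with a small base R sketch. It writes ™ as the escape `\u2122` so the result does not depend on how the script file itself is saved; the variable names are only illustrative.

```r
# U+2122 (trade mark sign) as a UTF-8 string
tm <- "\u2122"

# Its UTF-8 byte sequence: e2 84 a2
tm_bytes <- charToRaw(tm)
print(tm_bytes)

# Decode those same three bytes as CP-1252 instead of UTF-8.
# Each byte becomes its own character: U+00E2, U+201E, U+00A2.
misread <- iconv(rawToChar(tm_bytes), from = "CP1252", to = "UTF-8")
print(misread)
```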

How can I fix this issue? I have tried using:

iconv("Productâ„¢", "cp1252", "utf-8")

#> [1] "ProductÃ¢â€žÂ¢"

But as you can see, the output is incorrect. The desired output is Product™.
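
A possible explanation for that output, sketched in base R: the garbled string is already held in R as UTF-8, so converting from CP-1252 to UTF-8 treats each of its UTF-8 bytes as a separate CP-1252 character and encodes the text a second time. The string is written with `\u` escapes here only so the example does not depend on the file's own encoding.

```r
# The garbled text as R holds it: "Product" + U+00E2, U+201E, U+00A2
garbled <- "Product\u00e2\u201e\u00a2"

# Those three characters occupy seven bytes in UTF-8
garbled_tail <- tail(charToRaw(garbled), 7)
print(garbled_tail)  # c3 a2 e2 80 9e c2 a2

# Reading the seven bytes as CP-1252 yields seven characters,
# which is the doubly garbled output shown above
twice <- iconv(garbled, from = "CP1252", to = "UTF-8")
print(twice)
```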

Any ideas about how to fix this issue? The incorrect data is in an Excel spreadsheet, but I am trying to clean the text in R. A solution to fix the original data or a data cleaning solution in R would be great.

score 1 · Accepted Answer · answered May 24 '23 at 22:16

Update: I had the arguments backwards. The garbled string is held in R as UTF-8, so it has to be converted from UTF-8 back to CP-1252 to recover the original bytes (0xE2 0x84 0xA2), which are the UTF-8 encoding of ™. I was able to solve it by using:

iconv("Productâ„¢", "utf-8", "cp1252")

#> [1] "Product™"
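
For a whole column rather than a single string, the same idea could be wrapped in a helper. This is only a sketch: `fix_mojibake` is a hypothetical name, not a function from any package, and it assumes the damage is exactly one round of UTF-8 read as CP-1252. It marks the result as UTF-8 explicitly and keeps the original value whenever the round trip fails or does not produce valid UTF-8, so strings that are already clean pass through unchanged.

```r
# Hypothetical helper: undo one round of "UTF-8 decoded as CP-1252"
fix_mojibake <- function(x) {
  x <- enc2utf8(x)

  # Recover the original bytes; strings containing characters that
  # do not exist in CP-1252 come back as NA
  fixed <- iconv(x, from = "UTF-8", to = "CP1252")

  # The recovered bytes are meant to be UTF-8, so mark them as such
  Encoding(fixed) <- "UTF-8"

  # Accept the repair only where it produced valid UTF-8
  ok <- !is.na(fixed) & validUTF8(fixed)
  x[ok] <- fixed[ok]
  x
}

products <- c("Product\u00e2\u201e\u00a2", "Product\u2122", "caf\u00e9")
cleaned <- fix_mojibake(products)
print(cleaned)
```

Five byte values are undefined in CP-1252 (0x81, 0x8D, 0x8F, 0x90, 0x9D), so text whose original UTF-8 bytes included one of them may not round-trip this way and would be left as it was.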

Special thanks to @BalusC and this answer which showed me how to identify which encodings were being used erroneously.

score 0 · Answer 2 · answered May 24 '23 at 22:31

You can also try specifying the encoding when reading the file.

Assuming your file is a CSV, you can do something like this:

# `encoding` marks the strings that are read as UTF-8; it does not re-encode them
data <- read.csv("data.csv", encoding = "UTF-8")
print(data)
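
A self-contained way to try this, as a sketch: the example below writes a tiny UTF-8 CSV to a temporary file (the sample data and column name are made up) and reads it back. Note that `read.csv` has two related arguments: `encoding` only marks the strings that are read, while `fileEncoding` declares the file's encoding so the data can be re-encoded on input.

```r
# Write a tiny CSV whose bytes are UTF-8 (hypothetical sample data)
tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw("name\nProduct\u2122\n"), tmp)

# Read it back, marking the strings as UTF-8
data <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)
print(data)

# Inspect how R has marked the column
print(Encoding(data$name))

unlink(tmp)
```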

Abdullah Faqih
