0

I have the following data:

structure(list(QB5B_2 = structure("Car les GAFA sont des sociétés Américaines et de plus les gouvernements qui composent l'Union Européenne ne sont pas d'accord entre elles sur la stratégie à adopter en ce qui les concerne . Exemple les Gafa payent des impots en Irlande car leurs si<ef>", label = "test", format.spss = "A255", display_width = 0L)), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame"))

When I look at this data in RStudios View pane, it looks like proper French text:

View(problem) shows:

enter image description here

However, when looking at the data in the console it gives me:

# A tibble: 1 x 1
  QB5B_2                                                                                                                 
  <chr>                                                                                                                  
1 "Car les GAFA sont des soci\xc3\xa9t\xc3\xa9s Am\xc3\xa9ricaines et de plus les gouvernements qui composent l'Union Eu~

So it's clear there is some character encoding problem.

Now, when I try to export the file to Excel with:

library(writexl)
write_xlsx(problem, "test.xlsx")

it does the exporting but I can't open the file in Excel and instead get an error message that a problem has been encountered. Side note: I can import the Excel file perfectly fine with e.g. readxl::read_xlsx("test.xlsx")

So two questions:

  • How can I prevent these character issues in the first place? Ideally I wouldn't get these weird \xc3\ things in the data.
  • Is there any way to export the file so that it can be opened properly in Excel?
deschen
  • 10,012
  • 3
  • 27
  • 50
  • The \xc3 is part of the encoding of the UTF-8 characters. Do you have encoding set during the import of the file by `encoding = "UTF-8"`? [See here.](https://www.utf8-chartable.de/unicode-utf8-table.pl) – Mario Niepel Dec 09 '20 at 19:27
  • Good question. I'm colelcting my data through an API, but I guess teh default would be UTF-8 encoding, yes. But I always though UTF-8 is sth. good and shoudl be handled by all software tools, so I would expect Excel to be able to open such characters? – deschen Dec 09 '20 at 19:29
  • Actually after downloading the data through API via `httr:POST` I have the following line of code: `download_content <- content(download_request, encoding = "UTF-8")` – deschen Dec 09 '20 at 19:33
  • (but leaving the "encoding = UTF-8" part out, doesn't change anything in the data (just tried). – deschen Dec 09 '20 at 19:39
  • It looks to me like the problem doesn't lie with R. It takes the special character and properly does the UTF-8 encoding according to the data table of special characters. It is a question of how the data is exported to Excel. So the goal is not to 'get rid' of the weird characters/codes. They are correct. It's why does Excel not take it. I don't know much about Excel, but is there maybe an option to import UTF-8 specifically (unless that's the default as well)? – Mario Niepel Dec 09 '20 at 20:30
  • When copy-pasting this definition of `structure()` on my computer it displays correctly in the console, and I can save it to Excel which opens with the proper accents. That's on a US Windows. So I do think the problem is at the API and R steps, not Excel. – Alexlok Dec 09 '20 at 20:56
  • OK, with your structure, `Encoding(problem$QB5B_2)` is latin1, I can reproduce your problem with `Encoding(problem$QB5B_2) <- "UTF-8"`, and solve it back by setting `Encoding() <- "latin1"`. I suspect the text was actually Latin1, but the API is just pretending it's UTF-8 without converting. – Alexlok Dec 09 '20 at 21:00
  • There seems to be an issue with Excel and UTF-8 encoding. See here for more info: https://stackoverflow.com/questions/6002256/is-it-possible-to-force-excel-recognize-utf-8-csv-files-automatically – Mario Niepel Dec 09 '20 at 21:16

2 Answers2

1

I suspect the text is actually encoded as latin1, but the encoding is set to UTF-8. So R tries to read the latin1 as if it was UTF-8 and gets it wrong.

# by default, R used latin1
> Encoding(problem$QB5B_2)
[1] "latin1"

# in that case, no problem to display it
> problem
# A tibble: 1 x 1
  QB5B_2                                                                          
  <chr>                                                                           
1 Car les GAFA sont des sociétés Américaines et de plus les gouvernements qui com~

# But the API set it as UTF-8
> Encoding(problem$QB5B_2) <- "UTF-8"
> problem
# A tibble: 1 x 1
  QB5B_2                                                                          
  <chr>                                                                           
1 Car les GAFA sont des soci\xe9t\xe9s Am\xe9ricaines et de plus les gouvernemen~

# You just need to convert the encoding back
> Encoding(problem$QB5B_2) <- "latin1"
> problem
# A tibble: 1 x 1
  QB5B_2                                                                          
  <chr>                                                                           
1 Car les GAFA sont des sociétés Américaines et de plus les gouvernements qui com~

See also the first example in ?Encoding which is very similar. On a French computer, the locale would be set to latin1 and you can use enc2native().

Alexlok
  • 2,999
  • 15
  • 20
1

Something is quite strange since your input shows a double quote before the text, which usually does not happen when displaying the content of a character-column in a tibble. Look right after the "1":

# A tibble: 1 x 1
 QB5B_2                                                                                                                 
 <chr>                                                                                                                  
1 "Car les GAFA sont des soci\xc3\xa9t\xc3\xa9s Am\xc3\xa9ricaines et de plus les gouvernements qui composent l'Union Eu~

Perhaps a solution is to reencode the variable using iconv():

problem$QB5B_2 <- iconv(problem$QB5B_2, sub = "byte")
problem
# A tibble: 1 x 1
   QB5B_2                                                                                                                
   <chr>                                                                                                                 
1 Car les GAFA sont des sociétés Américaines et de plus les gouvernements qui composent l'Union Européenne ne sont pas …

Another would be to remove the first character:

problem$QB5B_2 <- str_remove(problem$QB5B_2, pattern = "$.")
problem
# A tibble: 1 x 1
   QB5B_2                                                                                                                
   <chr>                                                                                                                 
1 Car les GAFA sont des sociétés Américaines et de plus les gouvernements qui composent l'Union Européenne ne sont pas …

This does not show how to avoid the issue in the first place, but it should sort you out.

One difficulty in doing the debugging here is that dput(), which you probably used to replicate the content does not keep the problem...

  • Incredible, the second solution you provided did solve the problem for me. I'm still not sure how and where the issue occured, but I got it fixed at least for this example. – deschen Dec 13 '20 at 18:35