Modify encodings of accented characters in value labels

Question

I am having a very hard time with accented characters in a Stata file that I have to import into R. I solved one problem over here, but there is another.

After import, any time I use the look_for() function from the labelled package, I get this error.

remotes::install_github("sjkiss/cesdata")
library(cesdata)
data("ces19web")
library(labelled)
look_for(ces19web, "vote")

  invalid multibyte string at '<e9>bec Solidaire'

I can find one variable whose value labels include that text, but the label actually displays properly, so I don't know what is going on.

val_labels(ces19web$pes19_provvote)

But, there are other problematic value labels that cause other problems. For example, the value labels for the 13th variable cause this problem.

library(dplyr)

# This works fine
ces19web %>% 
  select(1:12) %>% 
  look_for(., "[a-z]")

# This chokes
ces19web %>% 
  select(1:13) %>% 
  look_for(., "[a-z]")

# See the accented character
val_labels(ces19web[,13])

I have come up with this way of replacing the accented characters of the second type.

names(val_labels(ces19web$cps19_imp_iss_party))<-iconv(names(val_labels(ces19web$cps19_imp_iss_party)), from="latin1", to="UTF-8")

And this even solves the problem for look_for()

#This now works!
ces19web %>% 
  select(1:13) %>% 
  look_for(., "[a-z]")

But what I need is a way to loop through the names of all the value labels and make this conversion for all the bungled accented characters.

This is so close, but I don't know how to save the results as the new names for the value labels.

library(purrr)

ces19web %>% 
  # Map over all the variables and get the value labels
  map(., val_labels) %>% 
  # Map over each set of value labels
  map(., ~{
    # Skip if there are no value labels
    if (!is.null(.x)) {
      # Otherwise convert the names as above
      names(.x) <- iconv(names(.x), from = "latin1", to = "UTF-8")
    }
  }) -> out
#Compare the 16th variable's value labels in the original
ces19web[,16]
#With the 16th set of value labels after the conversion function above
out[[16]]

But how do I make that conversion actually stick in the original dataset?

Thank you!

score 0 · Answer 1 · answered Jun 09 '22 at 07:01

I don't know if I understand your problem correctly (since the explanations are very verbose), but is it just a matter of reassigning the data frame?

library(magrittr)
ces19web %<>% #### REASSIGN THE DATAFRAME WITH THE %<>% OPERATOR
#map onto all the variables and get the value labels
  map(., val_labels) %>% 
#map onto each set of value labels
 map(., ~{
#Skip if there are no value labels
    if (!is.null(.x)){
#If not convert the names as above 
names(.x)<-iconv(names(.x), from="latin1", to="UTF-8")
}
    }) ->out
#Compare the 16th variable's value labels in the original
ces19web[,16]
#With the 16th set of value labels after the conversion function above
out[[16]]

No, it's more a matter of transforming the value labels *in place* so that the improperly encoded accented characters are replaced. — spindoctor, Jun 09 '22 at 13:49

score 0 · Accepted Answer · 2022-06-17T15:47:37.013

There is a problem with the character variables: every string's encoding is marked as either "unknown" (i.e. no non-ASCII characters) or UTF-8, yet some of the strings are really latin1. For instance, 0xe9 is the latin1 encoding of "é".
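To see the mismatch concretely, here is a minimal base-R sketch. The string `x` is made up for illustration (it mimics the label from the error message), and the comments assume a UTF-8 session.

```r
# "é" stored as the single latin1 byte 0xe9
x <- "Qu\xe9bec Solidaire"

Encoding(x)    # the declared encoding of the string
validUTF8(x)   # FALSE: a lone 0xe9 byte is not valid UTF-8

# Re-encode, treating the bytes as latin1
y <- iconv(x, from = "latin1", to = "UTF-8")

validUTF8(y)        # TRUE
charToRaw(y)[3:4]   # c3 a9, the two-byte UTF-8 encoding of "é"
```

Functions that work on multibyte strings can fail on `x` but not on `y`, which matches the "invalid multibyte string" error in the question.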

Assuming all character variables are actually latin1, you can do this:

ces19web_corr <- as.data.frame(lapply(ces19web, function(v) {
  if (is.character(v)) {
    Encoding(v) <- "latin1"
    v <- iconv(v, from = "latin1", to = "UTF-8")
  } else if (is.factor(v)) {
    lev <- levels(v)
    Encoding(lev) <- "latin1"
    lev <- iconv(lev, from = "latin1", to = "UTF-8")
    levels(v) <- lev
  }
  v
}))
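
The snippet above covers character and factor columns. For value labels, a similar conversion can be applied to the label names. The following is only a sketch under two assumptions: the columns are haven-style labelled vectors, which keep their value labels in the `labels` attribute (what `val_labels()` reads), and any label name that is not valid UTF-8 is really latin1. `to_utf8()`, `fix_label_encoding()` and the `toy` data frame are hypothetical names used for illustration; they are not part of labelled.

```r
# Convert only the strings that are not valid UTF-8, assuming they are latin1
to_utf8 <- function(x) {
  bad <- !is.na(x) & !validUTF8(x)
  x[bad] <- iconv(x[bad], from = "latin1", to = "UTF-8")
  x
}

# Re-encode the names of a variable's value labels, if it has any
fix_label_encoding <- function(v) {
  labs <- attr(v, "labels", exact = TRUE)
  if (!is.null(labs) && !is.null(names(labs))) {
    names(labs) <- to_utf8(names(labs))
    attr(v, "labels") <- labs
  }
  v
}

# Toy data: one label name holds a raw latin1 byte, one is already UTF-8
labs <- c(1, 2)
names(labs) <- c("Qu\xe9bec Solidaire", "Bloc Qu\u00e9b\u00e9cois")
toy <- data.frame(id = 1:2)
toy$vote <- structure(c(1, 2), labels = labs)

# Apply to every column; assigning into toy[] keeps the data frame itself
toy[] <- lapply(toy, fix_label_encoding)
names(attr(toy$vote, "labels"))
```

With the real data, the same assignment would be `ces19web[] <- lapply(ces19web, fix_label_encoding)`. Only names that fail `validUTF8()` are converted, so labels that already display correctly are not double-encoded.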

Alternatively, if only some of them have the problem, you will have to select which ones to fix.
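
One way to pick them out, sketched in base R, is to flag the variables whose value-label names contain bytes that are not valid UTF-8. This assumes haven-style labelled columns (labels stored in the `labels` attribute); `has_bad_labels()` and `toy` are hypothetical names used for illustration.

```r
# TRUE if any value-label name of v is not valid UTF-8
has_bad_labels <- function(v) {
  nms <- names(attr(v, "labels", exact = TRUE))
  !is.null(nms) && any(!is.na(nms) & !validUTF8(nms))
}

# Toy data: the "vote" column has one label name with a raw latin1 byte
labs <- c(1, 2)
names(labs) <- c("Qu\xe9bec Solidaire", "Other")
toy <- data.frame(id = 1:2)
toy$vote <- structure(c(1, 2), labels = labs)

# Names of the variables that need fixing
bad_vars <- names(toy)[vapply(toy, has_bad_labels, logical(1))]
bad_vars
```

On the real data, `vapply(ces19web, has_bad_labels, logical(1))` would give one TRUE/FALSE per variable, which can then be used to restrict the conversion to the affected columns.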

Side comment: it might be that you applied my fix from the other post to a data file (or some of its columns) which doesn't have the problem described in your other question. In that case you accidentally forced the wrong encoding, and the code above is just forcing back the right one.

edited Jun 17 '22 at 15:47

answered Jun 13 '22 at 17:53

It doesn't look like this will work on the value labels. – spindoctor Jun 17 '22 at 14:17
@spindoctor A variable with value labels (in Stata) is equivalent in R to a factor: it's stored basically as a vector of integers, with levels. It's not difficult to get the levels and modify them. However, your dataframe does not have any factor variables: if there were any in the Stata file, they were imported as character variables. Of course you can factor them afterwards, but I would suggest you apply the changes first. – Jun 17 '22 at 15:26
@spindoctor Anyway, I changed the code snippet accordingly (not tested for factors though). I also fixed a problem which was probably due to a copy/paste from the wrong piece of code. – Jun 17 '22 at 15:43
