
My dataset contains a lot of surnames. These surnames are written with umlauts as well as other special characters (such as č, á, ñ, etc.).

By reading the data in the following way (using `encoding = "latin1"`), I managed to display the umlauts properly:

library(readr)  # read_delim() and locale() come from readr

read_data <- function(directory, debug = FALSE){
  # "\\.csv$" is a proper regular expression; list.files() does not accept globs
  file_list <- list.files(path = directory,
                          pattern = "\\.csv$",
                          full.names = TRUE)

  df_read <- data.frame()

  for (filename in file_list){
    df_temp <- read_delim(filename,
                          delim = ";",
                          locale = locale(encoding = "latin1"))

    if (debug){
      # collapse dim() into a single string instead of printing a vector
      print(paste(filename, ":", paste(dim(df_temp), collapse = " x ")))
    }

    df_read <- rbind(df_read, df_temp)
  }

  names(df_read) <- make.names(names(df_read))

  return(df_read)
}

Unfortunately, I cannot display the other special characters properly. Is there another encoding I can use, or another way to read in my CSV files so that all special characters are preserved?
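
For reference, the comments below suggest first checking what encoding the files actually use. A minimal sketch using readr's `guess_encoding()` (the filename is a placeholder for one of my files):

library(readr)

# guess_encoding() returns a table of candidate encodings with confidence scores
guess_encoding("surnames.csv")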

R-User
  • Why don't you use `encoding = "UTF-8"`? – phiver Jan 06 '20 at 14:24
  • You need to know the original encoding of the data – Bruno Jan 06 '20 at 14:32
  • What exactly do you mean by `display the other special characters in a proper way`? How are you displaying them? It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Are you absolutely sure the data is stored in a latin1 encoding? Where did the data come from? – MrFlick Jan 06 '20 at 16:02
  • @phiver Because then I lose the umlauts again. – R-User Jan 06 '20 at 16:17
  • @Bruno: I converted the Excel files to (comma-delimited) csv files. If I choose the encoding "UTF-8", both the umlauts and the other special characters are displayed as "?" or as empty white boxes. If I choose the encoding "latin1", only the other special characters are displayed that way. I read that CSV files exported from Excel are encoded in latin1, but that does not help me solve my problem. – R-User Jan 06 '20 at 16:47
  • Maybe read the xlsx if you can – Bruno Jan 06 '20 at 16:52
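
Following Bruno's suggestion, a minimal sketch that reads the original .xlsx files directly with the readxl package (assuming the Excel files are still available); readxl always returns UTF-8 encoded strings, so no encoding has to be guessed:

library(readxl)

read_xlsx_data <- function(directory){
  file_list <- list.files(path = directory,
                          pattern = "\\.xlsx$",
                          full.names = TRUE)

  # read_excel() returns UTF-8 character columns, sidestepping the CSV encoding issue
  df_list <- lapply(file_list, read_excel)
  do.call(rbind, df_list)
}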

1 Answer


In the meantime, I have tried many different ways to solve my encoding problem. The best result so far comes from the following read-in function:

read_data <- function(directory, debug = FALSE){
  # "\\.csv$" is a proper regular expression; list.files() does not accept globs
  file_list <- list.files(path = directory,
                          pattern = "\\.csv$",
                          full.names = TRUE)

  df_read <- data.frame()

  for (filename in file_list){
    # base read.csv with the input marked as UTF-16LE
    df_temp <- read.csv(filename,
                        encoding = "UTF-16LE",
                        sep = ";",
                        header = TRUE)

    if (debug){
      # collapse dim() into a single string instead of printing a vector
      print(paste(filename, ":", paste(dim(df_temp), collapse = " x ")))
    }

    df_read <- rbind(df_read, df_temp)
  }

  names(df_read) <- make.names(names(df_read))

  return(df_read)
}

There is still one special character that is displayed as "?", but the rest of the problem was solved by using "read.csv" instead of "read_delim" and by using the encoding "UTF-16LE".
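
If that last character still comes out as "?", one more thing that may be worth trying is `fileEncoding` instead of `encoding`: per the `read.table` documentation, `fileEncoding` re-encodes the file as it is read, whereas `encoding` only marks already-read strings (and only "latin1" and "UTF-8" are meaningful marks). A sketch of that variant inside the loop above:

# fileEncoding converts the file from UTF-16LE while reading it
df_temp <- read.csv(filename,
                    fileEncoding = "UTF-16LE",
                    sep = ";",
                    header = TRUE)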

R-User