Several questions on Stack Overflow have dealt with the "invalid multibyte string" error in R, which is triggered when certain string-handling functions receive strings whose Encoding is not set correctly. See stackoverflow.com/questions/14363085/invalid-multibyte-string-in-read-csv or stackoverflow.com/questions/4993837/r-invalid-multibyte-string for answers on how to set Encoding.

My question is: how do I detect where the problem is? Visual inspection may not be enough when the bad Encoding affects only a few elements of a long vector, or a few columns of a large data frame.

jafelds

2 Answers

This simple routine tests for the error by calling a base function, toupper(), that triggers it:

has.invalid.multibyte.string <- function(x, return.elements = FALSE)
{
      # Determine whether the "invalid multibyte string" error will be triggered.
      # If return.elements = TRUE, the output is a logical vector along x;
      # otherwise it is a single logical.
      if (is.null(x))
            return(FALSE)
      if (return.elements)
      {
            out <- rep(FALSE, length(x))
            for (i in seq_along(x))   # seq_along() is safe when x has length 0
                  out[i] <- is.error(try(toupper(x[i]), silent = TRUE))
      }
      else
            out <- is.error(try(toupper(x), silent = TRUE))
      return(out)
}

is.error <- function(x)
{
      # TRUE if x is the result of a failed try()
      return(inherits(x, "try-error"))
}

Example (note the iconv() statement that "corrects" the Encoding):

> a1 <- c("Restaurant","Caf\xe9","Bar")
> a2 <- iconv(a1,from="ISO-8859-1")
> a1
[1] "Restaurant" "Caf\xe9"    "Bar" 
> a2
[1] "Restaurant" "Café"       "Bar"       
> Encoding(a1)
[1] "unknown" "unknown" "unknown"
> Encoding(a2)
[1] "unknown" "UTF-8"   "unknown"
> has.invalid.multibyte.string(a1)
[1] TRUE
> has.invalid.multibyte.string(a2)
[1] FALSE
> has.invalid.multibyte.string(a1,return.elements = T)
[1] FALSE  TRUE FALSE
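
To cover the data frame case from the question, the same element-wise idea can be applied to every character column. This is only a sketch: is_invalid_multibyte() and find_invalid_in_df() are hypothetical helper names, not base R functions, and whether toupper() raises the error for an unmarked string such as "Caf\xe9" depends on the session locale (it typically does in a UTF-8 locale).

```r
# Hypothetical helper: TRUE for each element on which toupper() raises an error.
is_invalid_multibyte <- function(x) {
  vapply(
    as.character(x),
    function(el) tryCatch({ toupper(el); FALSE }, error = function(e) TRUE),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Hypothetical helper: for each character column, the row indices of the
# offending elements. Columns without problems are dropped from the result.
find_invalid_in_df <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  hits <- lapply(df[is_chr], function(col) which(is_invalid_multibyte(col)))
  hits[lengths(hits) > 0]
}

# Sample data (made up for illustration).
df <- data.frame(
  name = c("Restaurant", "Caf\xe9", "Bar"),
  city = c("Paris", "Lyon", "Nice"),
  stringsAsFactors = FALSE
)
find_invalid_in_df(df)
```

The result is a named list keyed by column name, so an empty list means no character column triggered the error in the current locale.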
jafelds

Since R version 3.3.0 (released May 2016), base R has included the function validEnc(), which returns a logical vector indicating whether each element of a character vector is validly encoded. For example:

validEnc(c("Caf\xe9", "Café"))
# [1] FALSE  TRUE
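
Combined with which(), this gives the positions of the bad elements, which can then be re-encoded on their own. The sketch below makes two assumptions that are not part of the answer above: the strings are explicitly marked as UTF-8 so that the result does not depend on the session locale, and the offending bytes are assumed to be ISO-8859-1.

```r
# Sample vector; marking it as UTF-8 is an assumption made for this sketch
# so that validEnc() checks the bytes against UTF-8 in any locale.
x <- c("Restaurant", "Caf\xe9", "Bar")
Encoding(x) <- "UTF-8"

# Positions of the elements whose bytes are not valid in their encoding.
bad <- which(!validEnc(x))

# Assumption: the invalid elements are really ISO-8859-1, so convert only
# those elements to UTF-8 and leave the rest untouched.
fixed <- x
fixed[bad] <- iconv(x[bad], from = "ISO-8859-1", to = "UTF-8")

validEnc(fixed)
```

For a data frame, the same which(!validEnc(col)) call can be run on each character column, for example with lapply().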
ianmcook