1

Create document-term matrix

dtm <- DocumentTermMatrix(docs, control = params)

Error in nchar(rownames(m)) : invalid multibyte string, element 1

Does anyone know how to tackle this error? I'm working in RStudio.

Allan Cameron
  • 1
    Please read how to create a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Conor Neilson Mar 29 '20 at 03:58

3 Answers

4
Sys.setlocale("LC_ALL", "C")

Run this in RStudio. It resets the locale to C, and it has worked for me many times.
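Since changing the locale affects the whole session, it may be worth restoring it afterwards. Below is a minimal sketch; `with_c_locale` is a hypothetical helper name, and it assumes that switching only the `LC_CTYPE` category (rather than `LC_ALL` as above) is enough for the failing step, which may not hold in every case.

```r
# Hypothetical helper: evaluate an expression under the C locale,
# then restore the previous LC_CTYPE setting.
with_c_locale <- function(expr) {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the original locale even if expr throws an error
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", "C")
  # expr is a lazy promise, so it is evaluated here, under the C locale
  force(expr)
}

# Demonstration: the expression sees the C locale while it runs
locale_before <- Sys.getlocale("LC_CTYPE")
locale_inside <- with_c_locale(Sys.getlocale("LC_CTYPE"))
locale_after  <- Sys.getlocale("LC_CTYPE")
```

In the question's setting, the call would look like `dtm <- with_c_locale(DocumentTermMatrix(docs, control = params))`.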

  • I despise it because of this error, but it works for me. Thank you very much. – akunyer Apr 27 '22 at 20:20
2

This happens when your input text isn't UTF-8 encoded. You can read about character encoding here.

Another good reference is this

I've found that the best way to handle these issues is to use stringr::str_conv.

mydocs <- c("doc1", "doc2", "doc3")

# Assign the result back, otherwise the converted vector is discarded
mydocs <- stringr::str_conv(mydocs, "UTF-8")

If the vector contains bytes that are not valid UTF-8, you'll get a warning, but the character vector that comes out the other side will be usable.

Do that to your docs vector before calling `DocumentTermMatrix`.
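To see which documents are affected before converting anything, base R's `validUTF8()` can flag them. This is a rough sketch using base R only, not the `stringr` call above; `docs_raw` is made-up sample data, and the sketch assumes the offending bytes are Latin-1, which you would need to confirm for your own input.

```r
# Made-up sample: the second element ends in byte 0xE9,
# which is "e acute" in Latin-1 but not valid on its own in UTF-8.
docs_raw <- c("plain ascii", rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9))))

# Flag the elements that are not valid UTF-8
bad <- !validUTF8(docs_raw)

# Re-encode only the flagged elements (assumption: they are Latin-1)
docs_clean <- docs_raw
docs_clean[bad] <- iconv(docs_raw[bad], from = "latin1", to = "UTF-8")

# Every element should now pass the validity check
all_valid <- all(validUTF8(docs_clean))
```

If the source encoding is unknown, passing `sub = ""` to `iconv()` drops the unconvertible bytes instead of returning `NA`.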

Tommy Jones
0

I encountered this error while trying to write a data frame to a SQL Server table. This function helped me: I used it to convert every character column to UTF-8, dropping any characters that can't be converted, before writing the data frame to the server. It's built off another post, linked in the code below.

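# Create a function to convert all character columns to UTF-8 encoding,
# dropping any characters that can't be converted.
# Note: this assumes the input is WINDOWS-1252 encoded.
df_convert_utf8 <- function(df_data) {

  # Find the character columns once
  is_chr <- vapply(df_data, is.character, logical(1))

  # Convert each character column to UTF-8; sub = "" drops unconvertible bytes.
  # lapply (rather than sapply) keeps the result a list of columns, so this
  # also works for single-row data frames.
  # Source: https://stackoverflow.com/questions/54633054/dbidbwritetable-invalid-multibyte-string
  df_data[is_chr] <- lapply(
    df_data[is_chr],
    iconv, from = "WINDOWS-1252", to = "UTF-8", sub = "")

  df_data
}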

Example usage:

  # Convert all character strings to UTF8, removing any characters we can't use
  df_chunk <- df_convert_utf8(df_chunk)
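Before converting, it can help to know which columns actually contain invalid bytes. The following is a small sketch of one way to check; `find_invalid_utf8` and `df_demo` are hypothetical names invented for illustration and are not part of the function above.

```r
# Hypothetical helper: for each character column, report whether it
# contains at least one string that is not valid UTF-8.
find_invalid_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  vapply(df[is_chr], function(col) any(!validUTF8(col)), logical(1))
}

# Made-up data frame: the second name ends in a stray Latin-1 byte (0xE9)
df_demo <- data.frame(id = 1:2)
df_demo$name <- c("ok", rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9))))

# Named logical vector, one entry per character column
invalid_cols <- find_invalid_utf8(df_demo)
```

Columns flagged `TRUE` are the ones worth inspecting before deciding which source encoding to pass to `iconv`.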
Ryan Bradley