Create document-term matrix
dtm <- DocumentTermMatrix(docs, control = params)
Error in nchar(rownames(m)) : invalid multibyte string, element 1
Does anyone know how to tackle this error? I'm working in RStudio.
Sys.setlocale('LC_ALL', 'C')
Run this code in RStudio. It resets the locale, and it has worked for me many times.
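If changing every locale category for the whole session feels too broad, a narrower variant is sketched below. It assumes that only the character-type category (`LC_CTYPE`) matters for this error, which is an assumption rather than something confirmed in this thread, and it restores the previous setting afterwards.

```r
# Remember the current character-type locale so it can be restored later
old_ctype <- Sys.getlocale("LC_CTYPE")

# Switch only LC_CTYPE to the "C" locale (assumption: this is the category
# that matters for the multibyte string check)
Sys.setlocale("LC_CTYPE", "C")

# ... run the failing call here, e.g. DocumentTermMatrix(docs, control = params) ...

# Put the original locale back
Sys.setlocale("LC_CTYPE", old_ctype)
```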
This happens when your input text isn't UTF-8 encoded. You can read about character encoding here.
Another good reference is this
I've found that the best way to handle these issues is to use stringr::str_conv.
mydocs <- c("doc1", "doc2", "doc3")
stringr::str_conv(mydocs, "UTF-8")
Where you have non-UTF-8 characters, you'll get a warning, but the character vector that comes out the other side will be usable.
Do that to your docs vector before calling `DocumentTermMatrix`.
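To see which documents are actually affected before converting anything, a base R check along these lines may help. This is a minimal sketch: `raw_docs` is made-up sample data, and the choice of Latin-1 as the source encoding is an assumption you would replace with whatever encoding your files really use.

```r
# Hypothetical input: the second string ends in byte 0xE9 ("é" in Latin-1),
# which is not valid UTF-8 on its own
raw_docs <- c("plain ascii", rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9))))

# validUTF8() is part of base R (3.3.0 and later)
bad <- !validUTF8(raw_docs)

# Assumption: the offending strings are Latin-1; substitute the real encoding
clean_docs <- raw_docs
clean_docs[bad] <- iconv(raw_docs[bad], from = "latin1", to = "UTF-8")

# Every element should now pass the UTF-8 check
all(validUTF8(clean_docs))
```

Re-encoding only the flagged elements leaves strings that are already valid UTF-8 untouched.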
I encountered this error while trying to write a data frame to a SQL Server table. This function helped me: I used it to remove all non-UTF-8 characters from a data frame before writing it to the server. It's built off another post, linked below.
# Create a function to convert all character columns to UTF-8 encoding,
# dropping any characters that can't be converted.
# Note: this assumes the input text is encoded as WINDOWS-1252.
df_convert_utf8 <- function(df_data) {
  # Source: https://stackoverflow.com/questions/54633054/dbidbwritetable-invalid-multibyte-string
  # Find the character columns
  is_chr <- vapply(df_data, is.character, logical(1))
  # lapply returns a list of columns, avoiding sapply's simplification to a matrix
  df_data[is_chr] <- lapply(
    df_data[is_chr],
    iconv, from = "WINDOWS-1252", to = "UTF-8", sub = "")
  return(df_data)
}
Example usage:
# Convert all character strings to UTF8, removing any characters we can't use
df_chunk <- df_convert_utf8(df_chunk)
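To confirm whether a data frame still contains problem strings, before or after conversion, a small check like the one below could be used. It is only a sketch: `count_invalid_utf8` and `df_demo` are hypothetical names made up for this illustration, not part of any package.

```r
# Hypothetical helper: for each character column, count the strings
# that are not valid UTF-8
count_invalid_utf8 <- function(df_data) {
  is_chr <- vapply(df_data, is.character, logical(1))
  vapply(df_data[is_chr], function(col) sum(!validUTF8(col)), integer(1))
}

# Made-up sample data: byte 0x92 is a curly apostrophe in Windows-1252,
# but an invalid sequence in UTF-8
df_demo <- data.frame(
  id = 1:2,
  note = c("ok", rawToChar(as.raw(c(0x69, 0x74, 0x92, 0x73)))),
  stringsAsFactors = FALSE
)

invalid_counts <- count_invalid_utf8(df_demo)
```

A non-zero count for a column means it still holds strings that would trigger the multibyte error.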