
Similar issues have been discussed on this forum (e.g. here and here), but I have not found one that solves my problem, so I apologize for asking a seemingly similar question.

I have a set of .txt files with UTF-8 encoding (see the screenshot). I am trying to run a topic model in R using the tm package. However, despite using encoding = "UTF-8" when creating the corpus, I get obvious problems with encoding. For instance, I get < U+FB01 >scal instead of fiscal and in< U+FB02 >uenc instead of influence, not all punctuation is removed, and some characters are unrecognizable (e.g. quotation marks are still attached in some cases, like view” or plan’, letters like ændring, orphaned quotation marks like “ and ”, or terms like zit or years—thus, i.e. words joined by a dash that should have been removed). These terms also show up in the topic distribution over terms. I had problems with encoding before, but using encoding = "UTF-8" when creating the corpus used to solve them. It seems like that does not help this time.
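To illustrate, here is a quick check I can run on one of the files (just a sketch; the file name is only an example) to confirm that the ligature characters really are in the text:

# look for the "fi" (U+FB01) and "fl" (U+FB02) ligatures in the raw text
txt <- readLines("c:/txtfiles/example.txt", encoding = "UTF-8", warn = FALSE)
grep("\ufb01|\ufb02", txt, value = TRUE)  # lines that contain a ligature
utf8ToInt("\ufb01")                       # 64257, i.e. U+FB01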

I am on Windows 10 x64, with R version 3.6.0 (2019-04-26) and version 0.7-7 of the tm package (all up to date). I would greatly appreciate any advice on how to address the problem.

library(tm)
library(beepr)
library(ggplot2)
library(topicmodels)
library(wordcloud)
library(reshape2)
library(dplyr)
library(tidytext)
library(scales)
library(ggthemes)
library(ggrepel)
library(tidyr)


inputdir<-"c:/txtfiles/"
docs<- VCorpus(DirSource(directory = inputdir, encoding ="UTF-8"))

#Preprocessing
docs <-tm_map(docs,content_transformer(tolower))

removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
docs <- tm_map(docs, content_transformer(removeURL))

toSpace <- content_transformer(function(x, pattern) (gsub(pattern, " ", x)))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "\\.")
docs <- tm_map(docs, toSpace, "\\-")


docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs,stemDocument)

dtm <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))
ord <- order(freq, decreasing=TRUE)
write.csv(freq[ord],file=paste("word_freq.csv"))

#Topic model
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))
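(k and the control parameters above are defined earlier in my script; to make the call self-contained, placeholder values along these lines would do, though they are not my actual settings:)

# placeholder values only, so the LDA call above is runnable
k      <- 10                  # number of topics
nstart <- 5                   # number of restarts
seed   <- list(1, 2, 3, 4, 5) # one seed per restart
best   <- TRUE                # keep only the best of the nstart runs
burnin <- 1000
iter   <- 2000
thin   <- 500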

Edit: I should add, in case it turns out to be relevant, that the txt files were created from PDFs using the following R code:

inputdir <-"c:/pdf/"
myfiles <- list.files(path = inputdir, pattern = "pdf",  full.names = TRUE)
lapply(myfiles, function(i) system(paste('"C:/Users/Delt/AppData/Local/Programs/MiKTeX 2.9/miktex/bin/x64/pdftotext.exe"',
                                         paste0('"', i, '"')), wait = FALSE) )
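If it helps, pdftotext also has an -enc option to force the output encoding; the same call with that flag added (paths unchanged) would look like this, though I have not verified whether it changes anything:

# same conversion as above, but explicitly asking pdftotext for UTF-8 output
lapply(myfiles, function(i) system(paste(
  '"C:/Users/Delt/AppData/Local/Programs/MiKTeX 2.9/miktex/bin/x64/pdftotext.exe"',
  '-enc UTF-8',
  paste0('"', i, '"')), wait = FALSE))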

Two sample txt files can be downloaded here.

[Screenshot: the txt files have UTF-8 encoding]

Michael
  • Please create 2 example txt files with the offending issues and add them to GitHub or some other sharing place; right now it is just guessing. If you are not stuck on using tm to get the data into R, the readtext package might help to read the data in correctly. Any other package that ensures the encoding is correct might also do the trick. – phiver Apr 27 '20 at 16:55
  • @phiver thank you for your comment. I added two txt files that exhibit most of the offending issues I described. Unless there is no way around this, I would prefer to find a simple solution using *tm* package. I would very much appreciate any advice. – Michael Apr 28 '20 at 02:52
  • I have a feeling it has something to do with the pdf reader you use. < U+FB01 >scal, which should be fiscal, is probably not interpreted correctly by the scan; it returns < U+FB01 >scal. Note that the f and i are not separate letters but one combined character, namely an orthographic ligature like æ. What happens if you use the package pdftools to read in the pdfs? You can use pdftools inside tm to read pdfs directly, or do it separately first to investigate whether it works correctly. – phiver Apr 28 '20 at 13:34
  • @phiver thank you. I tried this code `text <- pdf_text("c:/txt/1.pdf"); write(text, "1.txt")` and the txts have the same problem. Now, though, even in the txts *fiscal* shows as *< U+FB01 >scal*, and when read into the corpus it becomes *ufbscal*. – Michael Apr 29 '20 at 01:39

1 Answer


I found a workaround that seems to work correctly on the two example files that you supplied. What you need to do first is apply NFKD (compatibility decomposition). This splits the "fi" orthographic ligature into f and i. Luckily the stringi package can handle this. So before doing all the special text cleaning, you need to apply the function stringi::stri_trans_nfkd. You can do this in the preprocessing step just after (or just before) the tolower step.

Do read the documentation for this function and the references.
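To see what the transform does, here is a quick illustration on one of the affected tokens:

# NFKD decomposes the "fi" ligature U+FB01 into plain "f" + "i"
stringi::stri_trans_nfkd("\ufb01scal")
# [1] "fiscal"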

library(tm)
docs<- VCorpus(DirSource(directory = inputdir, encoding ="UTF-8"))

#Preprocessing
docs <-tm_map(docs,content_transformer(tolower))

# use stringi to fix all the orthographic ligature issues 
docs <- tm_map(docs, content_transformer(stringi::stri_trans_nfkd))

toSpace <- content_transformer(function(x, pattern) (gsub(pattern, " ", x)))

# add following line as well to remove special quotes. 
# this uses a replace from textclean to replace the weird quotes 
# which later get removed with removePunctuation
docs <- tm_map(docs, content_transformer(textclean::replace_curly_quote))

# ... rest of the preprocessing and topic modelling steps as in the question ...
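For illustration, this is roughly what the textclean step does (a small made-up example, not taken from your files):

# curly quotes become plain ASCII quotes, which removePunctuation can then strip
textclean::replace_curly_quote("\u201cview\u201d and plan\u2019s")
# [1] "\"view\" and plan's"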
phiver
  • Thank you so much for your time and effort. I have tested the code and it does show the words properly. However, one issue persists (in a way). The code for removing special quotes `docs <- tm_map(docs, toSpace, "“")` and `docs <- tm_map(docs, toSpace, "‘")` works well, but when I close and reopen the R script the curly quotes become straight ones, i.e. `docs <- tm_map(docs, toSpace, """)` and `docs <- tm_map(docs, toSpace, "'")`, and this code fails to remove the special quotes. Is there now an encoding issue with the R script itself? I am not sure why it transforms the quotation marks after reopening. – Michael May 01 '20 at 21:04
  • @Michael, that happens when saving the file on a Windows machine; the script encoding gets set to the default. I changed the two lines into one line with a command from textclean so you avoid the encoding issue. – phiver May 02 '20 at 10:21
  • Thanks a lot, it worked out beautifully. I also read the textclean manual and found that `replace_non_ascii` in my case gets rid of the weird dashes, dots, etc. I greatly appreciate your help! – Michael May 02 '20 at 14:25