
Long-time listener, first-time caller. I have a quick R script to calculate word frequency in a set of Cyrillic text files encoded as UTF-8. This runs great on macOS/Linux, but on Windows, R seems to be reading them in as ANSI, so the resulting data frame/CSV has nonsense characters. What can I do to force R to read the files in as UTF-8 and output Cyrillic in the CSV?

```r
library(stringi)
library(tokenizers)
library(tm)
library(stringr)
library(dplyr)
library(readr)
options(warn = -1)

df = NULL

filenames <- list.files(choose.dir(default = "", caption = "Select folder containing text files"), pattern="*.txt", full.names=TRUE)

for (filename in filenames)
{
  conn <- file(filename,open="r")
  linn <-readLines(conn)

  for (i in 1:length(linn))
  {

    res = linn[i]
    res = str_replace_all(res, "[^[:alnum:]]", " ")
    res = str_replace_all(res, "[[a-zA-z0-9]]", " ")
    res = strsplit(res, " ")
    res_x = gsub("[[:blank:]]", "", res)

    tokens = tokenize_words(res_x)
    char_tokens = paste(unlist(tokens),collapse=" ")

    frequency_table = sort(table(unlist(strsplit(char_tokens, " "))), decreasing = T)
    words <- strsplit(char_tokens," ")
    words.freq <- table(unlist(words))
    word_frequency <- cbind.data.frame(names(words.freq),as.integer(words.freq))

    dataFrame <- word_frequency %>% as_tibble()
    df <- rbind(dataFrame,df)
  }
}

colnames(df) <- c("Word","Frequency")
df = df[order(-df$Frequency),]
df$Word <- gsub("[A-Za-z0-9\u00e9]+", "", df$Word)  #Strip numerals and Latin chars again

df_word = df %>% group_by(Word) %>% summarise(frequency = n())
df_word = df_word[order(df_word$Word),]
colnames(df_word) <- c("Word","Frequency")
df_word <- df_word[-1, ]


path = choose.dir(default = "", caption = "Select destination folder for wordlist") #Win
write_excel_csv2(df_word, file.path(path, "wordlist.csv"))
```

Sample input (as a UTF-8-encoded .txt):

Lorem ipsum е безсмислен, частично извлечен и умишлено видоизменен пасаж от текст на Цицерон, поради което визуално много приличащ на истински – заради разпределението и честотата на по-къси, средни и дълги думи, разпределението на интервалите и препинателните знаци, както и дължината на изреченията. Разнообразието му дава възможност чрез неговото копиране да се изпълни предвиденото пространство, да не се получат шарки от сорта на моарето вследствие еднотипно редуване на думи и интервали. „Налят“ по този начин в графичните блокове, Lorem ipsum позволява на окото да се абстрахира от конкретиката на смисления текст, и да се съсредоточи само върху особеностите на използвания шрифт и неговото визуално въздействие върху читателя.

Expected output is a data frame with Cyrillic words in column 1 and frequencies in column 2. Actual output is nonsense characters in column 1. Adding `encoding="UTF-8"` to `conn <- file(filename,open="r")` produces an empty data frame/CSV.

  • Specify the correct encoding when calling `file`. It's very hard to help without a specific [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input that we can use to run and test ourselves. If the problem is just with the `file/readLines` part, then all the other code in your question seems unnecessary and irrelevant to the question. – MrFlick Apr 04 '22 at 21:43
  • If the file is already UTF-8 encoded, then you should add the encoding argument to `readLines()`, not `file()`. – Ritchie Sacramento Apr 04 '22 at 21:45
  • @RitchieSacramento Thanks! `linn <-readLines(conn,encoding="UTF-8")` seems to produce an empty `df_word` data frame, just like `conn <- file(filename,open="r",encoding="UTF-8")`. – codexabrogans Apr 04 '22 at 22:27
  • Just so there's no confusion, ensure the `file()` command does **not** contain an encoding argument, it should be in `readLines()` only. As an aside, `readLines()` will take a file name and handle the connection automatically, so you don't need to specify this manually unless you are trying to change the parameters of the connection. If it still fails you can further try `readr::read_lines()`. – Ritchie Sacramento Apr 04 '22 at 22:36
  • @RitchieSacramento Thanks! Yep, no `encoding` in `file()`, but still getting an empty df with both `readLines()` and `read_lines()`... – codexabrogans Apr 04 '22 at 22:47
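
For anyone landing here, a minimal sketch of the approach suggested in the comments: pass `encoding = "UTF-8"` to `readLines()` (which only marks the strings as UTF-8, it does not re-encode them) and then pull out Cyrillic words directly with stringi. The helper name `count_cyrillic_words` is hypothetical and not part of the original script, and whether this resolves the empty-data-frame symptom on a given Windows/R combination is an assumption, not something confirmed in this thread. It also avoids calling `gsub()` on the list returned by `strsplit()`, which coerces the list to character by deparsing it.

```r
library(stringi)

# Hypothetical helper (not from the original script): count Cyrillic words
# across a set of UTF-8 encoded text files.
count_cyrillic_words <- function(filenames) {
  # encoding = "UTF-8" marks the strings as UTF-8; it does not re-encode them.
  lines <- unlist(lapply(filenames, readLines, encoding = "UTF-8", warn = FALSE))

  # Extract runs of Cyrillic letters instead of deleting everything else.
  words <- unlist(stri_extract_all_regex(lines, "\\p{Cyrillic}+",
                                         omit_no_match = TRUE))
  words <- stri_trans_tolower(words)

  # Tabulate once over all files, then sort by descending frequency.
  freq <- table(words)
  out <- data.frame(Word = names(freq),
                    Frequency = as.integer(freq),
                    stringsAsFactors = FALSE)
  out[order(-out$Frequency, out$Word), ]
}
```

The result could then be written out with `write_excel_csv2()` as in the question, e.g. `write_excel_csv2(count_cyrillic_words(filenames), file.path(path, "wordlist.csv"))`.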
