I am using R on Windows 10 x64. I am trying to read a set of txt files into R to do text analysis. I am using the following code:

library(tm)  # provides DirSource() and VCorpus()
setwd(inputdir)
files <- DirSource(directory = inputdir, encoding = "UTF-8")
docs <- VCorpus(x = files)
writeLines(as.character(docs[[2]]))

The last line is intended to show the content of document #2, but it prints nothing (the same happens for every other document in the set), and I am not sure why. I checked the encoding of the txt files (open one, then choose "Save As"), and it is "Unicode". When I manually save any of the files as "ANSI", writeLines(as.character(docs[[2]])) prints the proper content, so I thought I should convert all files to ANSI. How can I do that in R for all txt files in my "inputdir"?

Michael
  • You could try `iconv` (see https://stackoverflow.com/questions/7481799/convert-a-file-encoding-using-r-ansi-to-utf-8) and loop it over all txt files (as in https://stackoverflow.com/questions/14958516/looping-through-all-files-in-directory-in-r-applying-multiple-commands). – mischva11 Jun 09 '18 at 23:07
  • @mischva11 thank you! I tried this code `lapply(files, writeLines(iconv(readLines(files), from = "UTF8", to = "ANSI_X3.4-1986")))` and I got this error: `Error in readLines(files) : 'con' is not a connection`. What am I doing wrong? – Michael Jun 09 '18 at 23:20
  • It seems that `lapply` does not pass the file parameter correctly. I tried it with a for loop and it works fine. I'm also not sure why I have to split the loop body into single steps, but when I write it as one nested call it wipes the data in the txt files. I'll post my for loop as an answer, since it doesn't fit in the comment section. – mischva11 Jun 10 '18 at 00:19
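
For reference, the `'con' is not a connection` error is consistent with two things: `files` in the question is a `DirSource` object rather than a character vector of paths, and `lapply` expects a function as its second argument, not an already evaluated call. Below is a minimal sketch of the per-file pattern, assuming the files really are UTF-8. The temporary directory and the names `demo_dir`, `paths`, and `convert_file` are illustrative and not from the original posts.

```r
# Illustrative setup: a temporary directory with two small text files.
demo_dir <- file.path(tempdir(), "lapply_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("first file", file.path(demo_dir, "a.txt"))
writeLines("second file", file.path(demo_dir, "b.txt"))

# A plain character vector of paths (not a tm DirSource object).
paths <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# lapply() needs a function; it is called once per path.
convert_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8")
  # sub = "?" substitutes bytes that cannot be converted instead of returning NA.
  converted <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "?")
  writeLines(converted, path)
  invisible(path)
}

invisible(lapply(paths, convert_file))
```

Reading the whole file into `lines` before calling `writeLines` on the same path avoids opening the file for writing while it is still being read, which may be what emptied the files in the nested version.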

1 Answer

Get all txt files:

files <- list.files(path = getwd(), pattern = "\\.txt$", full.names = TRUE, recursive = FALSE)

Loop over the files, convert the encoding, and overwrite each file:

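from_encoding <- "UTF-8"  # placeholder: set to the files' current encoding
to_encoding <- "ASCII"    # placeholder: set to the target encoding
for (i in seq_along(files)) {
  input <- readLines(files[i])
  converted_input <- iconv(input, from = from_encoding, to = to_encoding)
  writeLines(converted_input, files[i])
}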

The available encodings can be listed with the iconvlist() command.
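
If the "Unicode" option the question mentions is Notepad's, the files are most likely UTF-16LE with a byte order mark, and reading them through a connection opened with that encoding may work better than calling `iconv` on lines that were read as raw bytes. The following is a sketch under that assumption. The demo file and the names `demo_path`, `body`, `con`, and `out` are illustrative, and the demo text is kept to ASCII so it behaves the same in any locale.

```r
# Illustrative setup: write a small UTF-16LE file with a BOM and CRLF line endings.
demo_path <- file.path(tempdir(), "utf16_demo.txt")
body <- iconv("line one\r\nline two\r\n", from = "UTF-8", to = "UTF-16LE",
              toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xFF, 0xFE)), body), demo_path)

# Read through a connection that re-encodes from UTF-16LE.
con <- file(demo_path, open = "r", encoding = "UTF-16LE")
lines <- readLines(con)
close(con)

# Write the same lines back as UTF-8, overwriting the original file.
out <- file(demo_path, open = "w", encoding = "UTF-8")
writeLines(lines, out)
close(out)
```

For real data it is safer to write to a new directory first and compare the results before overwriting anything.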

mischva11
  • Thank you for your advice. I tried the loop with `from = "UTF8", to = "ASCII"` and it ran without errors, but when I run `docs<- VCorpus(x=files)` I get the following error: `Error: inherits(x, "Source") is not TRUE`. – Michael Jun 10 '18 at 17:18
  • I think this is a different question than the one originally asked. I don't know the tm package and its concept of sources well. You can try searching for a solution (on the first try I found a few questions with the same error message), and if you don't find anything, I would advise you to open a new question with reproducible data. – mischva11 Jun 10 '18 at 19:48
  • Thank you for your help, mischva11. – Michael Jun 11 '18 at 13:38
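
One likely reading of the `inherits(x, "Source") is not TRUE` error: the answer's `list.files` call reassigns `files` to a plain character vector, while `VCorpus` expects a source object such as the one `DirSource` returns. Here is a minimal sketch of rebuilding the source after the conversion. It assumes the tm package is installed, and `demo_dir`, `paths`, `is_source`, and `src` are illustrative names.

```r
# Illustrative setup: a temporary directory with one text file.
demo_dir <- file.path(tempdir(), "corpus_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("some text", file.path(demo_dir, "doc1.txt"))

# A character vector of paths is not a tm Source, so VCorpus() rejects it.
paths <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
is_source <- inherits(paths, "Source")

# Build the corpus from a DirSource instead (skipped if tm is not installed).
if (requireNamespace("tm", quietly = TRUE)) {
  src <- tm::DirSource(directory = demo_dir, encoding = "UTF-8")
  docs <- tm::VCorpus(x = src)
  writeLines(as.character(docs[[1]]))
}
```

Keeping the path vector and the source under different names avoids one overwriting the other.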