
I am reading a text file with readtext().
It appears to be encoded in UTF-8 (according to Notepad++; I am unable to verify this independently),
and I am not sure whether it is encoded correctly or whether it contains mistakes or corruption.
The file size on disk, according to Windows Explorer, is over 200 MB.
When I read it and check its size in RAM with

format(object.size(my_rt), units = "MiB")

I get

[1] 15 MiB # I manually removed some irrelevant info

readtext() does not give any error or warning when reading it with

my_rt <- readtext(nomeFile,
                  docvarsfrom = "filenames",
                  docvarnames = c("lng", "country", "type", "b", "c", "d"),
                  dvsep = "[_.]",
                  encoding = "UTF-8",
                  verbosity = 3)

I am practically certain that the file is not being read in full, because a slightly bigger file occupies 198.2 MiB in RAM and a smaller file occupies 157 MiB.

Is there a way to find out what is going wrong with readtext, and where?
Should I report this as an issue for readtext even though I do not understand what the problem is?
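
One way to narrow this down without readtext is to look at the raw bytes of the file. The following is a minimal sketch in base R, not a diagnosis: `inspect_bytes()` is a hypothetical helper (it is not part of readtext), and the idea that embedded NUL or Ctrl-Z bytes are responsible is only an assumption, since some text-mode readers on Windows stop at such bytes. The demonstration uses a small temporary file rather than the real one.

```r
# Hypothetical helper (not part of readtext): summarise bytes that can make
# text-mode readers stop early or silently drop content.
inspect_bytes <- function(path) {
  n <- file.size(path)
  bytes <- readBin(path, what = "raw", n = n)  # read the file as raw bytes
  list(
    size_on_disk = n,
    n_nul   = sum(bytes == as.raw(0x00)),  # embedded NUL bytes
    n_ctrlz = sum(bytes == as.raw(0x1A))   # Ctrl-Z, an EOF marker for some Windows text-mode readers
  )
}

# Self-contained demonstration on a temporary file containing one Ctrl-Z byte.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\n"), as.raw(0x1A), charToRaw("second line\n")), tmp)
res <- inspect_bytes(tmp)
print(res)
```

On the real file, `res$size_on_disk` could then be compared with `sum(nchar(my_rt$text, type = "bytes"))`; a large gap would support the suspicion that only part of the file was read.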

user778806
  • Have you tried reading it with other functions (such as those in the `readr` package) to see if you get a similar issue? – Andrew Gustar Apr 15 '18 at 17:56
  • I cannot see a way of using readr, as the data are not "rectangular". readLines() has the same problem and, based on what I have read, should be "weaker" than readtext at dealing with encodings and related problems. – user778806 Apr 15 '18 at 18:25
  • This should be filed as a **readtext** issue, but we can only diagnose it with access to your file. – Ken Benoit Apr 15 '18 at 20:45
  • I checked the file that was causing problems for unusual characters, found and removed them, and now the sizes of the .rds files of the corpora look normal. Should I file an issue, or should NLP packages not be expected to deal with unusual characters? (At first sight this seems similar to this coreNLP question/issue: https://stackoverflow.com/questions/33722024/how-to-remove-non-valid-unicode-characters-from-strings-in-java?noredirect=1&lq=1 ) – user778806 Apr 16 '18 at 19:43
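
The last comment does not say how the unusual characters were found or removed. As a rough sketch only, assuming they were byte sequences that are not valid UTF-8 (which the thread does not confirm), base R can flag the affected lines with `validUTF8()` and drop the offending bytes with `iconv()`. The sample strings below are made up for illustration.

```r
# Made-up sample lines; the third contains a byte (0xFF) that is never valid in UTF-8.
bad_line <- rawToChar(c(charToRaw("broken "), as.raw(0xff), charToRaw(" byte")))
lines <- c("plain ascii", "another clean line", bad_line)

# Indices of lines that are not valid UTF-8.
bad <- which(!validUTF8(lines))

# Drop bytes that cannot be interpreted as UTF-8 (sub = "" removes them).
cleaned <- iconv(lines, from = "UTF-8", to = "UTF-8", sub = "")

print(bad)
print(all(validUTF8(cleaned)))
```

Inspecting `lines[bad]` before cleaning shows which parts of the input are affected, which is also the kind of detail a readtext issue report would need.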

0 Answers