
I am in dire need. I have a corpus that I have converted into a common language, but some of the words were not properly converted into English. Therefore, my corpus has non-ASCII characters such as U+00F8.

I am using Quanteda and I have imported my text using this code:

 EUCorpus <- corpus(textfile(file="/Users/RiohBurke/Documents/RStudio/PROJECT/*.txt"), encodingFrom = "UTF-8-BOM")

My corpus consists of 166 documents. Having imported the documents into R, what would be the best way to get rid of these non-ASCII characters?
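For example, this is one way to check which documents still contain non-ASCII characters (a rough sketch, using quanteda's texts() accessor):

# Indices of the documents that contain at least one non-ASCII character
which(grepl("[^\x01-\x7F]", texts(EUCorpus)))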

  • You can do this with `iconv`. See this answer for details: http://stackoverflow.com/a/9935242/5151349 – mkt Jul 04 '16 at 11:05

1 Answer


Try:

texts(EUCorpus) <- iconv(texts(EUCorpus), from = "UTF-8", to = "ASCII", sub = "")

This converts the text to ASCII, replacing any non-translatable characters (those outside the 0-127 ASCII range) with nothing, i.e. deleting them.
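To see what this does on a toy example (a minimal sketch, assuming a UTF-8 session; output shown as comments):

x <- "Søren møder Åse"  # contains ø (U+00F8) and other non-ASCII letters
iconv(x, from = "UTF-8", to = "ASCII", sub = "")
# [1] "Sren mder se"    # every character outside 0-127 has been dropped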

Ken Benoit
    is `gsub('[^ -~]', '', x)` a possible approach that might be faster? I'm on vacation so no R to test myself. – Tyler Rinker Jul 04 '16 at 15:19
  • How do we know to convert from UTF-8 to ASCII? A document detailing this would be helpful. Thanks! – matsuo_basho Oct 12 '17 at 15:23
  • Try `file mytextfile.txt` at the Terminal; this lists known encodings for text files. There are also some detection methods in **stringi**, e.g. `stri_enc_detect()`. – Ken Benoit Oct 12 '17 at 15:41
  • I prefer to use regex as suggested by @TylerRinker to enable additional manipulations (like removing extra spaces), and I use hex values for readability, e.g. `"[^\x20-\x7E]"`, as proposed [here](https://stackoverflow.com/a/50398057/5420406) – BroVic Aug 15 '19 at 15:21
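To illustrate the regex alternative suggested in the comments by Tyler Rinker and BroVic (a minimal sketch; it keeps only printable ASCII, space 0x20 through tilde 0x7E):

x <- "Søren møder Åse"
gsub("[^\x20-\x7E]", "", x)
# [1] "Sren mder se"    # same result as the iconv() call above

One difference to be aware of: this regex also strips ASCII control characters such as tabs and newlines, because they fall outside the 0x20-0x7E range, whereas the iconv() conversion keeps them.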