
I am in dire need. I have a corpus that I have converted into a common language, but some of the words were not properly converted into English. Therefore, my corpus has non-ASCII characters such as U+00F8.

I am using Quanteda and I have imported my text using this code:

 EUCorpus <- corpus(textfile(file="/Users/RiohBurke/Documents/RStudio/PROJECT/*.txt"), encodingFrom = "UTF-8-BOM")

My corpus consists of 166 documents. Having imported the documents into R, what would be the best way to get rid of these non-ASCII characters?
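For example, this is one way to check which documents still contain non-ASCII characters (a rough sketch, using quanteda's texts() accessor):

# Indices of the documents that contain at least one non-ASCII character
which(grepl("[^\x01-\x7F]", texts(EUCorpus)))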

  • You can do this with `iconv`. See this answer for details: http://stackoverflow.com/a/9935242/5151349 – mkt Jul 04 '16 at 11:05

1 Answer


Try:

texts(EUCorpus) <- iconv(texts(EUCorpus), from = "UTF-8", to = "ASCII", sub = "")

This converts the text to ASCII, replacing any non-translatable characters (those outside the 0-127 ASCII range) with nothing, i.e. deleting them.
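To see what this does on a toy example (a minimal sketch, assuming a UTF-8 session; output shown as comments):

x <- "Søren møder Åse"  # contains ø (U+00F8) and other non-ASCII letters
iconv(x, from = "UTF-8", to = "ASCII", sub = "")
# [1] "Sren mder se"    # every character outside 0-127 has been dropped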

Ken Benoit
    is `gsub('[^ -~]', '', x)` a possible approach that might be faster? I'm on vacation so no R to test myself. – Tyler Rinker Jul 04 '16 at 15:19
  • How do we know to convert from UTF-8 to ASCII? A document detailing this would be helpful. Thanks! – matsuo_basho Oct 12 '17 at 15:23
  • Try `file mytextfile.txt` at the Terminal; this lists known encodings for text files. There are also some detection methods in **stringi**, e.g. `stri_enc_detect()`. – Ken Benoit Oct 12 '17 at 15:41
  • I prefer to use regex as suggested by @TylerRinker to enable additional manipulations (like removing extra spaces), and I use hex values for readability, e.g. `"[^\x20-\x7E]"`, as proposed [here](https://stackoverflow.com/a/50398057/5420406) – BroVic Aug 15 '19 at 15:21
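To illustrate the regex alternative suggested in the comments by Tyler Rinker and BroVic (a minimal sketch; it keeps only printable ASCII, space 0x20 through tilde 0x7E):

x <- "Søren møder Åse"
gsub("[^\x20-\x7E]", "", x)
# [1] "Sren mder se"    # same result as the iconv() call above

One difference to be aware of: this regex also strips ASCII control characters such as tabs and newlines, because they fall outside the 0x20-0x7E range, whereas the iconv() conversion keeps them.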