How to use regex to strip punctuation without corrupting UTF-8 or UTF-16 encoded text such as Chinese?

Question

How do I strip punctuation from ASCII and UTF-8 encoded strings in R without corrupting the original UTF-8 characters, specifically Chinese?

library(stringi)  # provides stri_replace_all_regex
text <- "Longchamp Le Pliage 肩背包 (小)"
stri_replace_all_regex(text, '\\p{P}', '')

results in:

Longchamp Le Pliage ��背�� 小

but the desired result should be:

Longchamp Le Pliage 肩背包 小

I'm looking to remove all the CJK Symbols and Punctuation as well as ASCII punctuation.

@akrun, sessionInfo() is as follows

locale:
[1] LC_COLLATE=English_Singapore.1252  LC_CTYPE=English_Singapore.1252    LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Singapore.1252

Can you show the `sessionInfo()`? Perhaps setting the locale will work for you (it works for me). Check [here](http://stackoverflow.com/questions/20577764/set-locale-to-system-default-utf-8) — akrun, Sep 08 '15 at 06:57
Why not use [`gsub("\\p{P}", "", text, perl=T)`](https://ideone.com/NaDZI6)? — Wiktor Stribiżew, Sep 08 '15 at 06:58
Try ``gsub("\\p{P}", "", `Encoding<-`(text, "UTF-8"), perl=T)``, which explicitly marks the text as UTF-8. — Wiktor Stribiżew, Sep 08 '15 at 07:09
@akrun, yes I have, but English_United States.1252 doesn't help either. — Zeke, Sep 08 '15 at 07:17
I can only recommend [this resource](https://support.rstudio.com/hc/en-us/articles/200532197-Character-Encoding) for hints. I'd also try with UTF16 encoding, since Chinese characters are best handled with this Unicode encoding. — Wiktor Stribiżew, Sep 08 '15 at 07:29

score 1 · Accepted Answer · edited May 23 '17 at 12:06

How well Chinese characters (hanzi) display varies with platform and IDE (see this answer for lots of details about R's handling of non-ASCII characters). It looks to me like stri_replace_all_regex is doing what you want, but some of the hanzi are being displayed incorrectly (even if their underlying code points are correct). Try this:

library(stringi)
my_text <- "Longchamp Le Pliage 肩背包 (小)"
plot(0,0)
text(0, 0, my_text, pos=3)

If the text displays correctly on the plot, then the underlying string is properly encoded and the problem is only how it is displayed in the R terminal. If not, check Encoding(my_text) and consider using enc2utf8 before further text processing. If the plotting worked, try:

no_punct <- stri_replace_all_regex(my_text, "\\p{P}", "")
text(0, 0, no_punct, pos=1)

to see whether stri_replace_all_regex is in fact producing the result you expect.
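
If plotting is inconvenient, another display-independent check is to look at the escaped code points rather than the rendered glyphs. The following is only a minimal sketch and not part of the original test: it assumes the stringi package is installed, and it builds the string from \u escapes (an assumption made here so the example does not depend on how the script file or console is encoded).

```r
library(stringi)

# Same text as above, written with \u escapes:
# U+80A9 U+80CC U+5305 are the three hanzi, U+5C0F is the one in parentheses.
my_text <- "Longchamp Le Pliage \u80a9\u80cc\u5305 (\u5c0f)"

# Inspect the declared encoding mark, then make sure the string is UTF-8.
Encoding(my_text)
my_text <- enc2utf8(my_text)

no_punct <- stri_replace_all_regex(my_text, "\\p{P}", "")

# Print non-ASCII code points as ASCII escapes, which look the same on any
# console regardless of locale or font support.
stri_escape_unicode(no_punct)
```

If the escaped output still contains the four hanzi code points and no parentheses, the replacement worked and only the console rendering is at fault.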

Thanks! The plot does actually verify that stri_replace_all_regex is indeed doing what it's expected to. — Zeke, Sep 10 '15 at 06:29
Glad to hear it. That means you should be able to safely do whatever text processing you need, plot, save to external files, etc. If you're still nervous, you could try writing `no_punct` to an external text file and opening it in a Unicode-aware text editor (like Notepad++, if you're on Windows) just to be sure that nothing is getting corrupted. — drammock, Sep 10 '15 at 17:34
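
The file check suggested in the comment above could be sketched roughly as follows. This is an illustration under assumptions, not something posted in the thread: the output path is a hypothetical temporary file, and the string is built from \u escapes so the example does not depend on the script's own encoding.

```r
library(stringi)

no_punct <- stri_replace_all_regex(
  "Longchamp Le Pliage \u80a9\u80cc\u5305 (\u5c0f)", "\\p{P}", ""
)

# Hypothetical output path; tempfile() is used here only for illustration.
out_path <- tempfile(fileext = ".txt")

# useBytes = TRUE writes the UTF-8 bytes as they are, instead of re-encoding
# them into the native locale (for example Windows-1252).
writeLines(enc2utf8(no_punct), out_path, useBytes = TRUE)

# Read the file back, declaring its contents as UTF-8, and compare.
round_trip <- readLines(out_path, encoding = "UTF-8")
identical(round_trip, no_punct)
```

The file at out_path can then also be opened in a Unicode-aware editor to confirm by eye that the hanzi are intact.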
