How do I strip punctuation from ASCII and UTF-8 encoded strings in R without corrupting the original UTF-8 characters, specifically Chinese?

library(stringi)
text <- "Longchamp Le Pliage 肩背包 (小)"
stri_replace_all_regex(text, '\\p{P}', '')

results in:

Longchamp Le Pliage ��背�� 小

but the desired result should be:

Longchamp Le Pliage 肩背包 小

I'm looking to remove all the CJK Symbols and Punctuation as well as ASCII punctuation.

@akrun, sessionInfo() is as follows

locale:
[1] LC_COLLATE=English_Singapore.1252  LC_CTYPE=English_Singapore.1252    LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Singapore.1252    
Zeke

1 Answer

How Chinese characters (hanzi) are displayed varies by platform and IDE (see this answer for lots of details about R's handling of non-ASCII characters). It looks to me like stri_replace_all_regex is doing what you want, but some of the hanzi are being displayed incorrectly (even if their underlying code points are correct). Try this:

library(stringi)
my_text <- "Longchamp Le Pliage 肩背包 (小)"
plot(0,0)
text(0, 0, my_text, pos=3)

If you can get the text to display on a plot, then the underlying string is properly encoded and the problem is just how it displays in the R terminal. If not, check Encoding(my_text) and consider using enc2utf8 before further text processing. If the plotting worked, try:

no_punct <- stri_replace_all_regex(my_text, "\\p{P}", "")
text(0, 0, no_punct, pos=1)

to see if the result of stri_replace_all_regex is in fact doing what you expect.
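If plotting is inconvenient, the same check can be done without relying on how the console renders glyphs. The following is only a minimal sketch: it assumes the stringi package is installed, and it builds the input from \u escapes (an assumption made here so the encoding of the script file cannot interfere). stri_escape_unicode prints an ASCII-only view of the code points, so a mis-rendering terminal cannot hide or fake corruption.

```r
library(stringi)

# Input built from \u escapes: U+80A9 U+80CC U+5305 are the three hanzi,
# U+5C0F is the one inside the parentheses.
my_text <- "Longchamp Le Pliage \u80a9\u80cc\u5305 (\u5c0f)"

# Make sure the string is UTF-8 before processing (no-op if it already is).
my_text <- enc2utf8(my_text)

# Remove every character in the Unicode punctuation categories.
no_punct <- stri_replace_all_regex(my_text, "\\p{P}", "")

# Inspect the result in ways that do not depend on the console font/locale.
print(Encoding(no_punct))            # declared encoding of the result
print(stri_escape_unicode(no_punct)) # non-ASCII characters shown as escapes
```

If the escaped output still lists the four hanzi code points and no parentheses, the replacement worked and only the display is at fault.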

drammock
  • Thanks! The plot does verify that stri_replace_all_regex is indeed doing what it's expected to. – Zeke Sep 10 '15 at 06:29
  • Glad to hear it. That means you should be able to safely do whatever text processing you need, plot, save to external files, etc. If you're still nervous, you could try writing `no_punct` to an external text file and opening it in a Unicode-aware text editor (like Notepad++, if you're on Windows) just to be sure that nothing is getting corrupted. – drammock Sep 10 '15 at 17:34
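
The file check suggested in the last comment could look roughly like the sketch below. It is not from the original thread: the temporary file path is a placeholder, and writing through a binary connection with useBytes = TRUE is one common base-R idiom for keeping the bytes as UTF-8 on a non-UTF-8 Windows locale, not the only way to do it.

```r
library(stringi)

# Same example string as in the answer, built from \u escapes.
my_text  <- "Longchamp Le Pliage \u80a9\u80cc\u5305 (\u5c0f)"
no_punct <- stri_replace_all_regex(my_text, "\\p{P}", "")

# Hypothetical output location; replace with a real path to inspect the file.
out_file <- tempfile(fileext = ".txt")

# Write the UTF-8 bytes as-is, bypassing re-encoding to the native locale.
con <- file(out_file, open = "wb")
writeLines(enc2utf8(no_punct), con, useBytes = TRUE)
close(con)

# Read the file back, declaring the content as UTF-8, and compare.
round_trip <- readLines(out_file, encoding = "UTF-8")
print(identical(round_trip, no_punct))
```

Opening the resulting file in a Unicode-aware editor should then show the hanzi intact.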