How do I remove all the Chinese characters from a string?

Question

I am trying to remove all the Chinese characters from the following string:

x <- "2.87Y 1282501 12电网MTN4 AAA 4.40 /4.30* 2000、"

How can I do this?

Rich Scriven · Answer 1 · 2017-11-02T23:57:52.750

I went Googling around and found a page about Unicode character ranges. After looking through some of the CJK (Chinese, Japanese, Korean) Unicode ranges, I came to the conclusion that, if all your strings are similar to this particular string, you need to remove the characters in the following Unicode ranges:

4E00-9FFF for CJK Unified Ideographs
3000-303F for CJK Symbols and Punctuation

Using gsub(), we can do

gsub("[\U4E00-\U9FFF\U3000-\U303F]", "", x)
# [1] "2.87Y 1282501 12MTN4 AAA 4.40 /4.30* 2000"

Data:

x <- "2.87Y 1282501 12电网MTN4 AAA 4.40 /4.30* 2000、"

score 4 · Answer 2 · answered Nov 02 '17 at 10:29


You can also do this using iconv(). Note that this removes all non-ASCII characters: Chinese, Japanese, and Korean text, but also any other non-ASCII characters such as accented letters.

iconv(x, "latin1", "ASCII", sub="")
#[1] "2.87Y 1282501 12MTN4 AAA 4.40 /4.30* 2000"


Santosh M.


yanshengjia · Answer 3 · 2019-05-24T07:35:46.230


The Unicode range commonly used for Chinese characters is \u4E00-\u9FA5. This range covers only ideographs, so CJK punctuation such as 、 (U+3001) is left in place.

First use re.findall(u'[^\u4E00-\u9FA5]', string) to get the list of non-Chinese characters in the string, then scan the string and remove every character that is not in that list.

Try this:

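import re

def strip_chinese(string):
    # Collect every character that falls outside the \u4E00-\u9FA5 range.
    en_list = re.findall(u'[^\u4E00-\u9FA5]', string)
    # Any character missing from that list is Chinese, so remove it.
    for c in string:
        if c not in en_list:
            string = string.replace(c, '')
    return string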
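
The same filtering can also be sketched as a single pass with re.sub, which avoids building the intermediate list. This is only an illustrative variant, not the code from this answer: the name strip_cjk is hypothetical, and the character class assumes the two ranges given in Answer 1 (4E00-9FFF and 3000-303F), so CJK punctuation such as 、 is removed as well.

```python
import re

# Hypothetical helper for illustration; not part of the original answer.
# Ranges assumed from Answer 1:
#   U+4E00-U+9FFF  CJK Unified Ideographs
#   U+3000-U+303F  CJK Symbols and Punctuation
CJK_PATTERN = re.compile("[\u4e00-\u9fff\u3000-\u303f]")


def strip_cjk(text):
    """Return text with every character in the two ranges above removed."""
    return CJK_PATTERN.sub("", text)


x = "2.87Y 1282501 12电网MTN4 AAA 4.40 /4.30* 2000、"
print(strip_cjk(x))
# 2.87Y 1282501 12MTN4 AAA 4.40 /4.30* 2000
```

Compiling the pattern once keeps repeated calls cheap. If your strings contain characters from other CJK blocks, the character class would need to be extended accordingly.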

edited May 24 '19 at 07:35

answered May 24 '19 at 06:29


Please add some description to the answer explaining what the code does and why it is written as it is. Thanks. – Tatranskymedved May 24 '19 at 06:48

score 1 · Answer 4 · answered Feb 12 '20 at 04:31


This can be done using Unicode blocks and the stringr package. A separate answer gives the Unicode blocks; there is more than one, so the character class below lists several ranges.

> library(stringr)
> str_replace_all("先秦兩漢", "[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]", "")
[1] ""


CoderGuy123
