I am trying to remove all the Chinese characters from the following string:
x <- "2.87Y 1282501 12电网MTN4 AAA 4.40 /4.30* 2000、"
How can I do this?
I went Googling around and found a page about Unicode character ranges. After looking through some of the CJK (Chinese, Japanese, Korean) Unicode ranges, I concluded that, if all your strings are similar to this particular string, you need to remove the following ranges:
4E00-9FFF for CJK Unified Ideographs
3000-303F for CJK Symbols and Punctuation
Using gsub(), we can do
gsub("[\U4E00-\U9FFF\U3000-\U303F]", "", x)
# [1] "2.87Y 1282501 12MTN4 AAA 4.40 /4.30* 2000"
Data:
x <- "2.87Y 1282501 12电网MTN4 AAA 4.40 /4.30* 2000、"
You can also do this using iconv. This will remove all non-ASCII characters: Chinese, Japanese and Korean characters, but also accented Latin letters and any other non-ASCII symbols.
iconv(x, "latin1", "ASCII", sub="")
#[1] "2.87Y 1282501 12MTN4 AAA 4.40 /4.30* 2000"
In Python, the commonly used Chinese characters fall in the Unicode range \u4E00-\u9FA5.
First use re.findall(u'[^\u4E00-\u9FA5]', string) to get the list of non-Chinese characters in the string, then scan the string and remove every character that is not in that list.
Try this:
import re
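def strip_chinese(string):
    # Collect every character outside the \u4E00-\u9FA5 range.
    en_list = re.findall(u'[^\u4E00-\u9FA5]', string)
    for c in string:
        # Drop any character that is not in the non-Chinese list.
        if c not in en_list:
            string = string.replace(c, '')
    return string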
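A shorter variant of the same idea, offered only as a sketch: re.sub can delete the matching characters in one pass instead of building a list and looping. The names CHINESE_RE and strip_chinese_sub are made up for this illustration. Note that with this range the trailing "、" (U+3001) from the question's string is kept, because it lies outside \u4E00-\u9FA5.

```python
import re

# Same range as the answer above: U+4E00..U+9FA5.
# CHINESE_RE and strip_chinese_sub are hypothetical names for this sketch.
CHINESE_RE = re.compile(r'[\u4E00-\u9FA5]')

def strip_chinese_sub(text):
    """Delete every character in the U+4E00..U+9FA5 range in a single pass."""
    return CHINESE_RE.sub('', text)

x = "2.87Y 1282501 12电网MTN4 AAA 4.40 /4.30* 2000、"
print(strip_chinese_sub(x))
```

If the ideographic comma should go as well, the character class would need to cover the CJK Symbols and Punctuation range (3000-303F) too.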
This can be done using Unicode blocks and the stringr package. This answer gives the Unicode blocks; there is more than one.
> library(stringr)
> str_replace_all("先秦兩漢", "[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]", "")
[1] ""
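For anyone following the Python answer in this thread, a comparable multi-block character class might look like the sketch below. The ranges are simply the ones quoted in the answers here (with the ideograph block taken up to 9FFF and the 3000-303F punctuation block added), so treat them as an assumption rather than a complete list of CJK blocks. CJK_RE and strip_cjk are hypothetical names.

```python
import re

# Ranges copied from the answers in this thread; not an exhaustive list.
# CJK_RE and strip_cjk are hypothetical names for this sketch.
CJK_RE = re.compile(
    '['
    '\u2E80-\u2FD5'  # CJK radicals (range as quoted above)
    '\u3000-\u303F'  # CJK Symbols and Punctuation
    '\u3190-\u319F'  # range as quoted above
    '\u3400-\u4DBF'  # CJK Unified Ideographs Extension A
    '\u4E00-\u9FFF'  # CJK Unified Ideographs
    '\uF900-\uFAAD'  # CJK compatibility ideographs (range as quoted above)
    ']'
)

def strip_cjk(text):
    """Remove every character that falls in one of the ranges listed above."""
    return CJK_RE.sub('', text)

print(strip_cjk("2.87Y 1282501 12电网MTN4 AAA 4.40 /4.30* 2000、"))
print(repr(strip_cjk("先秦兩漢")))
```

Because 3000-303F is included, the trailing "、" is removed along with the ideographs.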