How to separate Japanese texts and English texts from a Pandas Dataframe?

Question

I have a dataframe column that contains both English and Japanese texts, like this:

----IDs-------Texts ---------
    132   |  復旧完了。よろしく頼む！ 
    623   |  This is an English text 
    2364  |  "<@UD3JFBREV> 収集した日本語のツイートデータはどこにありますでしょうか" 
    ...   |  .....

Now I want to separate the English texts from the Japanese texts in the Texts column. My new dataframe should contain only the English texts and ignore the Japanese ones. How can I do it?

Can you also add actual Japanese text to the sample data above? – Dishin H Goyani Sep 15 '20 at 06:27
@DishinHGoyani I added two samples. – sksoumik Sep 15 '20 at 06:32
`df["Texts"].str.extract(r"([A-Za-z0-9\s]+)")`? – Henry Yik Sep 15 '20 at 06:36

score 1 · Answer 1 · answered Sep 16 '20 at 08:18

Thanks for the suggested solutions above. Unfortunately, they didn't solve my problem. Here is what worked for my dataset:

df['Texts'] = df.loc[~df['Texts'].str.contains(r'[^\x00-\x7F]', na=False), 'Texts']

This keeps the text only for rows that contain no non-ASCII characters and leaves NaN in every other row. I then dropped the NaN rows (for example with df.dropna(subset=['Texts'])), which gave me all the English sentences from the data frame.
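As a rough sketch of the same regex idea, the mask can also be used to filter rows directly, which yields both halves of the split without going through NaN. The sample rows are copied from the question; the names `has_non_ascii`, `english_df` and `japanese_df` are illustrative, and "English" here only means "contains no non-ASCII character".

```python
import pandas as pd

# Sample data modelled on the rows shown in the question.
df = pd.DataFrame({
    "IDs": [132, 623, 2364],
    "Texts": [
        "復旧完了。よろしく頼む！",
        "This is an English text",
        "<@UD3JFBREV> 収集した日本語のツイートデータはどこにありますでしょうか",
    ],
})

# True for rows whose text contains at least one non-ASCII character.
# na=False makes missing values count as "no match".
has_non_ascii = df["Texts"].str.contains(r"[^\x00-\x7F]", na=False)

english_df = df[~has_non_ascii]   # rows with ASCII-only text
japanese_df = df[has_non_ascii]   # rows with any non-ASCII character

print(english_df)
```

Note that with `na=False`, rows with a missing text end up in `english_df`; drop them first if that is not wanted.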

score 0 · Answer 2 · tania · 2020-09-15T08:07:00.343

Japanese characters lie outside the ASCII range, so you can filter your text on an "is ASCII" condition. Assuming your strings contain only Japanese or English, you can apply the string method .isascii() (available in Python 3.7 and above) to each element of the "Texts" column, as follows:

df[df['Texts'].apply(lambda x: x.isascii())]

In the example above, this returns:

IDs  Texts
623  This is an English text

In earlier versions of Python, you can do:

df[df['Texts'].apply(lambda x: len(x.encode('utf8')) == len(x))]

(Essentially, ASCII characters take exactly one byte in UTF-8 and every other character takes more, so if the encoded length equals the length of the string, the string is pure ASCII and is assumed to be English.)

You can test how this works by applying it on strings:

"<@UD3JFBREV> 収集した日本語のツイートデータはどこにありますでしょうか".isascii()
False 

"This is an English text".isascii()
True
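Both one-liners above raise an error if the column holds a non-string value such as NaN or None. One possible way around that, sketched here under the assumption that such values should simply be excluded, is a small helper that works on any Python 3 version. The helper name `is_ascii` and the row with ID 7 are hypothetical.

```python
import pandas as pd

def is_ascii(value):
    """Return True if value is a str made up only of ASCII characters."""
    if not isinstance(value, str):
        # Assumption: NaN, None and other non-strings are not English text.
        return False
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

df = pd.DataFrame({
    "IDs": [132, 623, 7],
    "Texts": ["復旧完了。よろしく頼む！", "This is an English text", None],
})

# Keep only the rows whose text passes the ASCII check.
english_df = df[df["Texts"].apply(is_ascii)]
print(english_df)
```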

edited Sep 15 '20 at 08:07

answered Sep 15 '20 at 06:44

tania


What about **UD3JFBREV**? That's also in English, right? He said to return English texts – Karthik Sep 15 '20 at 06:49
Good question. I understood that the whole Texts string needs to be in english. Let's see what he says. – tania Sep 15 '20 at 06:51
`df[df['text'].apply(lambda x: x.isascii())]` returns AttributeError: 'str' object has no attribute 'isascii' – sksoumik Sep 15 '20 at 07:43
@Karthik yes, but I only need the texts that don't contain any Japanese characters. – sksoumik Sep 15 '20 at 07:45
@sksoumik I see. Apparently this was introduced in Python 3.7: https://bugs.python.org/issue32677 I'm changing the answer to add an alternative solution. – tania Sep 15 '20 at 08:05
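Since the stated goal in the comments is to drop texts that contain Japanese characters, a different technique from the ASCII checks in the answers is to match Japanese script ranges directly, so that accented Latin text is not discarded. This is only a sketch: the character ranges below are an assumption (hiragana, katakana and the main CJK ideograph block), they are not exhaustive, and the ideograph range also matches Chinese text. The row with ID 900 is hypothetical.

```python
import pandas as pd

# Assumed ranges: hiragana and katakana (U+3040 to U+30FF) plus the main
# CJK Unified Ideographs block (U+4E00 to U+9FFF). Not exhaustive.
JAPANESE_CHARS = r"[\u3040-\u30ff\u4e00-\u9fff]"

df = pd.DataFrame({
    "IDs": [132, 623, 900],
    "Texts": [
        "復旧完了。よろしく頼む！",
        "This is an English text",
        "A café in Paris",  # non-ASCII, but not Japanese
    ],
})

# True for rows containing at least one character from the ranges above.
has_japanese = df["Texts"].str.contains(JAPANESE_CHARS, na=False)
no_japanese_df = df[~has_japanese]
print(no_japanese_df)
```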
