I have a dataframe column that contains both English and Japanese text, like this:

    IDs   |  Texts
    ------+---------------------------
    132   |  復旧完了。よろしく頼む!
    623   |  This is an English text
    2364  |  "<@UD3JFBREV> 収集した日本語のツイートデータはどこにありますでしょうか"
    ...   |  .....

Now I want to separate the English texts from the Japanese texts in the Texts column. The new dataframe should contain only the English texts and ignore the Japanese ones. How can I do this?

sksoumik

2 Answers

Thanks for the suggested solutions. Unfortunately, they didn't solve my problem. What worked for my dataset was this:

df['Texts'] = df.loc[~df['Texts'].str.contains(r'[^\x00-\x7F]', na=False), 'Texts']

This sets Texts to NaN in every row that contains a non-ASCII character. I then dropped the NaN rows (for example with df.dropna(subset=['Texts'])), which left only the English sentences in the dataframe.
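The same filter can also be done in one step, without going through NaN. The following is a minimal sketch, assuming pandas is installed and using sample data copied from the question; the names `has_non_ascii` and `english_df` are made up for illustration.

```python
import pandas as pd

# Sample data mirroring the table in the question.
df = pd.DataFrame({
    "IDs": [132, 623, 2364],
    "Texts": [
        "復旧完了。よろしく頼む!",
        "This is an English text",
        "<@UD3JFBREV> 収集した日本語のツイートデータはどこにありますでしょうか",
    ],
})

# True for rows whose text contains at least one non-ASCII character.
# na=False makes missing values count as "no match" instead of NaN.
has_non_ascii = df["Texts"].str.contains(r"[^\x00-\x7F]", na=False)

# Keep the ASCII-only rows, then drop any rows whose text is missing.
english_df = df[~has_non_ascii].dropna(subset=["Texts"])

print(english_df)
```

Because the boolean mask selects whole rows, the IDs column stays aligned with the surviving texts.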

sksoumik

Japanese text is not ASCII (its characters lie outside the ASCII range), so you can filter on whether each string is ASCII. Assuming your strings can only be Japanese or English, you can apply the string method .isascii() to each element of the "Texts" column, as follows (Python 3.7 and above):

df[df['Texts'].apply(lambda x: x.isascii())]

In the example above, this returns:

IDs Texts
623 This is an English text

In earlier versions of Python, you can do:

df[df['Texts'].apply(lambda x: len(x.encode('utf8')) == len(x))]

(Essentially, if the UTF-8 encoding of a string has as many bytes as the string has characters, then every character is ASCII, so the text is taken to be English.)
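The length comparison can be wrapped in a small helper that also tolerates non-string cells such as NaN or None. This is a sketch only: `is_ascii_text` is a hypothetical name, and treating non-strings as "not English" is an assumption, not something the question specifies.

```python
def is_ascii_text(value):
    """Return True if value is a str made up only of ASCII characters."""
    # Assumption: non-string cells (NaN, None, numbers) are not English text.
    if not isinstance(value, str):
        return False
    # Every non-ASCII character takes at least 2 bytes in UTF-8, so the
    # byte count equals the character count only for pure-ASCII strings.
    return len(value.encode("utf-8")) == len(value)


samples = [
    "復旧完了。よろしく頼む!",
    "This is an English text",
    "<@UD3JFBREV> 収集した日本語のツイートデータはどこにありますでしょうか",
    None,
]
flags = [is_ascii_text(s) for s in samples]
print(flags)
```

With a dataframe, the helper can be passed directly to apply, for example `df[df['Texts'].apply(is_ascii_text)]`.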

You can test how this works by applying it on strings:

"<@UD3JFBREV> 収集した日本語のツイートデータはどこにありますでしょうか".isascii()
False 

"This is an English text".isascii()
True
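Note that an ASCII test also rejects English text containing accented letters or typographic quotes. If that matters, one alternative is to look for Japanese characters specifically. The sketch below is an assumption on my part rather than part of the approach above: it checks a few common Unicode blocks (CJK punctuation, hiragana, katakana, CJK unified ideographs, half-width katakana) and does not cover every character used in Japanese. The names `JAPANESE_CHARS` and `contains_japanese` are made up for illustration.

```python
import re

# Common Unicode blocks used in Japanese text (not exhaustive):
#   U+3000-U+303F  CJK symbols and punctuation
#   U+3040-U+30FF  hiragana and katakana
#   U+4E00-U+9FFF  CJK unified ideographs (kanji)
#   U+FF66-U+FF9F  half-width katakana
JAPANESE_CHARS = re.compile(r"[\u3000-\u303F\u3040-\u30FF\u4E00-\u9FFF\uFF66-\uFF9F]")


def contains_japanese(text):
    """Return True if text contains a character from the blocks above."""
    return bool(JAPANESE_CHARS.search(text))


samples = [
    "復旧完了。よろしく頼む!",
    "This is an English text",
    "Café au lait, please",
]
flags = [contains_japanese(s) for s in samples]
print(flags)
```

Applied to the dataframe, this would look like `df[~df['Texts'].apply(contains_japanese)]`, which keeps the accented English sample that the ASCII test would drop.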
tania
  • What about **UDJFBREV**? That's also in English, right? He said to return English texts – Karthik Sep 15 '20 at 06:49
  • Good question. I understood that the whole Texts string needs to be in english. Let's see what he says. – tania Sep 15 '20 at 06:51
  • ```df[df['text'].apply(lambda x: x.isascii())]``` That code returns AttributeError: 'str' object has no attribute 'isascii' – sksoumik Sep 15 '20 at 07:43
  • @Karthik yes, but I only need the texts that don't contain any Japanese characters. – sksoumik Sep 15 '20 at 07:45
  • @sksoumik I see. Apparently this was introduced in Python 3.7 and above: https://bugs.python.org/issue32677 I'm changing the answer to add an alternate solution. – tania Sep 15 '20 at 08:05