In Python, extracting non-English words

Question

I have a text file that contains English characters as well as characters from other languages. Using the code below, I want to extract the words from this file that are not English, in particular Korean words (Unicode code points U+AC00 to U+D7AF; the file is encoded as UTF-8).

Is there a simple way to do this within this code?

Do I need to do something else?

....
text = f.read()
words = re.findall(r'\w+', text)
f.close()
....

Grijesh Chauhan · Answer 1 · 2014-04-01T15:51:23.167

Use capital \W, which matches any non-alphanumeric character. The underscore counts as alphanumeric here, so \W does not match _.

>>> re.findall('[\W]+', u"# @, --►(Q1)-grijesh--b----►((Qf)), ")
[u'# @, --\u25ba(', u')-', u'--', u'----\u25ba((', u')), ']

From the Unicode HOWTO: to read a Unicode-encoded text file, use:

import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for l in f:
  # regex code here
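
For readers on Python 3, here is a minimal sketch of the same read-and-match loop. It assumes Python 3, where the built-in open() accepts an encoding argument and codecs.open is not needed. The helper name words_per_line and the demo file contents are made up for illustration.

```python
import os
import re
import tempfile

def words_per_line(path):
    """Yield the list of word matches for each line of a UTF-8 text file."""
    # open() decodes the bytes to str, so the regex sees Unicode characters.
    with open(path, encoding='utf-8') as f:
        for line in f:
            yield re.findall(r'\w+', line)

# Demo: write a small UTF-8 file to a temporary location, then read it back.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write("hello 안녕\n한국어 world\n")
    demo_path = tmp.name

result = list(words_per_line(demo_path))
os.remove(demo_path)  # clean up the temporary demo file
print(result)
```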

I have a file:

:~$ cat file
# @, --►(Q1)-grijesh--b----►((Qf)),

Reading it from Python:

>>> import re
>>> import codecs
>>> f = codecs.open('file', encoding='utf-8')
>>> for l in f:
...  print re.findall('[\W]+', l)
... 
[u'# @, --\u25ba(', u')-', u'--', u'----\u25ba((', u')),\n']
>>>

To extract the alphanumeric words instead, try:

>>> f = codecs.open('file', encoding='utf-8')
>>> for l in f:
...  print re.findall('[^\W]+', l)
... 
[u'Q1', u'grijesh', u'b', u'Qf']

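Note: lowercase \w matches an alphanumeric character, including _. In Python 2, pass the re.UNICODE flag if \w should also match non-ASCII letters such as Hangul.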
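
To make the relationship between \W, \w and [^\W] concrete, here is a small sketch on the same sample string. It assumes Python 3, where str patterns are Unicode-aware by default; the variable names are only illustrative.

```python
import re

sample = "# @, --►(Q1)-grijesh--b----►((Qf)), "

non_word_runs = re.findall(r'\W+', sample)    # runs of non-word characters
word_runs = re.findall(r'\w+', sample)        # runs of word characters
negated_runs = re.findall(r'[^\W]+', sample)  # "not a non-word", same as \w+

# In Python 3, \w also matches non-ASCII letters without any extra flag.
unicode_words = re.findall(r'\w+', "grijesh 한국어")

print(non_word_runs)
print(word_runs)
print(word_runs == negated_runs)
print(unicode_words)
```

So [^\W]+ is just a roundabout spelling of \w+, and neither of them separates English words from Korean ones on its own.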

score 0 · Answer 2 · answered Apr 01 '14 at 18:32

To find all runs of characters in the range from U+AC00 to U+D7AF:

import re

L = re.findall(u'[\uac00-\ud7af]+', data.decode('utf-8'))
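
A rough Python 3 equivalent, as a sketch: it assumes the text is already a decoded str, so there is no .decode() call. The function name extract_hangul and the sample sentence are invented for the example. The range is the one from the question; the assigned Hangul syllables end at U+D7A3, and the rest of the block up to U+D7AF is unassigned.

```python
import re

# Character class covering the Hangul Syllables block, U+AC00..U+D7AF.
HANGUL_RUN = re.compile(r'[\uac00-\ud7af]+')

def extract_hangul(text):
    """Return every maximal run of Hangul syllables found in text (a str)."""
    return HANGUL_RUN.findall(text)

sample = "Hello 안녕하세요 world 한국어 123"
korean_words = extract_hangul(sample)
print(korean_words)
```

Precompiling the pattern is optional; it only saves repeating the pattern string when the function is called many times.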

To find all non-ASCII words:

import re

def isascii(word):
    return all(ord(c) < 128 for c in word)

# re.UNICODE is needed in Python 2 so that \w also matches non-ASCII letters
words = re.findall(u'\w+', data.decode('utf-8'), re.UNICODE)
non_ascii_words = [w for w in words if not isascii(w)]
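
The same filter in Python 3 might look like the sketch below. It assumes a decoded str as input; the function names has_non_ascii and find_non_ascii_words and the sample text are hypothetical. Note that this keeps any word containing at least one non-ASCII character, so accented Latin words are returned as well as Korean ones.

```python
import re

def has_non_ascii(word):
    """True if at least one character lies outside the ASCII range."""
    return any(ord(c) > 127 for c in word)

def find_non_ascii_words(text):
    """Return the word matches in text that contain a non-ASCII character."""
    words = re.findall(r'\w+', text)  # Unicode-aware by default in Python 3
    return [w for w in words if has_non_ascii(w)]

sample = "Hello 안녕하세요 caf\u00e9 world_1"
found = find_non_ascii_words(sample)
print(found)
```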
