I have a text file that contains English characters as well as characters from other languages. Using the code below, I want to extract the words from this file that are not English, particularly Korean ones (Unicode code points U+AC00 to U+D7AF; the file is stored as UTF-8).

Is there a simple way to do this within this code?

Do I need to do something else?

....
text = f.read()
words = re.findall(r'\w+', text)
f.close()
....
user3473222

2 Answers

Use capital \W, which matches any non-alphanumeric character; unlike \w, it does not match _.

>>> re.findall('[\W]+', u"# @, --►(Q1)-grijesh--b----►((Qf)), ")
[u'# @, --\u25ba(', u')-', u'--', u'----\u25ba((', u')), ']

From the Unicode HOWTO: to read a Unicode-encoded text file, use:

import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for l in f:
  pass  # regex code here
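
As a rough sketch of the same idea in Python 3 (an assumption; the snippets in this answer are Python 2), the built-in open() accepts an encoding argument, so codecs.open is not needed. The helper name read_words and its pattern parameter are hypothetical, chosen only for illustration:

```python
import re


def read_words(path, pattern=r'\w+'):
    """Collect every match of `pattern` from a UTF-8 text file, line by line."""
    words = []
    # Python 3: open() decodes the file to str using the given encoding.
    with open(path, encoding='utf-8') as f:
        for line in f:
            words.extend(re.findall(pattern, line))
    return words
```

The with block closes the file automatically, so no explicit f.close() is required.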

I have a file:

:~$ cat file
# @, --►(Q1)-grijesh--b----►((Qf)),

Reading it from Python:

>>> import re
>>> import codecs
>>> f = codecs.open('file', encoding='utf-8')
>>> for l in f:
...  print re.findall('[\W]+', l)
... 
[u'# @, --\u25ba(', u')-', u'--', u'----\u25ba((', u')),\n']
>>> 

To read the alphanumeric words instead, try:

>>> f = codecs.open('file', encoding='utf-8')
>>> for l in f:
...  print re.findall('[^\W]+', l)
... 
[u'Q1', u'grijesh', u'b', u'Qf']

Note: lowercase \w matches an alphanumeric character, including _.
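
A small sketch of how \w and \W split a mixed string, assuming Python 3, where \w is Unicode-aware by default for str patterns. The sample string is made up for illustration. Passing re.ASCII restricts \w to [a-zA-Z0-9_], which is also how Python 2 behaves when re.UNICODE is not given:

```python
import re

# Made-up sample mixing English, Korean, an underscore word, and a symbol.
sample = "hello \uc548\ub155\ud558\uc138\uc694 world_1 \u25ba"

# Runs of word characters: letters (any script), digits, and underscore.
word_runs = re.findall(r'\w+', sample)

# Runs of everything else: whitespace, punctuation, symbols.
non_word_runs = re.findall(r'\W+', sample)

# With re.ASCII, the Korean word is no longer treated as word characters.
ascii_word_runs = re.findall(r'\w+', sample, flags=re.ASCII)

print(word_runs)        # the three words, including the Korean one
print(non_word_runs)    # the separators between them
print(ascii_word_runs)  # only 'hello' and 'world_1'
```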

Grijesh Chauhan

To find all runs of characters in the range U+AC00 to U+D7AF (the Hangul Syllables block):

import re

L = re.findall(u'[\uac00-\ud7af]+', data.decode('utf-8'))
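
For readers on Python 3 (an assumption; the line above is Python 2), text read from a file opened with an encoding is already str, so no .decode() call is needed. A minimal sketch, where the names HANGUL_SYLLABLES and korean_words are hypothetical:

```python
import re

# One or more consecutive code points in U+AC00..U+D7AF (Hangul Syllables block).
HANGUL_SYLLABLES = re.compile('[\uac00-\ud7af]+')


def korean_words(text):
    """Return each run of Hangul syllables found in `text` (a str)."""
    return HANGUL_SYLLABLES.findall(text)


print(korean_words("I like \uae40\uce58 and \ube44\ube54\ubc25 a lot"))
```

Because the character class contains only Hangul syllables, English words, digits, and punctuation act as separators and never appear in the result.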

To find all non-ASCII words:

import re

def isascii(word):
    return all(ord(c) < 128 for c in word)

# re.UNICODE makes \w match non-ASCII letters in Python 2
words = re.findall(u'\w+', data.decode('utf-8'), re.UNICODE)
non_ascii_words = [w for w in words if not isascii(w)]
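
The same filter as a hedged Python 3 sketch (an assumption; the answer's code is Python 2), taking the raw file bytes as input. The function name non_ascii_words_from_bytes is hypothetical. Note that this keeps any word containing a non-ASCII character, such as accented Latin words, not only Korean ones:

```python
import re


def is_ascii(word):
    """True if every character in `word` is in the 7-bit ASCII range."""
    return all(ord(c) < 128 for c in word)


def non_ascii_words_from_bytes(data):
    """Decode UTF-8 bytes and return the words containing any non-ASCII character."""
    text = data.decode('utf-8')
    # In Python 3, \w already matches Unicode letters for str patterns.
    return [w for w in re.findall(r'\w+', text) if not is_ascii(w)]


sample_bytes = "caf\u00e9 \ud55c\uad6d\uc5b4 plain word".encode('utf-8')
print(non_ascii_words_from_bytes(sample_bytes))
```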
jfs