Use capital \W, which matches any non-alphanumeric character, excluding _.
>>> import re
>>> re.findall(r'[\W]+', u"# @, --►(Q1)-grijesh--b----►((Qf)), ")
[u'# @, --\u25ba(', u')-', u'--', u'----\u25ba((', u')), ']
From the Unicode HOWTO: to read a Unicode text file, use:
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for l in f:
    # regex code here
I have a file:
:~$ cat file
# @, --►(Q1)-grijesh--b----►((Qf)),
Reading it from Python:
>>> import re
>>> import codecs
>>> f = codecs.open('file', encoding='utf-8')
>>> for l in f:
...     print re.findall('[\W]+', l)
...
[u'# @, --\u25ba(', u')-', u'--', u'----\u25ba((', u')),\n']
>>>
To read the alphanumeric words instead, try:
>>> f = codecs.open('file', encoding='utf-8')
>>> for l in f:
...     print re.findall('[^\W]+', l)
...
[u'Q1', u'grijesh', u'b', u'Qf']
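The same file-reading loop can be sketched as a standalone script. This is only an illustration under a few assumptions of mine: it uses io.open rather than codecs.open so that it runs on both Python 2 and Python 3, it writes its own temporary sample file (the name sample.txt is hypothetical), and it passes re.UNICODE so that \w is Unicode-aware on Python 2 as well.

```python
import io
import os
import re
import tempfile

# Write a small UTF-8 sample file (hypothetical stand-in for 'file' above).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"# @, --\u25ba(Q1)-grijesh--b----\u25ba((Qf)),\n")

words = []
with io.open(path, encoding="utf-8") as f:
    for line in f:
        # \w+ collects runs of alphanumeric characters (and underscores).
        # re.UNICODE matters on Python 2; it is already the default for
        # str patterns on Python 3.
        words.extend(re.findall(r"\w+", line, re.UNICODE))

print(words)
```

Here \w+ is used directly; it is equivalent to the [^\W]+ pattern in the session above.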
Note: small \w matches an alphanumeric character, including _, so [^\W] is equivalent to \w.
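To see what "including _" means in practice, here is a minimal sketch. The input string is made up for illustration, and the [^\W_] class is an addition of mine (not part of the answer above) for the case where the underscore should also act as a separator.

```python
import re

sample = "foo_bar-baz42"  # hypothetical input

# \w+ treats the underscore as a word character, so it stays inside the match.
with_underscore = re.findall(r"\w+", sample)

# [^\W_]+ means "word characters except the underscore".
without_underscore = re.findall(r"[^\W_]+", sample)

print(with_underscore)     # ['foo_bar', 'baz42']
print(without_underscore)  # ['foo', 'bar', 'baz42']
```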