I think your regular expression is not quite right, and in any case it can be simplified.
For example, the sub-expression [{} &+( )" =!.?.:.. / | » © : >< # « ,]
can be simplified to [ !"#&()+,./:<=>?{|}©«»]: keep each character only once.
This is because [] denotes a set of characters, so repeating a character inside it has no effect.
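A quick way to convince yourself that the two character classes are equivalent is to compare them character by character. This is only a sketch; the sample string is made up for the check and does not come from your file.

```python
import re

# The original class (with repeated characters) and the deduplicated one.
original = re.compile(r'[{} &+( )" =!.?.:.. / | » © : >< # « ,]')
simplified = re.compile(r'[ !"#&()+,./:<=>?{|}©«»]')

# Hypothetical sample mixing symbols from the class with other characters.
sample = 'a{b} c&d+e(f)"g=h!i.j?k:l/m|n»o©p>q<r#s«t,u0-_'

# Both patterns should accept or reject every character identically.
same = all(
    bool(original.fullmatch(ch)) == bool(simplified.fullmatch(ch))
    for ch in sample
)
print(same)  # True
```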
Take a look at the chapter "Regular expression operations" in the Python documentation.
See: https://docs.python.org/3.4/library/re.html
In the title of your message, you wrote: "Removing symbols from a large unicode text file",
so I think that you have a set of characters you want to remove from your file.
To simplify your set of symbols, you can try:
>>> symbols = "".join(frozenset(r'[{} &+( )" =!.?.:.. / | » © : >< # « ,] 1 2 3 4 5 6 7 8 9 _ - + ; [ ] %'))
>>> print(symbols)
! #"%&)(+-,/.132547698»:=<?>[];_|©{}«
That way you can simply write:
symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'
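The order of the characters coming out of a frozenset is arbitrary and may differ between runs or Python versions. If you want a reproducible string, one option (my suggestion, not a requirement) is to sort the set before joining:

```python
raw = r'[{} &+( )" =!.?.:.. / | » © : >< # « ,] 1 2 3 4 5 6 7 8 9 _ - + ; [ ] %'

# set() drops the duplicates; sorted() orders the characters by code point,
# so the resulting string is the same on every run.
symbols = "".join(sorted(set(raw)))
print(len(symbols))
print(symbols)
```

The result contains the same characters as the string above, only in a stable order.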
Note for readers: this is not obvious, but all the strings here are unicode strings.
I think the author uses Python 3.
For Python 2.7 users, the best way is to declare the "utf8" source encoding and use the u""
syntax, like this:
# -*- coding: utf8 -*-
symbols = u'! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'
Alternatively, you can import unicode_literals, and drop the "u" prefix:
# -*- coding: utf8 -*-
from __future__ import unicode_literals
symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'
If you want to write a regular expression that matches one symbol, you have to escape the characters
with special meanings (for example, "[" should be escaped as "\[").
The best way is to use the re.escape function.
>>> import re
>>> symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'
>>> regex = "[{0}]".format(re.escape(symbols))
>>> print(regex)
[\!\ \#\"\%\&\)\(\+\-\,\/\.132547698\»\:\=\<\?\>\[\]\;\_\|\©\{\}\«]
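To see why the escaping matters, here is a small sketch with a made-up three-character set (not your symbols). Without re.escape, the "^" negates the class and the "]" closes it early, so the pattern silently means something else:

```python
import re

chars = '^-]'        # hypothetical set: all three are special inside [...]
text = 'a^b-c]d'     # hypothetical input

# Unescaped: the pattern is [^-]] = "any character except '-', then ']'".
unescaped = "[{0}]".format(chars)
# Escaped: the pattern is a class containing exactly '^', '-' and ']'.
escaped = "[{0}]".format(re.escape(chars))

print(re.sub(unescaped, "", text))  # a^b-d
print(re.sub(escaped, "", text))    # abcd
```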
Give it a try:
import re
symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'
regex = "[{0}]+".format(re.escape(symbols))
example = '''" :621 " :621 :1 ;" _ " :594 :25 4 8 0 :23 "സര്ക്കാര്ജീവനക്കാരുടെ ശമ്പളം അറിയാന് ഭാര്യമാര്ക്ക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്
:621 :4 0 3 0 ;" _ " :551 :16 :3 "'''
# Pass flags by keyword: the fourth positional argument of re.sub is count.
print(re.sub(regex, "", example, flags=re.UNICODE))
Note that zero isn't in the symbols set but space is, so the result will be:
'''0സര്ക്കാര്ജീവനക്കാരുടെശമ്പളംഅറിയാന്ഭാര്യമാര്ക്ക്അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്
00'''
I think the correct symbols set is: !#"%&)(+-,/.0132547698»:=<?>[];_|©{}«
(zero added, space removed).
Then you can strip each line to remove the leading and trailing whitespace.
So this code snippet should work for you:
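import re

symbols = '!#"%&)(+-,/.0132547698»:=<?>[];_|©{}«'
regex = "[{0}]+".format(re.escape(symbols))
sub_symbols = re.compile(regex, re.UNICODE).sub

# 'a' appends to the output file; the input file is read line by line.
with open('/home/corpus/All12.txt', 'a') as t:
    with open('/home/corpus/All11.txt', 'r') as n:
        for line in n:
            # Remove the symbols, then strip the surrounding whitespace.
            data = sub_symbols("", line).strip()
            # strip() also removed the newline, so write it back.
            t.write(data + "\n")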
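If you want to check the cleanup logic without touching the corpus files, here is a minimal sketch that runs the same substitution on in-memory text. The helper name clean_lines and the sample text are mine, only for illustration:

```python
import io
import re

symbols = '!#"%&)(+-,/.0132547698»:=<?>[];_|©{}«'
sub_symbols = re.compile("[{0}]+".format(re.escape(symbols)), re.UNICODE).sub

def clean_lines(lines):
    """Yield each line with the symbols removed and outer whitespace stripped."""
    for line in lines:
        yield sub_symbols("", line).strip()

# io.StringIO stands in for the input file here.
source = io.StringIO('" :621 ;" _ hello, world! 42\n:594 :25 (second) line\n')
cleaned = list(clean_lines(source))
print(cleaned)  # ['hello world', 'second line']
```

Because clean_lines accepts any iterable of lines, you could pass it an open file object instead of the StringIO.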