1

I have a text file that contains Unicode text, approximately 2GB in size. I tried to remove all symbols using the following code:

import re
symbols = re.compile(r'[{} &+( )" =!.?.:.. / |  » © : >< #  «  ,] 1 2 3 4 5 6 7 8 9 _ - + ; [ ]  %',flags=re.UNICODE)

with open('/home/corpus/All12.txt','a') as t:
    with open('/home/corpus/All11.txt', 'r') as n:
        data = n.readline()          
        data = symbols.sub(" ", data)          
        t.write(data)

A small file for testing the code:

:621   

"

    :621       "
    :621               :1                ;"
     _            "         :594            :25   4   8   0        :23          "സര്‍ക്കാര്‍ജീവനക്കാരുടെ ശമ്പളം അറിയാന്‍ ഭാര്യമാര്‍ക്ക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍    
    :621            :4   0   3   0  ;"
     _           "         :551             :16        :3  " 

     :12     :70                ;"                  "             "     =""                   "               "     =""                     "            "     ="" +    


     _                       "         :541             :26       :30   45   5   35  " 
 ='                  'ന്യൂഡല്‍ഹി: സര്‍ക്കാര്‍ജീവനക്കാരായ ഭര്‍ത്താക്കന്മാരുടെ ശമ്പളം 

The desired output is ന്യൂഡല്‍ഹി സര്‍ക്കാര്‍ജീവനക്കാരായ ഭര്‍ത്താക്കന്മാരുടെ ശമ്പളം. The code is not working: it freezes my computer.

Can I solve this problem without regular expressions?

  • 1
    Are you sure the code is not functioning? Maybe it's just taking a long time, which is likely given that you're reading a 2GB text file. Try adding a `print` inside your loop. – DanielGibbs Nov 21 '14 at 13:38
  • 1
    Try reading the file by chunks http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python – Tanveer Alam Nov 21 '14 at 13:39
  • @DanielGibbs I am sure. It quits after running for some time. There is no loop here. –  Nov 21 '14 at 13:40
  • I changed it; it's still not working. –  Nov 21 '14 at 13:46
  • 2
    I suggest you produce a small sample file to test on, otherwise you will find it very difficult to isolate the problem. Consider splitting the script into three functions (read the data in, process the data, write the data out) so you can test each in isolation. – jonrsharpe Nov 21 '14 at 13:54
  • 1
    How many lines in your 2 GiB file? If there's only a few (or one...), iterating one line at a time won't help you much. – Kevin J. Chase Nov 26 '14 at 08:16
  • It contains more than 10 million lines –  Nov 26 '14 at 09:57
  • So, you only want to preserve letters and whitespaces? Do you want to remove consecutive whitespaces too? – Casimir et Hippolyte Nov 26 '14 at 19:08
  • Yes, I want to remove all whitespace between lines too. –  Nov 28 '14 at 10:57
  • Could you please format your example code properly? And also, what exactly is the desired output? – Vyktor Nov 29 '14 at 14:59
  • What does this `Unicode texts sizing ` mean in relation to regex? –  Dec 01 '14 at 20:36

7 Answers

2

You need to put every symbol you want to replace inside square brackets [], escaping special characters such as [ and ] themselves, the single quote ' and the backslash \. The regex is r'[-0-9{}&+()"=!.?:/|»©><#«,_+;%\[\]@$*\'\\^`~\n\t]'.

Demo:

>>> st='1234567890-=[]\;,./\'!@#$%^&*()_+{}|":<>?//.,`~ajshgasd'
>>> print re.sub(r'[-0-9{}&+()"=!.?:/|»©><#«,_+;%\[\]@$*\'\\^`~\n\t]','',st)
ajshgasd

On file:

>>> fp=open('file.txt','r')    
>>> for line in fp:
...     if line.strip() == '': continue  # strip() removes leading and trailing spaces
...     print re.sub(r'[-0-9{}&+()"=!.?:/|»©><#«,_+;%\[\]@$*\'\\^`~]','',line).strip(),
... 
    ന്യൂഡല്‍ഹി സര്‍ക്കാര്‍ജീവനക്കാരായ ഭര്‍ത്താക്കന്മാരുടെ ശമ്പളം

For writing output to a file use this code:

of=open('outfile.txt','w')
fp=open('file.txt','r')
for line in fp:
    if line.strip() == '': continue  # strip() removes leading and trailing spaces
    rline = re.sub(r'[-0-9{}&+()"=!.?:/|»©><#«,_+;%\[\]@$*\'\\^`~]','',line).strip()
    if rline == '': continue # skip empty lines
    of.write(rline+'\n')

of.close()
fp.close()
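
If you are on Python 3, a minimal sketch of the same line-by-line loop with explicit encodings is below (assuming the files are UTF-8; same file names and character class as above):

import io
import re

# Same character class as above; io.open lets us be explicit about UTF-8.
pattern = re.compile(r'[-0-9{}&+()"=!.?:/|»©><#«,_+;%\[\]@$*\'\\^`~]')

with io.open('file.txt', 'r', encoding='utf-8') as fp, \
     io.open('outfile.txt', 'w', encoding='utf-8') as of:
    for line in fp:
        rline = pattern.sub('', line).strip()
        if rline:  # skip lines that end up empty
            of.write(rline + '\n')
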
Irshad Bhat
  • This code removes the symbols, but using fp.read() means loading the whole file into memory, which is not practical for me. –  Nov 27 '14 at 11:22
  • 1
    @karu, `fp.read()` was just to show you that it worked on the sample you provided. You don't have to load the whole file into memory. You need to open the file using `fp=open('file.txt','r')` and then iterate over it line-by-line with `for line in fp: ...`, applying the **regex** to each line as shown. This doesn't load the whole file into memory; it loads a single line at a time. Forget about the `fp.read()` and `fp.seek(0)` part in my code and use the **other three code-lines**. This will work for you. – Irshad Bhat Nov 27 '14 at 11:31
  • Okay, let me try. What happens if there are empty lines in the file? –  Nov 27 '14 at 13:05
  • You can skip empty lines by this condition `if line.strip()=='': continue`. I'll update this in code. – Irshad Bhat Nov 27 '14 at 13:11
  • The code is working; how do I write it to a file without removing the spaces? –  Nov 27 '14 at 14:58
  • @karu I've uploaded the code to write output to a file. – Irshad Bhat Nov 27 '14 at 15:12
1

str.translate can be used instead of re.sub. It takes a mapping from Unicode ordinals to replacements and returns the translated string. If a replacement is None, that character is deleted. str.maketrans can be used to generate the mapping.

In Python 3, also remember to specify the encoding of the files. I used UTF-8 for testing:

#!python3
#coding: utf8
symbols = ' {}&+()"=!.?.:../|»©:><#«,123456789_-+;[]%'
D = str.maketrans('','',symbols)
with open('All12.txt','a',encoding='utf8') as t, open('All11.txt','r',encoding='utf8') as n:
    for line in n:
        t.write(line.translate(D))

Just list whatever symbols you want to delete in symbols.
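
For a quick sanity check of what the translation table does, here is a small interactive example (assuming Python 3):

>>> symbols = ' {}&+()"=!.?.:../|»©:><#«,123456789_-+;[]%'
>>> D = str.maketrans('', '', symbols)
>>> '":621 ;" _ ശമ്പളം'.translate(D)
'ശമ്പളം'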

Alternatively, you can read the file in blocks of characters, which will be more efficient than reading over 10 million lines individually. Read the file in, for example, about twenty 100MB blocks instead.

#!python3
#coding: utf8
symbols = ' {}&+()"=!.?.:../|»©:><#«,123456789_-+;[]%'
D = str.maketrans('','',symbols)
with open('All12.txt','a',encoding='utf8') as t, open('All11.txt','r',encoding='utf8') as n:
    while True:
        block = n.read(100*1024*1024)
        if not block:
            break
        t.write(block.translate(D))

Ref: str.translate, str.maketrans

Mark Tolonen
0

The part of the regex after the list of symbols between the first [ and ] makes no sense to me. It will not strip symbols, but will only remove a symbol followed by '1 2 3 4 5 6 7 8 9 _ - + ; [ ]  %'. In other words, the re.sub will not do anything. But in any case, your code runs on 3.4.2, Win7.

import re
symbols = re.compile(r'[{} &+( )" =!.?.:.. / |  » © : >< #  «  ,]'
                     '1 2 3 4 5 6 7 8 9 _ - + ; [ ]  %',flags=re.UNICODE)
text = ('''" :621 " :621 :1 ;" _ " :594 :25 4 8 0 :23'''
        '''"സര്‍ക്കാര്‍ജീവനക്കാരുടെ ശമ്പളം അറിയാന്‍'''
        '''ഭാര്യമാര്‍ക്ക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍\n'''
        ''':621 :4 0 3 0 ;" _ " :551 :16 :3 " ''')
data = symbols.sub(" ", text)          
print(data == text)  # True

PS. with statements can have multiple clauses (to save indent levels).

with open('/home/corpus/All12.txt','a') as t,\
     open('/home/corpus/All11.txt', 'r') as n:
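
Putting the two points together, here is a sketch of what the intended script might look like. The character class below is an assumption built from the symbols the question lists, and Python 3 is assumed for the encoding argument:

import re

# One character class covering the symbols listed in the question
# (punctuation, digits 1-9, quotes, brackets, etc.); adjust as needed.
symbols = re.compile(r'[{} &+()"=!?.:/|»©><#«,1-9_\-+;\[\]%]', flags=re.UNICODE)

with open('/home/corpus/All12.txt', 'a', encoding='utf8') as t, \
     open('/home/corpus/All11.txt', 'r', encoding='utf8') as n:
    for line in n:
        cleaned = symbols.sub(' ', line).strip()
        if cleaned:  # skip lines that end up empty
            t.write(cleaned + '\n')
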
Terry Jan Reedy
0
[{} &+( )" =!.?.:.. / |  » © : >< #  «  , 1 2 3 4 5 6 7 8 9 _ - + ; \[ \]  %]

Try this and replace matches with an empty string. See the demo:

http://regex101.com/r/oE6jJ1/18

import re
p = re.compile(ur'[{} &+( )" =!.?.:.. / | » © : >< # « , 1 2 3 4 5 6 7 8 9 _ - + ; \[ \] %]', re.IGNORECASE | re.UNICODE)
test_str = u" :621 \" :621 :1 ;\" _ \" :594 :25 4 8 0 :23 \"സര്‍ക്കാര്‍ജീവനക്കാരുടെ ശമ്പളം അറിയാന്‍ ഭാര്യമാര്‍ക്ക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍\n:621 :4 0 3 0 ;\" _ \" :551 :16 :3"
subst = u""

result = re.sub(p, subst, test_str)
vks
0

Solution WITHOUT REGEX:

You can use the map function along with a set of symbols you want to remove to accomplish this.

def removeSymbols(text,symbols):
    return "".join(map(lambda x: "" if x in symbols else x,text))

>>> string = '''" :621 \" :621 :1 ;\" _ \" :594 :25 4 8 0 :23 \"സര്‍ക്കാര്‍ജീവനക്കാരുടെ ശമ്പളം അറിയാന്‍ ഭാര്യമാരക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍\n:621 :4 0 3 0 ;\" _ \" :551 :16 :3"'''    

>>> symbols = set('[{} &+( )" =!.?.:.. / |  » © : >< #  «  ,] 1 2 3 4 5 6 7 8 9 _ - + ; [ ]  %')

>>> cleanString = removeSymbols(string,symbols)

>>> print(cleanString)

'" :621 " :621 :1 ;" _ " :594 :25 4 8 0 :23 "സര്\u200dക്കാര്\u200dജീവനക്കാരുടെ ശമ്പളം അറിയാന്\u200d ഭാര്യമാര്\u200dക്ക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്\u200d\n:621 :4 0 3 0 ;" _ " :551 :16 :3"'
dannyha
0

I think your regular expression is not correct, since it can be simplified. For example, the sub-expression [{} &+( )" =!.?.:.. / | » © : >< # « ,] can be simplified to [ !"#&()+,./:<=>?{|}©«»]: just keep each character once, because [] indicates a set of characters. Take a look at the chapter "Regular expression operations" in the Python documentation: https://docs.python.org/3.4/library/re.html

In the title of your message, you wrote: "Removing symbols from a large unicode text file", so I think that you have a set of characters you want to remove from your file.

To simplify your set of symbols, you can try:

>>> symbols = "".join(frozenset(r'[{} &+( )" =!.?.:.. / |  » © : >< #  «  ,] 1 2 3 4 5 6 7 8 9 _ - + ; [ ]  %'))
>>> print(symbols)
! #"%&)(+-,/.132547698»:=<?>[];_|©{}«

That way you can simply write:

symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'

Note for the readers: it is not obvious, but all the strings here are Unicode strings; I think the author uses Python 3. For Python 2.7 users, the best way is to use the "utf8" encoding declaration and the u"" syntax, like this:

# -*- coding: utf8 -*-
symbols = u'! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'

Alternatively, you can import unicode_literals, and drop the "u" prefix:

# -*- coding: utf8 -*-
from __future__ import unicode_literals
symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'

If you want to write a regular expression which matches one symbol, you have to escape the characters with special meanings (for example, "[" should be escaped as "\["). The best way is to use the re.escape function.

>>> import re
>>> symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'
>>> regex = "[{0}]".format(re.escape(symbols))
>>> print(regex)
[\!\ \#\"\%\&\)\(\+\-\,\/\.132547698\»\:\=\<\?\>\[\]\;\_\|\©\{\}\«]

Just have a try:

import re

symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'
regex = "[{0}]+".format(re.escape(symbols))

example = '''" :621 " :621 :1 ;" _ " :594 :25 4 8 0 :23 "സര്‍ക്കാര്‍ജീവനക്കാരുടെ ശമ്പളം അറിയാന്‍ ഭാര്യമാര്‍ക്ക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍
:621 :4 0 3 0 ;" _ " :551 :16 :3 "'''

print(re.sub(regex, "", example, flags=re.UNICODE))

Note that zero isn't in the symbols set but spaces are, so the result will be:

'''0സര്‍ക്കാര്‍ജീവനക്കാരുടെശമ്പളംഅറിയാന്‍ഭാര്യമാര്‍ക്ക്അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍
00'''

I think the correct symbols set is: !#"%&)(+-,/.0132547698»:=<?>[];_|©{}«. Then you can strip each line to remove trailing white spaces...

So this code snippet should work for you:

import re

symbols = '!#"%&)(+-,/.0132547698»:=<?>[];_|©{}«'
regex = "[{0}]+".format(re.escape(symbols))
sub_symbols = re.compile(regex, re.UNICODE).sub

with open('/home/corpus/All12.txt', 'a') as t:
    with open('/home/corpus/All11.txt', 'r') as n:
        for data in n:
            data = sub_symbols("", data).strip()
            if data:
                t.write(data + '\n')
Laurent LAPORTE
  • If you had a bytestring instead of a unicode string, you could use memory-mapped file support (`mmap`) instead, see: https://docs.python.org/3.4/library/mmap.html. – Laurent LAPORTE Nov 27 '14 at 22:10
0

Have you considered decoding the Unicode, such as:

line = line.decode('utf_8')

then re-encoding to, let's say, ASCII while ignoring characters it doesn't know, such as:

line = line.encode('ascii', 'ignore')

Not sure that's any faster or better. Regular expressions are slow, but I don't know empirically that this is better. It's pretty easy though ;)

Probably O(2n) complexity (combined), but a long regular expression might be just as bad.

UPDATE: This is wrong as pointed out below.

shark3y
  • This won't work. It'll ignore all non-ascii characters which isn't the desired output. – Irshad Bhat Dec 01 '14 at 10:24
  • We need just the opposite. –  Dec 01 '14 at 12:07
  • Sorry, I must have read this wrong. You could consider doing the same thing but encoding to an encoding that contains only those characters, which seems very possible considering they are clearly a specific foreign language (see the sketch below). – shark3y Dec 01 '14 at 20:45
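
A rough sketch of the idea in that last comment, using a keep-list regex instead of a custom codec: it assumes everything worth keeping falls in the Malayalam Unicode block (U+0D00 to U+0D7F) plus the zero-width joiner and whitespace, which may not hold for every corpus.

# -*- coding: utf8 -*-
import re

# Delete every run of characters that is not Malayalam, ZWJ, or whitespace.
# The "Malayalam only" assumption is illustrative, not from the original post.
keep_malayalam = re.compile(u'[^\u0D00-\u0D7F\u200D\\s]+', re.UNICODE)

line = u'" :621 ;  ന്യൂഡല്‍ഹി: സര്‍ക്കാര്‍ജീവനക്കാരായ'
print(keep_malayalam.sub(u'', line).strip())
# ന്യൂഡല്‍ഹി സര്‍ക്കാര്‍ജീവനക്കാരായ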