12

Previously, I had been cleaning out data using the code snippet below:

import unicodedata, re, io

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))
def rm_control_chars(s): # see http://www.unicode.org/reports/tr44/#General_Category_Values
    return cc_re.sub('', s)

cleanfile = []
with io.open('filename.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)

There are newline characters in the file that I want to keep.

The following records the time taken for cc_re.sub('', s) on the first few lines (the first column is the time taken in seconds and the second column is len(s)):

0.275146961212 251
0.672796010971 614
0.178567171097 163
0.200030088425 180
0.236430883408 215
0.343492984772 313
0.317672967911 290
0.160616159439 142
0.0732028484344 65
0.533437013626 468
0.260229110718 236
0.231380939484 204
0.197766065598 181
0.283867120743 258
0.229172945023 208

As @ashwinichaudhary suggested, I tried s.translate(dict.fromkeys(control_chars)); the same timing log outputs:

0.464188098907 252
0.366552114487 615
0.407374858856 164
0.322507858276 181
0.35142993927 216
0.319973945618 314
0.324357032776 291
0.371646165848 143
0.354818105698 66
0.351796150208 469
0.388131856918 237
0.374715805054 205
0.363368988037 182
0.425950050354 259
0.382766962051 209

But the code is really slow for my 1 GB of text. Is there any other way to clean out control characters?

alvas
  • Why do you keep the whole file in memory? – Karoly Horvath May 11 '15 at 14:07
  • I need to do other processing later (I need to later select the cleaned sentence based on some criteria and then do even more processing on the selected sentence). Memory isn't an issue. The `re.sub` is a bottleneck – alvas May 11 '15 at 14:09
  • Did you try *not* using regular expressions, but just the standard `replace`? REs are good for complicated patterns, but I suspect replace is more efficient for this. Also, I'd try to find a way to divide your original 1GB text into sections - that should also improve the algorithm a *lot*. – jcoppens May 11 '15 at 14:14
  • But I need to do an iteration of the replaces across the set of characters. – alvas May 11 '15 at 14:15
  • Yes, correct. The RE does the same thing, but (probably) less efficiently – jcoppens May 11 '15 at 14:16
  • So I do something like: `def rp_control_chars(s): for cc in control_chars: s = s.replace(cc, '')` – alvas May 11 '15 at 14:17
  • 3
    You should try with `str.translate`. – Ashwini Chaudhary May 11 '15 at 14:18
  • Just try it out. But sectioning should be a lot more effective. – jcoppens May 11 '15 at 14:18
  • @ashwinichaudhary, how do I use the `maketrans` when my target string is always empty? I ran into `maketrans arguments must have same length` when I use `string.maketrans(u'\u0081\u0080', '')` – alvas May 11 '15 at 14:24
  • 1
    @alvas For `unicode.translate` this should do it: `s.translate(dict.fromkeys(control_chars))` – Ashwini Chaudhary May 11 '15 at 14:29
  • `s.translate()` takes on average 0.40923500061 secs per line because it has to iterate through all `control_chars` for each line. That adds up to quite a lot of time (~111 mins) for, let's say, 1 million lines. – alvas May 11 '15 at 14:38
  • The original regex sub seems faster. Also, it seems to be giving me different outputs =( – alvas May 11 '15 at 14:53
  • Ah, the difference in output is merely the cleaning out of `\n` by the regex method. – alvas May 11 '15 at 15:22
  • @alvas That's not true (http://ideone.com/xGZITp); it looks up the dictionary for each item in the line (string), and that's an O(1) operation. Instead of doing this per line, read the file in chunks, something that can fit in your cache memory. And I forgot, you must call `ord()` on the keys (check the ideone example), and make sure **you're not creating that dictionary each time**. – Ashwini Chaudhary May 11 '15 at 16:37
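
For reference, here is a minimal sketch of the approach suggested in the comments (assuming Python 2, as in the question): build the ordinal-keyed deletion table once and reuse it for every line. Keeping u'\n' is an assumption, since the question says newlines should be preserved.

import io, unicodedata

# Build the deletion table once: ordinals of every category-C character,
# mapped to None so that unicode.translate() drops them.
all_chars = (unichr(i) for i in xrange(0x110000))
del_table = dict.fromkeys(ord(c) for c in all_chars
                          if unicodedata.category(c)[0] == 'C' and c != u'\n')

cleanfile = []
with io.open('filename.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        cleanfile.append(line.translate(del_table))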

6 Answers

7

I found a solution working character by character; I benchmarked it using a 100K file:

import unicodedata, re, io
from time import time

# This randomly generates a file to test the script

from string import lowercase
from random import random

all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = [c for c in all_chars if unicodedata.category(c)[0] == 'C']
chars = (list(u'%s' % lowercase) * 115117) + control_chars

fnam = 'filename.txt'

out=io.open(fnam, 'w')

for line in range(1000000):
    out.write(u''.join(chars[int(random()*len(chars))] for _ in range(600)) + u'\n')
out.close()


# version proposed by alvas
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))
def rm_control_chars(s):
    return cc_re.sub('', s)

t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars(line)
        cleanfile.append(line)
out=io.open(fnam + '_out1.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0

# using a set and checking character by character
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = set(c for c in all_chars if unicodedata.category(c)[0] == 'C')
def rm_control_chars_1(s):
    return ''.join(c for c in s if c not in control_chars)

t0 = time()
cleanfile = []
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        line = rm_control_chars_1(line)
        cleanfile.append(line)
out=io.open(fnam + '_out2.txt', 'w')
out.write(''.join(cleanfile))
out.close()
print time() - t0

The output is:

114.625444174
0.0149750709534

I tried it on a 1 GB file (only for the second one) and it took 186 s.

I also wrote this other version of the same script, which is slightly faster (176 s) and more memory-efficient (for very large files that don't fit in RAM):

t0 = time()
out=io.open(fnam + '_out5.txt', 'w')
with io.open(fnam, 'r', encoding='utf8') as fin:
    for line in fin:
        out.write(rm_control_chars_1(line))
out.close()
print time() - t0
fransua
  • Can you give some explanation of `chars = (list(u'%s' % lowercase) * 115117) + control_chars` and also `u''.join(chars[int(random()*len(chars))] for _ in range(600)) + u'\n'`? – alvas May 26 '15 at 14:01
  • Yes, the first generates a list of characters with lowercase letters and all control characters (I just multiply the lowercase by 115117 in order to match the size of the control chars); I end up with a huge list of chars. The second part consists of randomly picking a given number of characters from the previous list in order to build up a file. All this is just to produce a file to test how fast, and still accurate, the functions that remove control characters are... I could have removed it from the answer but thought it could help you to check my solution or help others to answer more quickly – fransua May 26 '15 at 22:03
5

Since, in UTF-8, the ASCII control characters are encoded in a single byte and are below 32, I suggest this fast piece of code:

#!/usr/bin/python
import sys

ctrl_chars = [x for x in range(0, 32) if x not in (ord("\r"), ord("\n"), ord("\t"))]
filename = sys.argv[1]

with open(filename, 'rb') as f1:
  with open(filename + '.txt', 'wb') as f2:
    b = f1.read(1)
    while b != '':
      if ord(b) not in ctrl_chars:
        f2.write(b)
      b = f1.read(1)

Is it OK enough?
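
A chunked variant of the same idea might be faster than reading one byte at a time; this is only a sketch (the 1 MB block size is arbitrary), using str.translate with a delete table built from the same byte values:

#!/usr/bin/python
import sys

# Bytes 0-31 except \t, \n and \r; bytes below 32 never occur inside UTF-8
# multi-byte sequences, so deleting them block by block is safe.
delete_bytes = ''.join(chr(x) for x in range(0, 32)
                       if x not in (ord("\r"), ord("\n"), ord("\t")))
filename = sys.argv[1]

with open(filename, 'rb') as f1:
  with open(filename + '.txt', 'wb') as f2:
    while True:
      block = f1.read(1024 * 1024)
      if not block:
        break
      f2.write(block.translate(None, delete_bytes))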

Cyrille Pontvieux
4

Does this have to be in Python? How about cleaning the file before you read it into Python in the first place? Use sed, which will treat it line by line anyway.

See removing control characters using sed.

And if you pipe it out to another file, you can open that. I don't know how fast it would be, though. You can do it in a shell script and test it. According to this page, sed handles about 82M characters per second.

Hope it helps.
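
For what it's worth, here is a sketch of the pipe idea driven from Python (the file names are illustrative): the control bytes are embedded literally in the bracket expression, so it does not depend on any sed escape extensions, and it strips the ASCII control characters except \t, \n and \r.

import subprocess

# Literal ASCII control bytes 1-31, minus tab, newline and carriage return.
controls = ''.join(chr(x) for x in range(1, 32) if x not in (9, 10, 13))

with open('filename.txt', 'rb') as fin:
    with open('filename.cleaned.txt', 'wb') as fout:
        subprocess.check_call(['sed', 's/[%s]//g' % controls],
                              stdin=fin, stdout=fout)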

Rcynic
3

Do you want it to move really fast? Break your input into multiple chunks, wrap that data-munging code up as a function, and use Python's multiprocessing package to parallelize it, writing to some common text file. Going character by character is the easiest way to crunch stuff like this, but it always takes a while.

https://docs.python.org/3/library/multiprocessing.html
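
A minimal sketch of that idea, reusing the regex-based cleaner from the question (the chunk size, worker count and file names are illustrative, and Python 2 is assumed as in the question):

import io, re, unicodedata
from multiprocessing import Pool

# The regex-based cleaner from the question, at module level so that the
# worker processes can see it.
all_chars = (unichr(i) for i in xrange(0x110000))
control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C')
cc_re = re.compile('[%s]' % re.escape(control_chars))

def clean_chunk(lines):
    return [cc_re.sub('', line) for line in lines]

def chunks(f, n=10000):
    # Yield lists of up to n lines at a time.
    buf = []
    for line in f:
        buf.append(line)
        if len(buf) == n:
            yield buf
            buf = []
    if buf:
        yield buf

if __name__ == '__main__':
    pool = Pool(4)
    with io.open('filename.txt', 'r', encoding='utf8') as fin:
        with io.open('filename.clean.txt', 'w', encoding='utf8') as fout:
            for cleaned in pool.imap(clean_chunk, chunks(fin)):
                fout.write(u''.join(cleaned))
    pool.close()
    pool.join()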

manglano
1

I'm surprised no one has mentioned mmap, which might just be the right fit here.

Note: I'll put this in as an answer in case it's useful, and I apologize that I don't have the time to actually test and compare it right now.

You load the file into memory (kind of), and then you can actually run a re.sub() over the object. This helps eliminate the IO bottleneck and allows you to change the bytes in place before writing the file back out at once.

After this, you can experiment with str.translate() vs. re.sub(), and also include any further optimisations like double-buffering CPU and IO or using multiple CPU cores/threads.

But it'll look something like this:

import mmap

f = open('test.out', 'rb')  # binary mode; mmap works on bytes
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

A nice excerpt from the mmap documentation:

...You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they're mutable, you can change a single character by doing obj[index] = 'a'...
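
As a minimal, untested sketch of that (the file name is illustrative): map the file read-only and let re scan the mapped bytes directly; here it just counts the ASCII control bytes without reading the whole file into a Python string.

import mmap, re

f = open('test.out', 'rb')
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# re can search a memory-mapped file directly, per the docs quoted above.
n_controls = sum(1 for _ in re.finditer(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', m))
m.close()
f.close()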

Ross
0

A couple of things I would try.

First, do the substitution with a replace-all regex.

Second, set up a regex character class with known control-char ranges instead of a class of individual control chars. (This is in case the engine doesn't optimize the class into ranges. A range requires two conditionals at the assembly level, as opposed to an individual conditional for each char in the class.)

Third, since you are removing the characters, add a greedy quantifier after the class. This removes the need to re-enter the substitution routine after every single-char match, instead grabbing all adjacent chars at once.

I don't know Python's syntax for regex constructs off the top of my head, nor all the control codes in Unicode, but the result would look something like this:

[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+

The largest amount of time would be spent copying the results to another string. The smallest amount of time would be spent finding all the control codes, which would be minuscule.

All things being equal, the regex (as described above) is the fastest way to go.
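
A sketch of that suggestion in Python (file names are illustrative): one ranged class with a greedy quantifier, compiled once and applied per line. Note that, like the pattern above, it leaves \n and \r alone.

import io, re

ctrl_run_re = re.compile(u'[\u0000-\u0009\u000B\u000C\u000E-\u001F\u007F]+')

with io.open('filename.txt', 'r', encoding='utf8') as fin:
    with io.open('filename.clean.txt', 'w', encoding='utf8') as fout:
        for line in fin:
            fout.write(ctrl_run_re.sub(u'', line))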