
Is there a way to speed up this regex code? The file is really large and will not open in Excel because of its size.

import regex as re

path = "C:/Users/.../CDPH/"
with open(path + 'Thefile.tab') as file:
    data = file.read()
    # remove all spaces that come before a tab or newline character
    data = re.sub(r'( )*(?=\n)|( )*(?=\t)', '', data)
with open(path + 'Data.csv', 'w') as file:
    file.write(data)
Shane S
    "Stream" it (e.g using readLine) instead of loading the entire file, then updating the entire thing, then writing the entire thing? e.g. https://stackoverflow.com/questions/8009882/how-to-read-a-large-file-line-by-line – Mike 'Pomax' Kamermans Mar 22 '23 at 21:14
  • What do you want to achieve? – Jean-François Fabre Mar 22 '23 at 21:16
  • I've found that the best way to speed up regex is to replace regex with my own parsing. It's never failed me. In what format is the file? Doesn't a parser for that file format already exist? – Ted Lyngmo Mar 22 '23 at 21:16
  • Well, load it in blocks ([`.read()` will take a `size` parameter!](https://docs.python.org/3/library/io.html#io.RawIOBase.read)), especially with such a straightforward replacement. If you can read it by lines, that'll be better too, as you can let Python deal with the joints (i.e. splitting on a start and end space or similar) – ti7 Mar 22 '23 at 21:17
  • Are you trying to simulate `str.rstrip`? – Andrej Kesely Mar 22 '23 at 21:18
  • All you really need is `line.lstrip(' ').rstrip('\n').rstrip('\t').lstrip('=')` – inspectorG4dget Mar 22 '23 at 21:19
  • When I try to load the data with pandas I get errors because some of the spaces end up being in columns that are floats or integers. So what I did is replace all spaces between tab characters. – Shane S Mar 22 '23 at 21:21
  • What is a _"space bar"_ ? Do you simply mean _space_? – Ted Lyngmo Mar 22 '23 at 21:21
  • @TedLyngmo the " " character. I did not mean newline. – Shane S Mar 22 '23 at 21:23
  • So, simply _space_ (or perhaps _whitespace_) then? You'll find the _space bar_ on your keyboard. It's not going to find its way into your files. – Ted Lyngmo Mar 22 '23 at 21:23
  • Again, what's the file format? Instead of inventing your own parser with regex, you can probably use a parser specifically written to deal with these kinds of files very efficiently. – Ted Lyngmo Mar 22 '23 at 21:28
  • @TedLyngmo The file is a tab separated file. It looks like a csv but tab character instead. – Shane S Mar 22 '23 at 21:30
  • Did you try Pythons [`csv`](https://docs.python.org/3/library/csv.html) library? – Ted Lyngmo Mar 22 '23 at 21:31
  • @TedLyngmo no I did not. – Shane S Mar 22 '23 at 21:32
  • Ok, I'd start with that. When you get that going, I'm pretty sure it'll be considerably faster than parsing this huge file using regex - and it'll most probably be less error prone too. – Ted Lyngmo Mar 22 '23 at 21:33
  • @inspectorG4dget I want to try your recommended solution. Are you saying I would use that line instead of `re.sub('( )*(?=\n)|( )*(?=\t)', '', data )`? – Shane S Mar 22 '23 at 21:35
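The streaming approach suggested in the comments above can be sketched like this (the function names and file paths here are placeholders, not from the question; it assumes only runs of plain spaces before a tab or newline need removing, matching the original regex's intent):

```python
import re

# Compiled once, outside the loop: one or more spaces
# immediately before a tab or newline
TRAILING = re.compile(r' +(?=[\t\n])')

def clean_line(line: str) -> str:
    """Remove space runs that sit just before a tab or newline."""
    return TRAILING.sub('', line)

def clean_file(src: str, dst: str) -> None:
    # Stream the file: only one line is held in memory at a time,
    # instead of read()-ing the whole file into a single string
    with open(src) as infile, open(dst, 'w') as outfile:
        for line in infile:
            outfile.write(clean_line(line))
```

This keeps memory usage constant regardless of file size, which also avoids the cost of building one huge result string.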

1 Answer

Not knowing the exact dialect of the tab-separated file, I have to take a guess. You'll find a lot more options in the csv library documentation.

Here's what I would try to speed up the right trimming of the fields:

#!/usr/bin/python

import csv

with open('Data.csv', 'w', newline='') as outfile:
    with open('Thefile.tab', newline='') as infile:
        rd = csv.reader(infile, delimiter='\t')
        wr = csv.writer(outfile, delimiter='\t')
        for row in rd:
            row = [field.rstrip() for field in row]
            wr.writerow(row)
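The same trimming logic can be checked in memory without touching the disk (the sample data below is made up; `lineterminator='\n'` is set so the result is easy to compare):

```python
import csv
import io

# Hypothetical sample with stray spaces before tabs and newlines
sample = "1 \tfoo \t2.5 \n3\tbar\t4.0 \n"

out = io.StringIO()
rd = csv.reader(io.StringIO(sample), delimiter='\t')
wr = csv.writer(out, delimiter='\t', lineterminator='\n')
for row in rd:
    # Right-trim every field, exactly as in the file-based version above
    wr.writerow([field.rstrip() for field in row])

print(out.getvalue())  # each field comes out right-trimmed
```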
Ted Lyngmo
  • The result of using this code is 4 times faster than my prior version. Good improvement, thank you. – Shane S Mar 22 '23 at 23:55