
Is there a way to speed up this regex code? The file is really large and will not open in Excel because of its size.

import regex as re

path = "C:/Users/.../CDPH/"
with open(path + 'Thefile.tab') as file:
    data = file.read()
    # remove all spaces that come before a tab or newline character
    data = re.sub(r'( )*(?=\n)|( )*(?=\t)', '', data)
with open(path + 'Data.csv', 'w') as file:
    file.write(data)
Shane S
    "Stream" it (e.g using readLine) instead of loading the entire file, then updating the entire thing, then writing the entire thing? e.g. https://stackoverflow.com/questions/8009882/how-to-read-a-large-file-line-by-line – Mike 'Pomax' Kamermans Mar 22 '23 at 21:14
  • What do you want to achieve? – Jean-François Fabre Mar 22 '23 at 21:16
  • I've found that the best way to speed up regex is to replace regex with my own parsing. It's never failed me. In what format is the file? Doesn't a parser for that file format already exist? – Ted Lyngmo Mar 22 '23 at 21:16
  • Well, load it in blocks ([`.read()` will take a `size` parameter!](https://docs.python.org/3/library/io.html#io.RawIOBase.read)), especially with such a straightforward replacement. If you can read it by lines, that'll be better too, as you can let Python deal with the joints (i.e. splitting on a start and end space or similar) – ti7 Mar 22 '23 at 21:17
  • Are you trying to simulate `str.rstrip`? – Andrej Kesely Mar 22 '23 at 21:18
  • All you really need is `line.lstrip(' ').rstrip('\n').rstrip('\t').lstrip('=')` – inspectorG4dget Mar 22 '23 at 21:19
  • When I try to load the data with pandas I get errors because some of the spaces end up being in columns that are floats or integers. So what I did is replace all spaces between tab characters. – Shane S Mar 22 '23 at 21:21
  • What is a _"space bar"_ ? Do you simply mean _space_? – Ted Lyngmo Mar 22 '23 at 21:21
  • @TedLyngmo the " " character. I did not mean newline. – Shane S Mar 22 '23 at 21:23
  • So, simply _space_ (or perhaps _whitespace_) then? You'll find the _space bar_ on your keyboard. It's not going to find its way into your files. – Ted Lyngmo Mar 22 '23 at 21:23
  • Again, what's the file format? Instead of inventing your own parser with regex, you can probably use a parser specifically written to deal with these kinds of files very efficiently. – Ted Lyngmo Mar 22 '23 at 21:28
  • @TedLyngmo The file is a tab separated file. It looks like a csv but tab character instead. – Shane S Mar 22 '23 at 21:30
  • Did you try Pythons [`csv`](https://docs.python.org/3/library/csv.html) library? – Ted Lyngmo Mar 22 '23 at 21:31
  • @TedLyngmo no I did not. – Shane S Mar 22 '23 at 21:32
  • Ok, I'd start with that. When you get that going, I'm pretty sure it'll be considerably faster than parsing this huge file using regex - and it'll most probably be less error prone too. – Ted Lyngmo Mar 22 '23 at 21:33
  • @inspectorG4dget I want to try your recommended solution. Are you saying I would use that line instead of `re.sub('( )*(?=\n)|( )*(?=\t)', '', data )`? – Shane S Mar 22 '23 at 21:35
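The streaming approach suggested in the comments above can be sketched like this (the function names and file paths here are placeholders, not from the question; it assumes only runs of plain spaces before a tab or newline need removing, matching the original regex's intent):

```python
import re

# Compiled once, outside the loop: one or more spaces
# immediately before a tab or newline
TRAILING = re.compile(r' +(?=[\t\n])')

def clean_line(line: str) -> str:
    """Remove space runs that sit just before a tab or newline."""
    return TRAILING.sub('', line)

def clean_file(src: str, dst: str) -> None:
    # Stream the file: only one line is held in memory at a time,
    # instead of read()-ing the whole file into a single string
    with open(src) as infile, open(dst, 'w') as outfile:
        for line in infile:
            outfile.write(clean_line(line))
```

This keeps memory usage constant regardless of file size, which also avoids the cost of building one huge result string.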

1 Answer

Not knowing the exact dialect of the tab-separated file, I have to take a guess. You'll find a lot more options in the csv library documentation.

Here's what I would try to speed up the right trimming of the fields:

#!/usr/bin/python

import csv

with open('Data.csv', 'w', newline='') as outfile:
    with open('Thefile.tab', newline='') as infile:
        rd = csv.reader(infile, delimiter='\t')
        wr = csv.writer(outfile, delimiter='\t')
        for row in rd:
            row = [field.rstrip() for field in row]
            wr.writerow(row)
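The same trimming logic can be checked in memory without touching the disk (the sample data below is made up; `lineterminator='\n'` is set so the result is easy to compare):

```python
import csv
import io

# Hypothetical sample with stray spaces before tabs and newlines
sample = "1 \tfoo \t2.5 \n3\tbar\t4.0 \n"

out = io.StringIO()
rd = csv.reader(io.StringIO(sample), delimiter='\t')
wr = csv.writer(out, delimiter='\t', lineterminator='\n')
for row in rd:
    # Right-trim every field, exactly as in the file-based version above
    wr.writerow([field.rstrip() for field in row])

print(out.getvalue())  # each field comes out right-trimmed
```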
Ted Lyngmo
  • The result of using this code is 4 times faster than my prior version. Good improvement, thank you. – Shane S Mar 22 '23 at 23:55