
I have a very big (4+ GB) text file and a script which splits it into small files based on the characters before the first comma, e.g. a line starting with 16, goes to 16.csv and a line starting with 61, goes to 61.csv. Unfortunately the script runs for ages, I guess because of the write-out method. Is there any way to speed up the script?

import pandas as pd
import csv


with open (r"updates//merged_lst.csv",encoding="utf8", errors='ignore') as f:
    r = f.readlines()

for i in range(len(r)):
    row = r[i]
    letter = r[i].split(',')[0]
    filename = r"import//"+letter.upper()+".csv"
    with open(filename,'a',encoding="utf8", errors='ignore') as f:
        f.write(row) 
Tamas Kosa

4 Answers


I'm not sure if this really makes a huge difference, or if the bottleneck is somewhere else, but instead of opening and closing the output file for each line in the input, I would open each output file once and reuse it.

In order to keep multiple files open at the same time and use the correct one for writing, I would put them in a dictionary, using the letter as key:

files = {}

for i in range(len(r)):
    row = r[i]
    letter = r[i].split(',')[0]
    if letter not in files:
        filename = r"import//"+letter.upper()+".csv"
        files[letter] = open(filename,'a',encoding="utf8", errors='ignore')
    f = files[letter]
    f.write(row)

(Instead of checking if letter not in files you could also use files.setdefault.)
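For illustration, a minimal sketch of that variant (my sketch, not part of the answer above); note the caveat in the comment about why the explicit check can still be preferable:

files = {}

for row in r:
    letter = row.split(',')[0]
    # setdefault returns the already-opened file if this letter was seen before.
    # Note that the open(...) default is still evaluated on every line, creating
    # a throwaway file object when the key already exists, which is why the
    # explicit "if letter not in files" check can still be preferable.
    f = files.setdefault(letter, open(r"import//" + letter.upper() + ".csv",
                                      'a', encoding="utf8", errors='ignore'))
    f.write(row)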

You have to close them at the end to ensure that the contents are written to disk. Either do it manually in a loop:

for f in files.values():
    f.close()

Or you can still use a context manager (with statement), see JonSG's answer.

mkrieger1
  • I agree with this answer, but I do wonder if reading one line at a time, processing it, and writing it out would speed things up in any way. – Cow Jan 25 '23 at 14:55
  • I would add on to this that it would be better to simply read the input file line by line and process it immediately rather than `r = f.readlines()`. – Axe319 Jan 25 '23 at 14:56
  • 1
    Also, this is completely anecdotal, but I have a 1 GB text file that I just tested on. `f.readlines()` took roughly 40 seconds. Simply iterating over the file took 10. I suspect the difference is because of the large intermediate `list` that needs to be created. – Axe319 Jan 25 '23 at 15:08
  • Can you try read_csv and set a chunksize? – iohans Jan 25 '23 at 15:10

17 open files does not seem too many to manage. I would try a dictionary of file objects managed via contextlib.ExitStack(). This gives you a clean and manageable way to keep the output files open so you are not constantly reopening them, which is likely what is making things slow, or at the very least not helping.

Note: I simplified your file names just to make things a little easier to see, so you will want to fix that if you try this.

import contextlib

with contextlib.ExitStack() as stack:
    opened_files = {
        "in": stack.enter_context(open("in.csv", "rt", encoding="utf8", errors="ignore"))
    }
    for row in opened_files["in"]:
        key = row.split(",")[0]
        if key not in opened_files:
            # open each output file once and register it with the stack so it is closed automatically
            opened_files[key] = stack.enter_context(
                open(f"out/{key}.csv", "a", encoding="utf8", errors="ignore")
            )
        opened_files[key].write(row)

You could even do this if you wanted to keep the dictionary restricted to just output files:

import contextlib

with contextlib.ExitStack() as stack:
    opened_files = {}
    for row in stack.enter_context(open("in.csv", "rt", encoding="utf8", errors="ignore")):
        key = row.split(",")[0]
        if key not in opened_files:
            # again, each output file is opened once and closed by the stack at the end
            opened_files[key] = stack.enter_context(
                open(f"out/{key}.csv", "a", encoding="utf8", errors="ignore")
            )
        opened_files[key].write(row)
JonSG

Here are some performance comparisons of your original code, mkrieger1's answer, my contribution, and iohans's Pandas solution. The short version: don't use Pandas to iterate rows.

To benchmark these:

  1. I generated three sample input CSV files, each of which looks something like this:

    1,5
    8,1
    8,7
    5,10
    9,2
    4,3
    7,6
    10,7
    ...,...
    

    Both columns are random ints from 1 to 10 (a sketch of a generator for these files follows this list). The three sample CSVs grow in row count:

    CSV          Row count
    ----   ---------------
    10e5      100_000 rows
    10e6    1_000_000 rows
    10e7   10_000_000 rows
    
  2. I took your code, mkrieger1's, mine, and iohans's Pandas version:

    Version   Description                                   
    -------   ----------------------------------------------
    ver1      OP's version, baseline                        
    ver2      mkrieger1's version, with dict of output files
    ver3      my version, use csv module, read incrementally
    ver4      iohans's Pandas solution                      
    

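The sample files can be generated with something along these lines (a rough sketch, since the exact generator isn't shown here; the output file names are just placeholders):

import csv
import random

# two columns of random ints from 1 to 10; file names here are placeholders
for label, n_rows in [("10e5", 100_000), ("10e6", 1_000_000), ("10e7", 10_000_000)]:
    with open(f"input_{label}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(n_rows):
            writer.writerow([random.randint(1, 10), random.randint(1, 10)])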
Here are the results:

Version   Size   Real (s)   User (s)   Sys (s)   Mem (MB)
-------   ----   --------   --------   -------   --------
ver1      10e5       4.04       0.44      1.83      12.19
ver1      10e6      23.97       4.29     17.79      81.49
ver1      10e7     214.74      43.54    167.35     702.08
ver2      10e5       0.03       0.02      0.0       12.19
ver2      10e6       0.2        0.18      0.01      81.91
ver2      10e7       1.82       1.7       0.11     702.91
ver3      10e5       0.03       0.03      0.0        7.47
ver3      10e6       0.23       0.22      0.0        7.77
ver3      10e7       2.3        2.27      0.03       8.56
ver4      10e5      12.5       11.05      3.28      54.85
ver4      10e6     121.89      99.2      24.54      80.81
ver4      10e7    1020+        N/A       N/A        N/A
  • yours (ver1) spends a lot of time making system calls (presumably to re-open the output files); I also suspect that some (enough?) time is taken to seek to the end of each file so that "append" works

  • mkrieger1's (ver2) offers the biggest performance improvement in time. By only opening the output files once, that version spends a lot less time in sys calls. Curious, though, that the sys calls are scaling up.

  • mine (ver3) offers the biggest improvement in memory; if you have limited memory, I suspect the previous versions will also impose a time penalty as over-pressured memory is paged to disk.

    How do we get next-to-zero memory usage?

    As some comments have suggested, by incrementally reading the input and deciding what to do for each row as it's read:

    import csv
    
    def get_fname(num: str) -> str:
        return f"output_{num:>02}.csv"
    
    input_csv = "updates/merged_lst.csv"  # path to the big input file
    out_files = []  # keep track of opened files, to close at end
    writers = {}  # csv writers for individually numbered output files
    
    with open(input_csv, newline="") as f_in:
        reader = csv.reader(f_in)
    
        for row in reader:
            num = row[0]
            if num not in writers:
                f = open(get_fname(num), "w", newline="")
                out_files.append(f)
                writers[num] = csv.writer(f, lineterminator="\n")
    
            writers[num].writerow(row)
    
    for f in out_files:
        f.close()
    

    I'm also using the csv module because it can correctly handle the CSV format. If you know 100% that your CSV file does not have embedded newlines, like:

    Co1,Col2,Col3
    16,"A column, with 
    a newline",98.6
    

    then you could get away with just reading/writing line-by-line and splitting on the first comma you see. But even if you knew that, there's not much of a time penalty for just using the csv module anyway (a small demo of the difference follows this list).

  • "the Pandas solution" (ver4) will never be performant for iterating rows, Don't use Pandas to iterate rows. I aborted the 10e7 run after 17 minutes... the trend indicates that it probably would have finished in ~20 minutes. Perhaps a better tuned Pandas solution could do better than 6x slower, but it'll never beat reading a CSV row-by-row just as text or with the csv module.

Zach Young

Reading the file in sections and using to_csv will speed this script up. This example reads the big file 500,000 lines at a time.

import pandas as pd
    
r = pd.read_csv(r"updates//merged_lst.csv", chunksize=500000, encoding="utf8", errors='ignore')

for chunk in r:
    for index, row in chunk.iterrows():
        letter = row[0].split(',')[0]
        f = r"import//"+letter.upper()+".csv"
        chunk.loc[index:index].to_csv(f, mode='a', header=False, index=False)
iohans
  • This runs about 6x slower than OP's (see timings in my table). There is also one definite bug and one maybe-bug. The maybe-bug: OP wasn't clear whether the first column is non-numeric, but their examples, 16 and 61, are numeric, and if the column were purely numeric it would be read into the df as numpy.int64, which cannot be split() or upper()-ed. The definite bug: as-is, read_csv() will consume the first row of the input as a header and mess up the results. – Zach Young Jan 25 '23 at 22:03
  • Can definitely fix the header bug by setting read_csv to ignore the header, but I am so glad you tested it. I will take this approach off the table! – iohans Jan 26 '23 at 03:16
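For completeness, a sketch of how the two fixes discussed in the comments above might look (header=None so the first data row isn't consumed as a header, dtype=str so the first column can be upper()-ed; these parameter choices are assumptions rather than code from the answer, and per the benchmarks the approach would still be slow):

import pandas as pd

# assumed fix: don't treat the first data row as a header, and read columns as strings
r = pd.read_csv(r"updates//merged_lst.csv", chunksize=500000, header=None,
                dtype=str, encoding="utf8")

for chunk in r:
    for index, row in chunk.iterrows():
        letter = row[0]  # pandas has already split on commas, so no .split() needed
        f = r"import//" + letter.upper() + ".csv"
        chunk.loc[index:index].to_csv(f, mode='a', header=False, index=False)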