
I have a very big (4+ GB) text file and a script which splits it into small files based on the characters before the first comma, e.g. a line starting with 16, goes to 16.csv and a line starting with 61, goes to 61.csv. Unfortunately the script runs for ages, I guess because of the write-out method. Is there any way to speed up the script?

import pandas as pd
import csv


with open (r"updates//merged_lst.csv",encoding="utf8", errors='ignore') as f:
    r = f.readlines()

for i in range(len(r)):
    row = r[i]
    letter = r[i].split(',')[0]
    filename = r"import//"+letter.upper()+".csv"
    with open(filename,'a',encoding="utf8", errors='ignore') as f:
        f.write(row) 
Tamas Kosa

4 Answers


I'm not sure if this really makes a huge difference, or if the bottleneck is somewhere else, but instead of opening and closing the output file for each line in the input, I would open each output file once and reuse it.

In order to keep multiple files open at the same time and use the correct one for writing, I would put them in a dictionary, using the letter as key:

files = {}

for i in range(len(r)):
    row = r[i]
    letter = r[i].split(',')[0]
    if letter not in files:
        filename = r"import//"+letter.upper()+".csv"
        files[letter] = open(filename,'a',encoding="utf8", errors='ignore')
    f = files[letter]
    f.write(row)

(Instead of checking if letter not in files you could also use files.setdefault.)
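For illustration, a minimal sketch of that variant (my sketch, not part of the answer above); note the caveat in the comment about why the explicit check can still be preferable:

files = {}

for row in r:
    letter = row.split(',')[0]
    # setdefault returns the already-opened file if this letter was seen before.
    # Note that the open(...) default is still evaluated on every line, creating
    # a throwaway file object when the key already exists, which is why the
    # explicit "if letter not in files" check can still be preferable.
    f = files.setdefault(letter, open(r"import//" + letter.upper() + ".csv",
                                      'a', encoding="utf8", errors='ignore'))
    f.write(row)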

You have to close them at the end to ensure that the contents are written to disk. Either do it manually in a loop:

for f in files.values():
    f.close()

Or you can still use a context manager (with statement), see JonSG's answer.

mkrieger1
  • I agree with this answer, but I do wonder if reading one line at a time, processing it, and writing it out would speed things up in any way. – Cow Jan 25 '23 at 14:55
  • I would add on to this that it would be better to simply read the input file line by line and process it immediately rather than `r = f.readlines()`. – Axe319 Jan 25 '23 at 14:56
  • 1
    Also, this is completely anecdotal, but I have a 1 GB text file that I just tested on. `f.readlines()` took roughly 40 seconds. Simply iterating over the file took 10. I suspect the difference is because of the large intermediate `list` that needs to be created. – Axe319 Jan 25 '23 at 15:08
  • Can you try read_csv and set a chunksize? – iohans Jan 25 '23 at 15:10

17 open files does not seem too many to manage. I would try a dictionary of file objects managed via contextlib.ExitStack(). This gives you a clean and manageable way to keep the output files open so you are not constantly reopening them, which is likely what is making things slow, or at the very least not helping.

Note: I simplified your file names just to make things a little easier to see, so you will want to fix that if you try this.

import contextlib

with contextlib.ExitStack() as stack:
    opened_files = {
        "in": stack.enter_context(open("in.csv", "rt", encoding="utf8", errors="ignore"))
    }
    for row in opened_files["in"]:
        key = row.split(",")[0]
        if key not in opened_files:
            # open each output file once and register it with the stack so it is closed automatically
            opened_files[key] = stack.enter_context(
                open(f"out/{key}.csv", "a", encoding="utf8", errors="ignore")
            )
        opened_files[key].write(row)

You could even do this if you wanted to keep the dictionary restricted to just output files:

import contextlib

with contextlib.ExitStack() as stack:
    opened_files = {}
    for row in stack.enter_context(open("in.csv", "rt", encoding="utf8", errors="ignore")):
        key = row.split(",")[0]
        if key not in opened_files:
            # again, each output file is opened once and closed by the stack at the end
            opened_files[key] = stack.enter_context(
                open(f"out/{key}.csv", "a", encoding="utf8", errors="ignore")
            )
        opened_files[key].write(row)
JonSG

Here are some performance comparisons of your original code, mkrieger1's answer, my contribution, and iohans's Pandas solution. The short version: don't use Pandas to iterate rows.

To benchmark these:

  1. I generated three sample input CSV files, each of which looks something like this:

    1,5
    8,1
    8,7
    5,10
    9,2
    4,3
    7,6
    10,7
    ...,...
    

    Both columns are random ints from 1 to 10 (a sketch of a generator for these files follows this list). The three sample CSVs grow in row count:

    CSV          Row count
    ----   ---------------
    10e5      100_000 rows
    10e6    1_000_000 rows
    10e7   10_000_000 rows
    
  2. I took your code, mkrieger1's, mine, and iohans's Pandas version:

    Version   Description                                   
    -------   ----------------------------------------------
    ver1      OP's version, baseline                        
    ver2      mkrieger1's version, with dict of output files
    ver3      my version, use csv module, read incrementally
    ver4      iohans's Pandas solution                      
    

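The sample files can be generated with something along these lines (a rough sketch, since the exact generator isn't shown here; the output file names are just placeholders):

import csv
import random

# two columns of random ints from 1 to 10; file names here are placeholders
for label, n_rows in [("10e5", 100_000), ("10e6", 1_000_000), ("10e7", 10_000_000)]:
    with open(f"input_{label}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for _ in range(n_rows):
            writer.writerow([random.randint(1, 10), random.randint(1, 10)])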
Here are the results:

Version   Size   Real (s)   User (s)   Sys (s)   Mem (MB)
-------   ----   --------   --------   -------   --------
ver1      10e5       4.04       0.44      1.83      12.19
ver1      10e6      23.97       4.29     17.79      81.49
ver1      10e7     214.74      43.54    167.35     702.08
ver2      10e5       0.03       0.02      0.0       12.19
ver2      10e6       0.2        0.18      0.01      81.91
ver2      10e7       1.82       1.7       0.11     702.91
ver3      10e5       0.03       0.03      0.0        7.47
ver3      10e6       0.23       0.22      0.0        7.77
ver3      10e7       2.3        2.27      0.03       8.56
ver4      10e5      12.5       11.05      3.28      54.85
ver4      10e6     121.89      99.2      24.54      80.81
ver4      10e7    1020+        N/A       N/A        N/A
  • yours (ver1) spends a lot of time making system calls (presumably to re-open the output files); I also suspect that some (enough?) time is taken to seek to the end of each file so that "append" works

  • mkrieger1's (ver2) offers the biggest performance improvement in time. By only opening the output files once, that version spends a lot less time in sys calls. Curious, though, that the sys calls are scaling up.

  • mine (ver3) offers the biggest improvement in memory; if you have limited memory, I suspect the previous versions will also impose a time penalty as over-pressured memory is paged to disk.

    How do we get next-to-zero memory usage?

    As some comments have suggested, by incrementally reading the input and deciding what to do for each row as it's read:

    import csv
    
    def get_fname(num: str) -> str:
        return f"output_{num:>02}.csv"
    
    input_csv = "updates/merged_lst.csv"  # path to the big input file
    out_files = []  # keep track of opened files, to close at end
    writers = {}  # csv writers for individually numbered output files
    
    with open(input_csv, newline="") as f_in:
        reader = csv.reader(f_in)
    
        for row in reader:
            num = row[0]
            if num not in writers:
                f = open(get_fname(num), "w", newline="")
                out_files.append(f)
                writers[num] = csv.writer(f, lineterminator="\n")
    
            writers[num].writerow(row)
    
    for f in out_files:
        f.close()
    

    I'm also using the csv module because it can correctly handle the CSV format. If you know 100% that your CSV file does not have embedded newlines, like:

    Co1,Col2,Col3
    16,"A column, with 
    a newline",98.6
    

    then you could get away with just reading/writing line-by-line and splitting on the first comma you see. But even if you knew that, there's not much of a time penalty for just using the csv module anyway (a small demo of the difference follows this list).

  • "the Pandas solution" (ver4) will never be performant for iterating rows, Don't use Pandas to iterate rows. I aborted the 10e7 run after 17 minutes... the trend indicates that it probably would have finished in ~20 minutes. Perhaps a better tuned Pandas solution could do better than 6x slower, but it'll never beat reading a CSV row-by-row just as text or with the csv module.

Zach Young

Reading the file in sections and using to_csv will speed this script up. This example reads the big file 500,000 lines at a time.

import pandas as pd
    
r = pd.read_csv(r"updates//merged_lst.csv", chunksize=500000, encoding="utf8", errors='ignore')

for chunk in r:
    for index, row in chunk.iterrows():
        letter = row[0].split(',')[0]
        f = r"import//"+letter.upper()+".csv"
        chunk.loc[index:index].to_csv(f, mode='a', header=False, index=False)
iohans
  • This runs about 6x slower than OP's (see timings in my table). There is also one definite bug and one maybe-bug. The maybe-bug: OP wasn't clear whether the first column is non-numeric, but their examples, 16 and 61, are numeric, and if the column were purely numeric it would be read into the df as numpy.int64, which cannot be split() or upper()-ed. The definite bug: as-is, read_csv() will consume the first row of the input as a header and mess up the results. – Zach Young Jan 25 '23 at 22:03
  • Can definitely fix the header bug by setting read_csv to ignore the header, but I am so glad you tested it. I will take this approach off the table! – iohans Jan 26 '23 at 03:16
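For completeness, a sketch of how the two fixes discussed in the comments above might look (header=None so the first data row isn't consumed as a header, dtype=str so the first column can be upper()-ed; these parameter choices are assumptions rather than code from the answer, and per the benchmarks the approach would still be slow):

import pandas as pd

# assumed fix: don't treat the first data row as a header, and read columns as strings
r = pd.read_csv(r"updates//merged_lst.csv", chunksize=500000, header=None,
                dtype=str, encoding="utf8")

for chunk in r:
    for index, row in chunk.iterrows():
        letter = row[0]  # pandas has already split on commas, so no .split() needed
        f = r"import//" + letter.upper() + ".csv"
        chunk.loc[index:index].to_csv(f, mode='a', header=False, index=False)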