Split Large csv File into multiple files depending on row value python

Question

Background

I have a large csv file in a specific format (NEM12) which is too big to work with. the file format is as below;

The file always starts with 100
Rows with a 200 represent the start of a new dataset
Rows with 300 or 400 represent data for the dataset
The File always ends with a 900

Example Below

100 NEM12               
200 NMI INFO    INFO        
300 20211001    0   0   0   0
400 20  20  F17     
300 20211002    0   0   0   0
300 20211003    0   0   0   0
200 NMI INFO    INFO        
300 20211001    0   0   0   0
300 20211002    0   0   0   0
300 20211003    0   0   0   0
300 20211004    0   0   0   0
300 20211005    0   0   0   0
…                   
200 NMI INFO    INFO        
300 20211001    0   0   0   0
300 20211002    0   0   0   0
400 20  20  F17     
300 20211003    0   0   0   0
300 20211004    0   0   0   0
900

What I'm Trying to do

I am trying to split the large file, into hundreds of smaller files. Each smaller file would contain a single 200 row with the corresponding 300 and 400 rows worth of values.

What I've tried

I've tried to read the file in via pandas, but due to its irregular shape, that hasn't worked.

I've successfully been able to iterate through lines via the below code, but it splits each value into its own column (i.e. instead of 200, it goes 2, 0, 0).

Any help be appreciated.

for line in open(test):
    if left(line, 3) == '200':
        try:
            with open(fname, 'a', newline='') as f_object:
                writer_object = writer(f_object)
                writer_object.writerow('900')
            f_object.close()
        except NameError:
            print('ignore')
        fname = str(line.replace(',', '').replace('\n', '')) + '.csv'
        with open(fname, 'w', newline='') as f_object:
            writer_object = writer(f_object)
            writer_object.writerow('100')
            writer_object.writerow(line)
    if left(line, 3) == '300' or left(line, 3) == '400':
        with open(fname, 'a', newline='') as f_object:
            writer_object = writer(f_object)
            writer_object.writerow(line)

https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c and https://www.geeksforgeeks.org/split-pandas-dataframe-by-rows/ and https://stackoverflow.com/questions/21800169/python-pandas-get-index-of-rows-which-column-matches-certain-valuemight be starters - you can find the indexes of rows with `200` and use `iloc` - I don't know much further than that, but it's a start. — Larry the Llama, Dec 07 '21 at 05:47
hi @ferdy yep, its comma delimited (i,e, 200, NMI, INFO, INFO). but in my code, when I am writing the lines to csv, it is writing one character per cell (2, 0, 0, N, M, I, ...) — Bobby Heyer, Dec 07 '21 at 06:13
Right I got it, I also read the tech info of [nem12](https://www.yurika.com.au/__data/assets/pdf_file/0004/862060/NEM12-Fact-Sheet.pdf). I will write function to possibly solve this issue. — ferdy, Dec 07 '21 at 06:19

ferdy · Accepted Answer · 2021-12-07T11:09:40.990

3

Here is one approach.

fn = 'NEM12#000000000000001#CNRGYMDP#NEMMCO.csv'

cnt = 0
outfn = f'out_{cnt}.csv'

with open(fn, 'r') as f:
    for line in f:
        if line.startswith('100,'):  # don't write
            continue
        elif line.startswith('900'):  # don't write
            continue
        elif line.startswith('200,'):  # write detect start
            cnt += 1
            outfn = f'out_{cnt}.csv'  # new filename
            
        if line.startswith(('200,', '300,', '400,')):
            with open(outfn, 'a') as w:  # write
                w.write(f'{line}'):

The output will be out_1.csv, out_2.csv etc

edited Dec 07 '21 at 11:09

answered Dec 07 '21 at 06:41

ferdy

4,396
2
4
16

This, except it's not obvious from the code where the lines starting with `300` or `400` are processed. I would add a comment before the final `with open(` that any other lines 200/300/400 are appended to the current new output file. – BdR Dec 07 '21 at 10:21
Thanks for the hint, I revised the code to strictly write on lines 200, 300 and 400. – ferdy Dec 07 '21 at 11:05

Bobby Heyer · Answer 2 · 2021-12-07T23:12:02.220

Thanks to @Ferdy for the assistance,

Using the code you provided, as well as my original code, I was able to solve the issue

from csv import writer

for line in open(test):
    if line.startswith('200'):
        try:
            with open(fname, 'a', newline='') as f:
                w = writer(f)
                w.writerow(['900'])
            f.close()
        except NameError:
            print('ignore')
        flist = [str(line).split(",")[x] for x in [1, 3, 6, 7, 8]]
        fname = '_'.join(flist) + '.csv'
        print(fname)
        with open(fname, 'w', newline='') as f:
            w = writer(f)
            w.writerow(['100', 'NEM12', 'DATECREATED', 'MDYMDP', 'NAME'])
            w.writerow(str(line).split(","))
    if line.startswith(('300,', '400,')):
        with open(fname, 'a', newline='') as f:
            w = writer(f)
            w.writerow(str(line).split(","))

Split Large csv File into multiple files depending on row value python

2 Answers2