1

Background

I have a large csv file in a specific format (NEM12) which is too big to work with. the file format is as below;

  • The file always starts with 100
  • Rows with a 200 represent the start of a new dataset
  • Rows with 300 or 400 represent data for the dataset
  • The File always ends with a 900

Example Below

100 NEM12               
200 NMI INFO    INFO        
300 20211001    0   0   0   0
400 20  20  F17     
300 20211002    0   0   0   0
300 20211003    0   0   0   0
200 NMI INFO    INFO        
300 20211001    0   0   0   0
300 20211002    0   0   0   0
300 20211003    0   0   0   0
300 20211004    0   0   0   0
300 20211005    0   0   0   0
…                   
200 NMI INFO    INFO        
300 20211001    0   0   0   0
300 20211002    0   0   0   0
400 20  20  F17     
300 20211003    0   0   0   0
300 20211004    0   0   0   0
900 

What I'm Trying to do

I am trying to split the large file, into hundreds of smaller files. Each smaller file would contain a single 200 row with the corresponding 300 and 400 rows worth of values.

What I've tried

I've tried to read the file in via pandas, but due to its irregular shape, that hasn't worked.

I've successfully been able to iterate through lines via the below code, but it splits each value into its own column (i.e. instead of 200, it goes 2, 0, 0).

Any help be appreciated.

for line in open(test):
    if left(line, 3) == '200':
        try:
            with open(fname, 'a', newline='') as f_object:
                writer_object = writer(f_object)
                writer_object.writerow('900')
            f_object.close()
        except NameError:
            print('ignore')
        fname = str(line.replace(',', '').replace('\n', '')) + '.csv'
        with open(fname, 'w', newline='') as f_object:
            writer_object = writer(f_object)
            writer_object.writerow('100')
            writer_object.writerow(line)
    if left(line, 3) == '300' or left(line, 3) == '400':
        with open(fname, 'a', newline='') as f_object:
            writer_object = writer(f_object)
            writer_object.writerow(line)
Bobby Heyer
  • 531
  • 5
  • 18
  • https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c and https://www.geeksforgeeks.org/split-pandas-dataframe-by-rows/ and https://stackoverflow.com/questions/21800169/python-pandas-get-index-of-rows-which-column-matches-certain-valuemight be starters - you can find the indexes of rows with `200` and use `iloc` - I don't know much further than that, but it's a start. – Larry the Llama Dec 07 '21 at 05:47
  • Does your original csv file has a comma between fields? – ferdy Dec 07 '21 at 06:02
  • hi @ferdy yep, its comma delimited (i,e, 200, NMI, INFO, INFO). but in my code, when I am writing the lines to csv, it is writing one character per cell (2, 0, 0, N, M, I, ...) – Bobby Heyer Dec 07 '21 at 06:13
  • Right I got it, I also read the tech info of [nem12](https://www.yurika.com.au/__data/assets/pdf_file/0004/862060/NEM12-Fact-Sheet.pdf). I will write function to possibly solve this issue. – ferdy Dec 07 '21 at 06:19

2 Answers2

3

Here is one approach.

fn = 'NEM12#000000000000001#CNRGYMDP#NEMMCO.csv'

cnt = 0
outfn = f'out_{cnt}.csv'

with open(fn, 'r') as f:
    for line in f:
        if line.startswith('100,'):  # don't write
            continue
        elif line.startswith('900'):  # don't write
            continue
        elif line.startswith('200,'):  # write detect start
            cnt += 1
            outfn = f'out_{cnt}.csv'  # new filename
            
        if line.startswith(('200,', '300,', '400,')):
            with open(outfn, 'a') as w:  # write
                w.write(f'{line}'):

The output will be out_1.csv, out_2.csv etc

ferdy
  • 4,396
  • 2
  • 4
  • 16
  • This, except it's not obvious from the code where the lines starting with `300` or `400` are processed. I would add a comment before the final `with open(` that any other lines 200/300/400 are appended to the current new output file. – BdR Dec 07 '21 at 10:21
  • Thanks for the hint, I revised the code to strictly write on lines 200, 300 and 400. – ferdy Dec 07 '21 at 11:05
1

Thanks to @Ferdy for the assistance,

Using the code you provided, as well as my original code, I was able to solve the issue

from csv import writer

for line in open(test):
    if line.startswith('200'):
        try:
            with open(fname, 'a', newline='') as f:
                w = writer(f)
                w.writerow(['900'])
            f.close()
        except NameError:
            print('ignore')
        flist = [str(line).split(",")[x] for x in [1, 3, 6, 7, 8]]
        fname = '_'.join(flist) + '.csv'
        print(fname)
        with open(fname, 'w', newline='') as f:
            w = writer(f)
            w.writerow(['100', 'NEM12', 'DATECREATED', 'MDYMDP', 'NAME'])
            w.writerow(str(line).split(","))
    if line.startswith(('300,', '400,')):
        with open(fname, 'a', newline='') as f:
            w = writer(f)
            w.writerow(str(line).split(","))
Bobby Heyer
  • 531
  • 5
  • 18