
OK, so I have found part of the answer I need in the following link, and it works as long as the first column of my CSV files is in a 2015-03-01,1,2,3,1,3 format. How do I keep it working when the first column changes to 2015-03-01 00:00:00.000?

How to split a huge csv file based on content of first column?

import csv
from itertools import groupby

# Note: groupby only groups *consecutive* rows with the same key,
# so this relies on the input already being sorted by date.
for key, rows in groupby(csv.reader(open("largeFile.csv", "r", encoding='utf-16')),
                         lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

So I have one large file with roughly 1.7 million rows in it...

2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.01,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.02,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.03,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015.01.03,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

And the program does create a new text document for each day which is great!

But when the first column is formatted as follows, it stops working.

2015-03-01 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-01 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-02 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-02 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-02 00:00:03.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-03 00:00:01.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1
2015-03-03 00:00:02.000,NULL,NULL,NULL,NULL,NULL,0,1,0,1,0,0,0,1

And it gives me the following error:

Traceback (most recent call last):
  File "C:\Python34\Proj\documents\New folder\dataPullSplit2.py", line 6, in <module>
    with open("%s.txt" % key, "w") as output:
OSError: [Errno 22] Invalid argument: '2015-03-01 00:00:00.000.txt'
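
(The OSError happens because of the colons in the timestamp: Windows rejects any of \ / : * ? " < > | in a file name, so '2015-03-01 00:00:00.000.txt' is invalid. A minimal illustration of two possible ways to make the name safe; neither snippet is from the original code:)

key = '2015-03-01 00:00:00.000'
print(key.split()[0])         # '2015-03-01' -- date only, no colons
print(key.replace(':', '-'))  # '2015-03-01 00-00-00.000' -- keeps the time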

Can someone point me in the right direction here, please?

Found a temporary solution

OK, so by changing the mode from "w" to "a" I am now appending to the files, and by using key[:-13] I was able to cut the timestamp off the file name. It works, but it is SLOW. How can I improve this, and why is it going so slow?

Here is the code now

import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("asdf2.txt", "r", encoding='utf-16')),
                         lambda row: row[0]):
    # key[:-13] strips the " HH:MM:SS.000" suffix (13 characters) from the timestamp
    with open("%s.txt" % key[:-13], "a") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
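
(An aside on the slowness: groupby only merges consecutive rows that share a key, and the key here is the full timestamp, so almost every row becomes its own one-row group and the output file is opened and closed once per row. A minimal sketch of the difference, with made-up sample rows:)

from itertools import groupby

rows = [['2015-03-01 00:00:01.000'], ['2015-03-01 00:00:02.000'],
        ['2015-03-01 00:00:03.000']]

print(sum(1 for _ in groupby(rows, lambda r: r[0])))             # 3 groups -> 3 file opens
print(sum(1 for _ in groupby(rows, lambda r: r[0].split()[0])))  # 1 group  -> 1 file open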
1 Answer


Assuming your file names should keep the pattern 2015.01.01, cleaning the key should work:

key = key.split()[0].replace('-', '.')

Full code:

import csv
from itertools import groupby


def shorten_key(key):
    return key.split()[0].replace('-', '.')


for key, rows in groupby(csv.reader(open("asdf2.txt", "r", encoding='utf-16')),
                         lambda row: shorten_key(row[0])):
    # Grouping by the shortened key (date only) keeps each date in one group,
    # so each output file is opened once instead of once per row.
    with open("%s.txt" % shorten_key(key), "a") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
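
(A side note on the "a" mode, assuming the input is sorted by date: grouping by the shortened key means each date now forms exactly one group, so mode "w" would work as well and avoids duplicated rows if the script is re-run; "a" was only necessary while one date was split across many groups:)

with open("%s.txt" % shorten_key(key), "w") as output:  # "w" is safe once each date is one group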

A quick test:

keys = ['2015-03-01 00:00:02.000', '2015.01.01']

for key in keys:
    print(key.split()[0].replace('-', '.'))

Output:

2015.03.01
2015.01.01
Mike Müller
  • Well, that is definitely working as I was hoping for now. The only issue I have with it currently is the actual speed. In my file that has the date formatted correctly the first time, splitting it into one-day increments takes a matter of seconds to maybe a minute; it is roughly 270 MB and breaks into 86,400 rows per file, since that is how many seconds are in each day. In the next file, where it also shows the hours, minutes, seconds, .000, etc., splitting just one day from it takes roughly 7-10 minutes. A huge difference. – Christopher Zion Hall Jan 26 '17 at 22:57
  • Looks like it opens and closes the file many times because it generates so many groups. Grouping by the shortened key, i.e. only the date, should help. See my updated answer. – Mike Müller Jan 27 '17 at 08:26
  • This has sped up the process as expected. I was wondering if that was what it was doing when I first changed the "w" to "a" because I noticed it was just overwriting the same file over and over. – Christopher Zion Hall Jan 27 '17 at 15:03
  • Now, once it gets to the end of the file, I am getting an IndexError as follows, so I will need to work on this. Thanks for the assistance; it is much appreciated. Traceback (most recent call last): File "C:\Users\hallcz\OneDrive\xFer\Rigs\Ensco\DS-7\Split Files\dataPullSplit3.py", line 13, in for row in rows: File "C:\Users\hallcz\OneDrive\xFer\Rigs\Ensco\DS-7\Split Files\dataPullSplit3.py", line 10, in lambda row: shorten_key(row[0])): IndexError: list index out of range – Christopher Zion Hall Jan 27 '17 at 15:04
  • Looks like an empty line. Just skip it. – Mike Müller Jan 27 '17 at 15:25
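
(A minimal way to apply the "just skip it" suggestion: filter empty rows out before grouping. The generator expression is one possible approach, not from the original answer:)

import csv
from itertools import groupby


def shorten_key(key):
    return key.split()[0].replace('-', '.')


reader = csv.reader(open("asdf2.txt", "r", encoding='utf-16'))
non_empty = (row for row in reader if row)  # drop the blank lines that caused the IndexError

for key, rows in groupby(non_empty, lambda row: shorten_key(row[0])):
    with open("%s.txt" % key, "a") as output:  # key is already the shortened date
        for row in rows:
            output.write(",".join(row) + "\n")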