0

I have a csv file that contain only three columns but more than 200K rows. I want to split the csv file to multiple csv files depend on the 2nd columns (time columns) so same columns in each file but less rows (depend on my specification). I want the time duration to be variable, like I can put 10 second reading to every single file or 15 second or 19 second. I tried a couple of codes to split the csv file but I was not successful as I am very new to python.

The input csv file will be like this:

Col 0       Col 1       Col 2       Col 3
Data YYY    12:40:05    Data XXX
Data YYY    12:40:06    Data XXX
Data YYY    12:40:07    Data XXX
Data YYY    12:40:08    Data XXX
Data YYY    12:40:09    Data XXX
Data YYY    12:40:10    Data XXX
Data YYY    12:40:11    Data XXX
Data YYY    12:40:12    Data XXX
Data YYY    12:40:13    Data XXX

The output csv file, I want to be: file1

Col 0       Col 1       Col 2       Col 3
Data YYY    12:40:05    Data XXX
Data YYY    12:40:06    Data XXX
Data YYY    12:40:07    Data XXX

file2

Col 0       Col 1       Col 2       Col 3
Data YYY    12:40:08    Data XXX
Data YYY    12:40:09    Data XXX
Data YYY    12:40:10    Data XXX

file3

Col 0       Col 1       Col 2       Col 3
Data YYY    12:40:11    Data XXX
Data YYY    12:40:12    Data XXX
Data YYY    12:40:13    Data XXX

and so on till the end (the variable in above is equal to 3 seconds). My python code is:

    import csv
    from datetime import datetime

    fieldnames = ['Col 0', 'Hour', 'Minute' , 'Second', 'Col 2' , 'Col 3']

    files = {}
    writers = {}
    seconds = []

    with open('4_Columns_PRi_Output.csv') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            output_row = {}
            output_row['Col 0'] = row['Col 0']
            change_date = datetime.strptime(row['Col 1'].split(',')[0], '%H:%M:%S')
            output_row['Hour'] = change_date.strftime('%H')
            output_row['Minute'] = change_date.strftime('%M')
            sec = change_date.strftime('%S')
            output_row['Second'] = sec

            if sec not in seconds:
                output_file = open('corrected'+str(sec)+".csv", 'w')
                writer = csv.DictWriter(output_file, fieldnames=fieldnames,lineterminator='\n')
                writer.writeheader()
                files[sec] = output_file
                writers[sec] = writer
                seconds.append(sec)
            else:
                output_file = open('corrected'+str(sec)+".csv", 'w+')
                writer = csv.DictWriter(output_file, fieldnames=fieldnames,lineterminator='\n')
            output_row['Col 2'] = row['Col 2']
            output_row['Col 3'] = row['Col 3'].strip()
            writers[sec].writerow(output_row)

    for key in files:
        files[key].close()

Your help will be highly appreciated.

Martin Evans
  • 45,791
  • 17
  • 81
  • 97
Hawzhin
  • 13
  • 3
  • Your code processes a **comma** separated values file while you show input and output files without any commas. Please show coherent files and code. – Serge Ballesta Dec 01 '17 at 18:26

3 Answers3

0

In Python, you can compare datetimes as you would ints. Something like

>>> this_morning = datetime.datetime(2009, 12, 2, 9, 30)
>>> last_night = datetime.datetime(2009, 12, 1, 20, 0)
>>> this_morning.time() < last_night.time()

Would resolve as True. Source.

You can also add to (or subtract from) datetimes. Example:

import datetime
a = datetime.datetime(100,1,1,11,34,59)
b = a + datetime.timedelta(seconds=3)

Which when printed outputs 11:34:59 and 11:35:02. Source.

So, when you are writing your csv file, keep a list of the datetime objects you are putting in. For the first datetime in the list, add N seconds to it using maxTime = firstTime + datetime.timedelta(seconds=N). As you construct the list, check that thisTime <= maxTime. If that resolves False, start a new file and do it all over again on that file.

kirkpatt
  • 625
  • 5
  • 21
0

Please read the comments throughout the code for explanation.

In essence I have 3 Methods:

  • t() gives the whole text as blob like you provided it.
  • partTuples(tupleList, secs) partitions your preprocess csv list according to secs
  • dtFromString(s) a helper to parse HH:MM:SS into a datetime object
  • a list comprehension to preprocess your csv data (given by t())

from datetime import datetime
from datetime import timedelta 
from datetime import date 

def t() :
# spacing
#        1         2         3         4 
#234567890123456789012345678901234567890123
    return '''
Col 0       Col 1       Col 2       Col 3
Data YYY    12:40:05    Data XXX
Data YYY    12:40:06    Data XXX
Data YYY    12:40:07    Data XXX
Data YYY    12:40:08    Data XXX
Data YYY    12:40:09    Data XXX
Data YYY    12:40:10    Data XXX
Data YYY    12:40:11    Data XXX
Data YYY    12:40:12    Data XXX
Data YYY    12:40:13    Data XXX 
'''

# splits parsed lines into arrays that contain one files content
def partTuples(tupelList, secs):
    rv = []
    oneSet = []
    doneTime = None

    for t in tupelList:
        myTime = t[1].time()

        if doneTime == None: 
            doneTime = (datetime.combine(date.today(), myTime) + timedelta(seconds=secs)).time()

        if myTime <= doneTime:
            oneSet.append(t[:])

        elif (myTime > doneTime):
            rv.append(oneSet[:]) # copy
            oneSet = []
            oneSet.append(t[:]) 
            doneTime = (datetime.combine(date.today(), myTime) + timedelta(seconds=secs)).time()

    if len(oneSet) > 0:
        rv.append(oneSet[:])

    return rv


def dtFromString(s):
    splitted = s.split(":")
    hh = int(splitted[0])
    mm = int(splitted[1])
    ss = int(splitted[2])
    return datetime.combine(date.today(), datetime(2000,1,1,hh, mm,ss).time())

# parses your files data into a list, parses a datetime object from text
# if you have csv with , seperation instead of the above printed fixed column
# length data - you need to adapt this
# I did not bother to parse the Col 3 as its empty anyway - adapt that as well
tpls = [ (x[0:8].strip(), dtFromString(x[9:20]), x[21:].strip(),"") for x in t().splitlines() if len(x.strip()) > 0 and not "Col" in x]


# print parsed file
print()
print(tpls)

# print splittet content - empty line == new file          
print() 
for fileCont in partTuples(tpls,3):
    for parts in fileCont:
        print(parts)     
    print()

Output:

[('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 5), 'Data XXX', ''), ('Data
 YYY', datetime.datetime(2017, 12, 1, 12, 40, 6), 'Data XXX', ''), ('Data YYY',
datetime.datetime(2017, 12, 1, 12, 40, 7), 'Data XXX', ''), ('Data YYY', datetim
e.datetime(2017, 12, 1, 12, 40, 8), 'Data XXX', ''), ('Data YYY', datetime.datet
ime(2017, 12, 1, 12, 40, 9), 'Data XXX', ''), ('Data YYY', datetime.datetime(201
7, 12, 1, 12, 40, 10), 'Data XXX', ''), ('Data YYY', datetime.datetime(2017, 12,
 1, 12, 40, 11), 'Data XXX', ''), ('Data YYY', datetime.datetime(2017, 12, 1, 12
, 40, 12), 'Data XXX', ''), ('Data YYY', datetime.datetime(2017, 12, 1, 12, 40,
13), 'Data XXX', '')]

('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 5), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 6), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 7), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 8), 'Data XXX', '')

('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 9), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 10), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 11), 'Data XXX', '')
('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 12), 'Data XXX', '')

('Data YYY', datetime.datetime(2017, 12, 1, 12, 40, 13), 'Data XXX', '')

Press any key to continue . . .
Patrick Artner
  • 50,409
  • 9
  • 43
  • 69
0

First convert your time into a datedate object. It would then be possible to advance this by your required number of seconds using a timedelta object.

This scripts keeps reading rows in until it reaches your next boundary. It then writes the accumulated rows to an output CSV file using the starting time as a filename:

from datetime import datetime, timedelta
import csv

def output_csv(output):
    filename = "{}.csv".format(get_dt(output[0]).strftime("%H_%M_%S"))

    with open(filename, 'w', newline='') as f_output:
        csv_writer = csv.writer(f_output)
        csv_writer.writerow(header)
        csv_writer.writerows(output)

get_dt = lambda x: datetime.strptime(x[1], '%H:%M:%S')
seconds = timedelta(seconds=3)      # set number of seconds to advance 

with open('input.csv', 'r', newline='') as f_input:
    csv_reader = csv.reader(f_input)
    header = next(csv_reader)
    output = [next(csv_reader)]
    read_until = get_dt(output[0]) + seconds

    for row in csv_reader:
        if get_dt(row) >= read_until:
            read_until += seconds
            output_csv(output)
            output = []
        output.append(row)

output_csv(output)

For example your first CSV would be 12_40_05.csv:

Col 0,Col 1,Col 2,Col 3
Data YYY,12:40:05,Data XXX
Data YYY,12:40:06,Data XXX
Data YYY,12:40:07,Data XXX
Martin Evans
  • 45,791
  • 17
  • 81
  • 97
  • Thank you Martin for your code, but I got below error after running the program: Traceback (most recent call last): File "C:\Users\stackoverflow_MartinEvans.py", line 17, in header = next(csv_reader) _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?) – Hawzhin Dec 02 '17 at 22:21
  • You are using Python 3.x. I have updated the script accordingly. – Martin Evans Dec 03 '17 at 10:30