2

I have 2 years worth of daily data split into monthly files. I would like to combine all of this data into one file ordered by date and time. The code I am using combines all the files, but not in order.

Code I am using

import pandas as pd
import glob, os
import csv

inputdirectory = input('Enter the directory: ')
df_list = []

for filename in sorted(glob.glob(os.path.join(inputdirectory,"*.csv*"))):
    df_list.append(pd.read_csv(filename))
    full_df = pd.concat(df_list)
    full_df.to_csv('totalsum.csv', index=False)
acb
  • 625
  • 2
  • 10
  • 20
  • Could you mention the names of the files you're trying to open? Assuming its something like 'year-month-day_file.csv', one could input that date into a new column and then sort it via pandas. – Michael Gecht Mar 23 '17 at 15:08
  • so, what is the question/problem? – Marat Mar 23 '17 at 15:11
  • the files are named like this CB01 Apr 2015.dailysum and so on. – acb Mar 23 '17 at 15:12
  • Sorry the problem I am getting is that it is combining the data, but not sorting it by date. It is combining it however the files are ordered in the folder. – acb Mar 23 '17 at 15:13
  • Is it the filename you want to use to order, or the file creation time? For the later, you can add `key=os.path.getctime` to your `sorted`... Also - if you're not doing anything with it - creating and appending to a dataframe is a waste of memory and time... – Jon Clements Mar 23 '17 at 15:14
  • Order by date (and not creation time). So if I have data for for 2 years I would like it ordered by the date of the data. Sorry if I am not explaining it clearly. – acb Mar 23 '17 at 15:16
  • How is that date determined? From the `Apr 2015` in the filename, or is it in the data itself or...? – Jon Clements Mar 23 '17 at 15:16
  • have you tried [.sort_values()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) on the date column after the concat? – Stael Mar 23 '17 at 15:17
  • In each monthly file there is data for every day. My code is combining the data by however it is ordered in the directory. – acb Mar 23 '17 at 15:18
  • When I try .sort_values(by = 'datecolumn') it orders it by month and not by date so it will order it 1/1/14, 1/1/15, 1/2/14, 1/2/15 instead of 1/1/14 , 1/2/14 etc... – acb Mar 23 '17 at 15:31

2 Answers2

1

Pre-process the file list to sort it:

  • Create a list of file_names,
  • extract the relevant info from the name and create a datetime object,
  • sort on the datetime object,
  • then use the sorted list.

import operator
fyles = ['CB02 May 2014.dailysum',
         'CB01 Apr 2015.dailysum',
         'CB01 Jul 2015.dailysum',
         'CB01 May 2015.dailysum',
         'CB01 Sep 2015.dailysum',
         'CB01 Oct 2015.dailysum',
         'CB13 May 2015.dailysum',
         'CB01 Jun 2017.dailysum',
         'CB01 Aug 2015.dailysum'
         ]

new_fyles = []
for entry in fyles:
    day, month, year = entry.split()
    year, _ = year.split('.')
    day = day[-2:]
##    print(entry, (month, year))
    dt = datetime.datetime.strptime(' '.join((day, month, year)), '%d %b %Y')
##    print(entry, dt)
    new_fyles.append((entry, dt))

date = operator.itemgetter(1)
f_name = operator.itemgetter(0)
new_fyles.sort(key = date)
for entry in new_fyles:
    print(f_name(entry))

You can make the file list like this:

import os, os.path
fyles = [fn for fn in os.listdir(inputdirectory) if fn.endswith('.dailysum')]

Then, after sorting, write the contents of each file to the new file:

with open('totalsum.csv', 'w') as out:
    for entry in new_fyles:
        f_path = os.path.join(inputdirectory, f_name(entry))
        with open(f_path) as f:
            out.write(f.read())

You could perform the sorting in a function

date = operator.itemgetter(1)
f_name = operator.itemgetter(0)
def f_name_sort(f_list):
    '''Return sorted list of file names'''
    new_fyles = []
    for entry in f_list:
        day, month, year = entry.split()
        year, _ = year.split('.')
        day = day[-2:]
        dt = datetime.datetime.strptime(' '.join((day, month, year)), '%d %b %Y')
        new_fyles.append((entry, dt))
    new_fyles.sort(key = date)
    return [f_name(entry) for f_name in new_fyles]

and use it like this:

for entry in f_name_sort(fyles):
    ...

Or write a function that converts a filename to a datetime object and use it as the key for sorting

def key(f_name):
    day, month, year = f_name.split()
    year, _ = year.split('.')
    day = day[-2:]
    return datetime.datetime.strptime(' '.join((day, month, year)), '%d %b %Y')

fyles.sort(key = key)
for entry in fyles:
    ...
wwii
  • 23,232
  • 7
  • 37
  • 77
1

after this line:

full_df = pd.concat(df_list)

you need to convert your column 'datecolumn' to a datetime column:

full_df['datecolumn'] = full_df['datecolumn'].to_datetime(format=r'%d/%m/%y')

(judging by your comments, that format should work)

finally you can use

full_df.sort_values(by='datecolumn').to_csv('totalsum.csv', index=False)

to sort and write it

Stael
  • 2,619
  • 15
  • 19