
I have multiple CSV files, each containing timeseries data indexed by date. Is there a way to create a single DataFrame containing all the data, with the index adjusting for new dates that may not have appeared in the files read previously? For example, say I read in timeseries 1:

03/01/2001  2.984
04/01/2001  3.016
05/01/2001  2.891
08/01/2001  2.527
09/01/2001  2.445
11/01/2001  2.648
12/01/2001  2.803
15/01/2001  2.943

The DataFrame would look pretty much like the data above. But suppose I then read in another file, say timeseries 2:

02/01/2001  24.75
03/01/2001  24.35
04/01/2001  25.1
08/01/2001  23.5
09/01/2001  23.6
10/01/2001  24.5
11/01/2001  24.7
12/01/2001  24.4

You can see that timeseries 1 has a value for 05/01/2001 that timeseries 2 does not, while timeseries 2 has data points for 02/01/2001 and 10/01/2001 that timeseries 1 lacks. So is there a way to end up with the following:

02/01/2001  null    24.75 ..etc
03/01/2001  2.984   24.35 ..etc
04/01/2001  3.016   25.1  ..etc
05/01/2001  2.891   null  ..etc
08/01/2001  2.527   23.5  ..etc
09/01/2001  2.445   23.6  ..etc
10/01/2001  null    24.5  ..etc
11/01/2001  2.648   24.7  ..etc
12/01/2001  2.803   24.4  ..etc
15/01/2001  2.943   null  ..etc

where the index adjusts for the new dates and any timeseries without data for that day is set to null or some such value?

My code so far is fairly basic: I can walk through a directory, open the .csv files, and read each one into a DataFrame, but I do not know how to combine the DataFrames in the way outlined above.

    def getTimeseriesData(DataPath, columnNum, startDate, endDate):
        #print('startDate: ', startDate, ' endDate: ', endDate)
        colNames = ['date']

        path = DataPath
        print('DataPath: ', DataPath)
        allfiles = glob.glob(os.path.join(path, "*.csv"))
        for fname in allfiles:
            # use the file name (minus extension) as the column name
            name = os.path.splitext(fname)[0]
            name = os.path.split(name)[1]
            colNames.append(name)

        dataframes = [pd.read_csv(fname, header=None, usecols=[0, columnNum]) for fname in allfiles]
        # not sure of the next bit
Stacey

3 Answers


pd.concat can be used to concatenate DataFrames with different indexes. For example,

df1 = pd.DataFrame({'A': list('ABCDE')}, index=range(5))
df2 = pd.DataFrame({'B': list('ABCDE')}, index=range(2,7))
pd.concat([df1, df2], axis=1)

yields

     A    B
0    A  NaN
1    B  NaN
2    C    A
3    D    B
4    E    C
5  NaN    D
6  NaN    E

Notice that the indexes of df1 and df2 are aligned and NaN is used wherever there is a missing value.


So in your case, if you use

pd.read_csv(fname, header=None, usecols=[0, column_num], parse_dates=[0],
            index_col=[0], names=['date', name])

the index_col=[0] will make the first column the index of the DataFrame, so that later on, calling

dfs = pd.concat(dfs, axis=1)

will result in one DataFrame with all the DataFrames aligned based on the dates.


With data1.csv and data2.csv placed in ~/tmp,

import glob
import os
import pandas as pd

def get_timeseries_data(path, column_num):
    colNames = ['date']
    dfs = []
    allfiles = glob.glob(os.path.join(path, "*.csv"))
    for fname in allfiles:
        name = os.path.splitext(fname)[0]
        name = os.path.split(name)[1]
        colNames.append(name)
        df = pd.read_csv(fname, header=None, usecols=[0, column_num], 
                        parse_dates=[0], dayfirst=True,
                        index_col=[0], names=['date', name])

        # aggregate rows with duplicate index by taking the mean
        df = df.groupby(level=0).agg('mean')

        # alternatively, drop rows with duplicate index
        # http://stackoverflow.com/a/34297689/190597 (n8yoder)
        # df = df[~df.index.duplicated(keep='first')]

        dfs.append(df)
    dfs = pd.concat(dfs, axis=1)
    return dfs

path = os.path.expanduser('~/tmp')
column_num = 1
dfs = get_timeseries_data(path, column_num)
print(dfs)

yields

            data1  data2
date                    
2001-01-02    NaN  24.75
2001-01-03  2.984  24.35
2001-01-04  3.016  25.10
2001-01-05  2.891    NaN
2001-01-08  2.527  23.50
2001-01-09  2.445  23.60
2001-01-10    NaN  24.50
2001-01-11  2.648  24.70
2001-01-12  2.803  24.40
2001-01-15  2.943    NaN
unutbu
  • I've implemented the above code but receive an error 'InvalidIndexError: Reindexing only valid with uniquely valued Index objects'. Any idea why that might be? – Stacey Apr 20 '16 at 21:09
  • A DataFrame in `dfs` may have more than one row associated with the same date. If so, you'll have to decide how you wish to handle duplicate dates. You could, for example, simply drop all but one of the duplicate dates. (I've edited the post above to show how.) – unutbu Apr 20 '16 at 21:38
  • Alternatively, you could use `groupby/agg` to aggregate rows with duplicate dates into a single row. Or, if you want to propagate the duplicated index across multiple DataFrames, then you'll need to use `pd.merge` instead of `pd.concat`. That would require a solution that looks more like [flyingmeatball's](http://stackoverflow.com/a/36754677/190597). – unutbu Apr 20 '16 at 21:38
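
To make the duplicate-index problem concrete, here is a small sketch with made-up data; the repeated 2001-01-03 row is hypothetical:

import pandas as pd

# df1 contains 2001-01-03 twice, so its index is not unique
df1 = pd.DataFrame({'data1': [2.984, 3.000, 3.016]},
                   index=pd.to_datetime(['2001-01-03', '2001-01-03', '2001-01-04']))
df2 = pd.DataFrame({'data2': [24.35, 25.10]},
                   index=pd.to_datetime(['2001-01-03', '2001-01-04']))

# pd.concat([df1, df2], axis=1) fails here because df1's index has duplicates

# aggregating the duplicate dates first (here by mean) makes the index unique again
df1 = df1.groupby(level=0).mean()
print(pd.concat([df1, df2], axis=1))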

Maybe not the most elegant, but I would create a DatetimeIndex running from the minimum date to the maximum date across all your csv files, build a DataFrame on that index (call it df), and then assign each file's values into it along the lines of df['file1'] = pd.read_csv('file1.csv'). You will then have some rows that are entirely NaN, which you can filter for and remove (a rough sketch follows below).
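
A minimal sketch of that approach, assuming two-column files laid out like the samples in the question (day-first dates in the first column, values in the second); the ~/tmp directory is a placeholder:

import glob
import os
import pandas as pd

path = os.path.expanduser('~/tmp')  # placeholder directory
series = {}
for fname in glob.glob(os.path.join(path, "*.csv")):
    name = os.path.split(os.path.splitext(fname)[0])[1]
    # read each file as a Series of values indexed by date
    s = pd.read_csv(fname, header=None, names=['date', name],
                    parse_dates=[0], dayfirst=True, index_col=0)[name]
    series[name] = s

# build an index spanning the earliest to the latest date seen in any file
full_index = pd.date_range(min(s.index.min() for s in series.values()),
                           max(s.index.max() for s in series.values()), freq='D')
df = pd.DataFrame(index=full_index)

# column assignment aligns on the index; dates missing from a file become NaN
for name, s in series.items():
    df[name] = s

# drop the dates (weekends, holidays) where no file has any data
df = df.dropna(how='all')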

alex314159

Try something like this using pd.merge:

df1 = pd.DataFrame([['03/01/2001', 2.984], ['04/01/2001', 3.016], ['05/01/2001', 2.891], ['08/01/2001', 2.527],
                    ['09/01/2001', 2.445], ['11/01/2001', 2.648],
                    ['12/01/2001', 2.803], ['15/01/2001', 2.943]], columns=['date', 'field'])

df2 = pd.DataFrame([['02/01/2001', 24.75], ['03/01/2001', 24.35], ['04/01/2001', 25.1], ['08/01/2001', 23.5],
                    ['09/01/2001', 23.6], ['10/01/2001', 24.5], ['11/01/2001', 24.7], ['12/01/2001', 24.4]],
                   columns=['date', 'field'])

# files in your directory
files = [df1, df2]

fileNo = 1
for currFile in files:
    if fileNo == 1:
        df = currFile
    else:
        # rename returns a new DataFrame, so assign the result back
        currFile = currFile.rename(columns={'field': 'field_fromFile_' + str(fileNo)})
        # an outer merge keeps dates that appear in either DataFrame
        df = pd.merge(df, currFile, how='outer', on='date')
    fileNo = fileNo + 1
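
If you then want the combined frame indexed and sorted by date, as in the desired output in the question, something along these lines (applied to the df built by the loop above) should work:

# parse the day-first date strings and use them as a sorted index
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.set_index('date').sort_index()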
flyingmeatball