I am trying to merge 100+ CSV files into a single CSV file. Each file has a timestamp column and one data column. Some files share the same data column (the same meter) but cover different timestamp ranges.
I'm building a list of filenames by walking a directory with os.walk(), then looping through that list, opening each file as a DataFrame and merging it into an output DataFrame. The output starts off empty and accumulates data on each pass.
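For reference, the file list is built along these lines (a minimal sketch; root_dir is a placeholder for the actual directory):
import os

file_list = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for name in filenames:
        if name.lower().endswith('.csv'):
            file_list.append(os.path.join(dirpath, name))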
Here's the gist of the merge loop:
import pandas as pd

output = pd.DataFrame(columns=['TimeStamp'])
for filename in file_list:
    # tag_name comes from a dictionary matching the file with a label;
    # column names are assigned here so the lines below can refer to them
    df = pd.read_csv(filename, sep=',', skiprows=2, header=None,
                     names=['TimeStamp', tag_name], encoding='utf-16')
    # convert dtypes
    df['TimeStamp'] = pd.to_datetime(df['TimeStamp'], dayfirst=True)
    df[tag_name] = pd.to_numeric(df[tag_name])

    # here are the two approaches I'm trying (one at a time):
    # approach 1
    output = output.merge(df, how='outer', on='TimeStamp', sort=True)
    # approach 2 - for this one, output is initialised with all the
    # columns from the tag_name dictionary
    output = output.merge(df, how='outer', on=['TimeStamp', tag_name], sort=True)
Here's an example of the data (three files):

File 1:
'TimeStamp', 'Meter 1'
2019-01-01 00:00, 12
2019-01-01 01:00, 17
2019-01-01 02:00, 10

File 2:
'TimeStamp', 'Meter 1'
2019-01-01 03:00, 13
2019-01-01 04:00, 20
2019-01-01 05:00, 9

File 3:
'TimeStamp', 'Meter 2'
2019-01-01 00:00, 1
2019-01-01 01:00, 6
2019-01-01 02:00, 5
Expected result for the three files above:
'TimeStamp', 'Meter 1', 'Meter 2'
2019-01-01 00:00, 12, 1
2019-01-01 01:00, 17, 6
2019-01-01 02:00, 10, 5
2019-01-01 03:00, 13,
2019-01-01 04:00, 20,
2019-01-01 05:00, 9,
Result for approach 1:
'TimeStamp', 'Meter 1_x', 'Meter 1_y', 'Meter 2'
2019-01-01 00:00, 12, , 1
2019-01-01 01:00, 17, , 6
2019-01-01 02:00, 10, , 5
2019-01-01 03:00, , 13,
2019-01-01 04:00, , 20,
2019-01-01 05:00, , 9,
Result for approach 2:
'TimeStamp', 'Meter 1', 'Meter 2'
2019-01-01 00:00, 12,
2019-01-01 00:00, , 1
2019-01-01 01:00, 17,
2019-01-01 01:00, , 6
2019-01-01 02:00, 10,
2019-01-01 02:00, , 5
2019-01-01 03:00, 13,
2019-01-01 04:00, 20,
2019-01-01 05:00, 9,
They're both almost there, but not quite. Is there a way to achieve this with merge, or do I need a different approach entirely?
I was trying to figure out a way to add the matching columns together from approach 1, but the number of duplicate columns is irregular each time. I'll try this again in the morning.
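Something like this might work (an untested sketch: strip the _x/_y suffixes pandas adds, then sum the columns that share a base name; min_count=1 keeps all-NaN cells as NaN):
# untested sketch: collapse the suffixed duplicates left by approach 1
out = output.set_index('TimeStamp')
base_names = out.columns.str.replace(r'(_x|_y)$', '', regex=True)
out = out.groupby(base_names, axis=1).sum(min_count=1).reset_index()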
Edit: The other question linked as answering this one, while a fantastic resource, doesn't actually deal with this case, where there are duplicate column names that need to be merged together. The closest part of that answer is a solution using functools.partial, but it says that if you have duplicate column names you may need to use a lambda, without elaborating further. I don't know how I would implement that solution with a lambda function.
I tried that approach on a small set of files. It didn't fail without a lambda function, but it produced the same results as approach 2 in my own code. It is significantly faster than my approach, though.
from functools import reduce, partial

# dfs is the list of per-file DataFrames
outer_merge = partial(pd.merge, how='outer')
output = reduce(outer_merge, dfs)
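My best guess at the lambda version is just pinning the join key, something like the line below, though I'd expect it to leave the same suffixed duplicates as approach 1:
output = reduce(lambda left, right: pd.merge(left, right, on='TimeStamp', how='outer'), dfs)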
I'm now thinking this is an issue with pandas seeing the timestamps as not equal, although the same thing happens when I leave them as strings.
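One way to sanity-check whether pandas sees the keys as equal (a sketch, comparing two frames from the dfs list that should share timestamps):
# do the key columns have the same dtype, and do any values actually match?
print(dfs[0]['TimeStamp'].dtype, dfs[1]['TimeStamp'].dtype)
print(dfs[0]['TimeStamp'].isin(dfs[1]['TimeStamp']).any())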
Edit 2: The top of the actual CSV files, as seen in a text editor:
"sep=,"
"","Meter_tag",""
"Time","Average(Time Weighted)",""
"01/06/2017 00:00:00","0.000",""
Edit 3: Thanks to Valentino for helping with this. I ended up using a workaround: my output looked like approach 2 above, but then I grouped it hourly, which squashed the extra rows down. Each group just sums the actual data with blanks, so the sum operation doesn't change the data.
output = output.groupby(pd.Grouper(key='TimeStamp', freq='1H')).sum().reset_index()
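As a toy check of why the squash is safe, the two half-filled rows per timestamp that approach 2 produces collapse into one, because sum() skips the gaps:
import pandas as pd

# two half-filled rows for the same hour, as approach 2 produces
output = pd.DataFrame({
    'TimeStamp': pd.to_datetime(['2019-01-01 00:00'] * 2),
    'Meter 1': [12, None],
    'Meter 2': [None, 1],
})
squashed = output.groupby(pd.Grouper(key='TimeStamp', freq='1H')).sum().reset_index()
# squashed is a single row: 2019-01-01 00:00, 12.0, 1.0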