I am trying to merge 100+ csv files into a single csv file. Each file has a timestamp column and a data column. Some of the files have the same data column but different timestamps.

I'm getting a list of filenames by searching through a directory with os.walk(), then looping through this list, opening each file as a dataframe and merging it into an output dataframe. The output starts off empty and gains data on each loop.

Here's the gist of the code:

import pandas as pd

output = pd.DataFrame(columns=['TimeStamp'])
for filename in file_list:
    df = pd.read_csv(filename, sep=',', skiprows=2, header=None, encoding='utf-16')
    # assign column names (header=None leaves integer labels otherwise);
    # tag_name comes from a dictionary matching the file with a label
    df.columns = ['TimeStamp', tag_name]

    # convert dtypes
    df['TimeStamp'] = pd.to_datetime(df['TimeStamp'], dayfirst=True)
    df[tag_name] = pd.to_numeric(df[tag_name])

    # here are the two approaches I'm trying:
    # approach 1
    output = output.merge(df, how='outer', on='TimeStamp', sort=True)

    # approach 2 - this has the output df initialised with the columns from the tag_name dictionary
    output = output.merge(df, how='outer', on=['TimeStamp', tag_name], sort=True)

Here's an example of the data:

'TimeStamp',      'Meter 1'
2019-01-01 00:00, 12
2019-01-01 01:00, 17
2019-01-01 02:00, 10
'TimeStamp',      'Meter 1'
2019-01-01 03:00, 13
2019-01-01 04:00, 20
2019-01-01 05:00, 9
'TimeStamp',      'Meter 2'
2019-01-01 00:00, 1
2019-01-01 01:00, 6
2019-01-01 02:00, 5

Expected result for the 3 above:

'TimeStamp',      'Meter 1', 'Meter 2'
2019-01-01 00:00, 12,        1
2019-01-01 01:00, 17,        6
2019-01-01 02:00, 10,        5
2019-01-01 03:00, 13,
2019-01-01 04:00, 20,
2019-01-01 05:00, 9,

Result for approach 1:

'TimeStamp',      'Meter 1', 'Meter 1_x', 'Meter 2'
2019-01-01 00:00, 12,        ,           1
2019-01-01 01:00, 17,        ,           6
2019-01-01 02:00, 10,        ,           5
2019-01-01 03:00, ,          13,
2019-01-01 04:00, ,          20,
2019-01-01 05:00, ,          9,

Result for approach 2:

'TimeStamp',      'Meter 1',    'Meter 2'
2019-01-01 00:00, 12,
2019-01-01 00:00, ,              1
2019-01-01 01:00, 17,
2019-01-01 01:00, ,              6
2019-01-01 02:00, 10,
2019-01-01 02:00, ,              5
2019-01-01 03:00, 13,
2019-01-01 04:00, 20,
2019-01-01 05:00, 9,

They're both almost there but not quite. Is there a way to achieve this with merge or do I need a different approach entirely?

I was trying to figure out a way to add the matching columns together from approach 1, but there's an irregular number of columns each time. I'll try this again in the morning.
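The kind of thing I had in mind (a sketch I haven't verified against the full data; it strips the merge suffixes, then sums columns that share a base name, however many of them there are):

meters = output.set_index('TimeStamp')
base = meters.columns.str.replace(r'(_x|_y)+$', '', regex=True)  # 'Meter 1_x' -> 'Meter 1'
output = meters.groupby(base, axis=1).sum().reset_index()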


Edit: The other question linked as answering this one, while a fantastic resource, does not actually deal with this case, where there are duplicate column names that need to be merged together. The closest part of that answer uses a solution with functools.partial, but it says that if you have duplicate column names you may need to use a lambda, without elaborating further. I don't know how I would implement that solution with a lambda function.

I tried this approach on a small set of files and it didn't fail without a lambda function, but it did produce the same results as approach 2 in my own code. It is significantly faster than my approach, though.

import pandas as pd
from functools import reduce, partial

outer_merge = partial(pd.merge, how='outer')
output = reduce(outer_merge, dfs)  # dfs is the list of per-file dataframes
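For the record, my best guess at the lambda form that answer hints at (a sketch on my part; it just inlines the merge arguments, so I'd expect it to behave the same as the partial version):

merged = reduce(lambda left, right: pd.merge(left, right, how='outer'), dfs)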

I'm thinking this is now an issue with pandas seeing the timestamps as not equal. The same thing happens when I leave them as strings though.
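One way I could test that hypothesis (a sketch; df1 and df2 stand for two per-file dataframes that should share timestamps):

print(df1['TimeStamp'].dtype, df2['TimeStamp'].dtype)  # a dtype mismatch (datetime64 vs object) would stop keys matching
print(df1['TimeStamp'].isin(df2['TimeStamp']).any())   # whether any timestamps actually match across the two files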

Edit 2: The top of the actual csv files as seen in a text editor:

"sep=,"
"","Meter_tag",""
"Time","Average(Time Weighted)",""
"01/06/2017 00:00:00","0.000",""

Edit 3: Thanks to Valentino for helping with this. I ended up using a workaround so my output was like approach 2 above, but then I grouped it hourly and that squashed the extra rows down. It was just summing the actual data with zeroes, so the sum operation doesn't change the data.

output = output.groupby(pd.Grouper(key='TimeStamp', freq='1H')).sum().reset_index()
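A minimal illustration of why the squash works (made-up values in the shape of approach 2; sum() skips the NaNs, so each pair of half-filled rows collapses into one):

import pandas as pd

df = pd.DataFrame({
    'TimeStamp': pd.to_datetime(['2019-01-01 00:00', '2019-01-01 00:00',
                                 '2019-01-01 01:00', '2019-01-01 01:00']),
    'Meter 1': [12, None, 17, None],
    'Meter 2': [None, 1, None, 6],
})
print(df.groupby(pd.Grouper(key='TimeStamp', freq='1H')).sum().reset_index())
#            TimeStamp  Meter 1  Meter 2
# 0 2019-01-01 00:00:00     12.0      1.0
# 1 2019-01-01 01:00:00     17.0      6.0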
1 Answer

It's easier than you think.

output = output.merge(df, how='outer', sort=True)

Just drop the on keyword parameter. If on=None (the default), the docs say:

If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

Using your three sample files, it will give you:

            TimeStamp        Meter 1        Meter 2
0 2019-01-01 00:00:00             12            1.0
1 2019-01-01 01:00:00             17            6.0
2 2019-01-01 02:00:00             10            5.0
3 2019-01-01 03:00:00             13            NaN
4 2019-01-01 04:00:00             20            NaN
5 2019-01-01 05:00:00              9            NaN

Be careful: if some files have overlapping TimeStamp values and the same columns, you will end up with repeated TimeStamp values. This case is not covered in your sample files, so I assume you are sure this situation never happens.
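If it could happen, one possible guard (a sketch, beyond what your samples need) is to collapse duplicates after the loop; first() takes the first non-null value per column within each group:

output = output.groupby('TimeStamp', as_index=False).first()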

– Valentino
  • The docs say it, but pandas 0.23.4 chokes on it with *MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False* – Serge Ballesta Jun 13 '19 at 13:50
  • @SergeBallesta So it's a version problem / bug? I used pandas 0.24 and what I posted worked for me. – Valentino Jun 13 '19 at 14:09
  • I believe it is. Your solution is really neat and clean. I left the comment to warn future readers that they could have to upgrade older versions of pandas. – Serge Ballesta Jun 13 '19 at 15:12
  • Thanks for your answer but unfortunately this produced the same results as approach 2 in the question. I am using pandas 0.24.1. In the actual data there are 6 months in each file. It merges the first 6 months fine, but then the remaining 3 years are as in approach 2. It's merging them meter by meter, as in the first 6 months of "meter 1", then the second 6 months of "meter 1", and so on until all of meter 1 is complete before beginning on "meter 2". Could the actual problem be that it's not recognising the timestamps as equal? – Mitch Jun 14 '19 at 07:58
  • I was just wondering... why, when you read the csv file, do you have `skiprows=2` and `header=None`? Are you actually skipping the header? Could you try what happens when you remove these arguments and read the header? – Valentino Jun 14 '19 at 08:27
  • I noticed today I was skipping an extra row, so it is now skiprows=1. The header isn't what I want so I'm assigning it myself. If I don't use skiprows I get a "line 3 has 3 columns but expected 1"-type error. The top of the actual files has been added to the question. – Mitch Jun 14 '19 at 09:06
  • @Valentino I do a little bit of extra processing that I didn't include in the original code. It just drops the 3rd column in each file, which is blank, and sets the column names. I changed it so it takes the original header and then renames it, but it gives the same result. – Mitch Jun 14 '19 at 09:27
  • I think the problem may lie somewhere in your extra processing, because looking at your approach 2 it seems like it reads empty strings or strings with only spaces: you have commas without any data shown. I am just shooting in the dark, but try using the following regex as separator to get rid of any extra spaces which may mess with the reading: `sep='\s*,\s*'` – Valentino Jun 14 '19 at 09:48
  • I just tried that and it ends up with the same result. There are no strings with spaces but there are empty strings. Using that regex makes it also read in the quotes around each value. str.strip('"') got rid of the quotes so I could convert the data types but it ended up being the same. It also has the same result if I leave them as strings. I'm just going to try and group the final dataframe hourly to see if that works as a workaround. – Mitch Jun 14 '19 at 10:19
  • Yeah successful workaround! Worked like a charm! I'll make a separate answer with exactly what I did. – Mitch Jun 14 '19 at 10:28