
The data that I'm using looks like this:

csv1 = pd.DataFrame({'D': [1-10, 2-10, 3-10, 4-10, ...],  # dates
                     'C': [#, #, #, #, ...]})  # values

csv2 = pd.DataFrame({'D': [3-10, 4-10, 5-10, 6-10, ...],  # dates
                     'C': [#, #, #, #, ...]})  # values

csv3 = pd.DataFrame({'D': [5-10, 6-10, 7-10, 8-10, ...],  # dates
                     'C': [#, #, #, #, ...]})  # values
.
.
.
csv100 = pd.DataFrame({'D': [5-10, 6-10, 7-10, 8-10, ...],  # dates
                       'C': [#, #, #, #, ...]})  # values

I want a data frame like this:

df_merged = pd.DataFrame({'D': [1-10, 2-10, 3-10, 4-10, 5-10, 6-10, ...],  # dates
                          'C1': [#, #, #, #, #, #, ...],  # values
                          'C2': [#, #, #, #, #, #, ...],  # values
                          'C3': [#, #, #, #, #, #, ...],  # values
                          .
                          .
                          .
                          'C100': [#, #, #, #, #, #, ...]})  # values

I have been trying to merge multiple data frames, around 100, that have the same columns but different rows (and the rows are not in the same order). I would like to merge them on the column 'D' (the date), so that every row with the same date ends up in a single row of the result. Because the number of data frames is large and changes over time (today I could have 110, tomorrow I could have 90...), merging them one at a time in a loop is too slow. Researching a solution, I found that the consensus is to use dictionaries. I applied that solution to my code, but I got an error that I don't know how to solve. The code is the following:

import pandas as pd
import os
from functools import reduce

path = r'C:\Users\ra\Desktop\Px\a'  # path of folder 'a'

df = {}  # dictionary of data frames built from the csv files in folder 'a'
x = [#vector that contains the names of the csv files as strings]
for j, name in enumerate(x):
    df['df%s' % j] = pd.read_csv(os.path.join(path, '%s.csv' % name))  # assigns a string key to each data frame, e.g. 'df1' (the key is a string and I think this is the problem)
    df['df%s' % j].rename(columns={'C': name}, inplace=True)  # renames column 'C' of every data frame to the name of its csv file

df_merged = reduce(lambda left, right: pd.merge(left, right, on=['D'], how='outer'), df)  # merges every data frame into a single data frame 'df_merged' on column 'D', which represents the date

The problem is in the last line; the output is the following:

---> df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['D'],how='outer'),df)
.
.
.
ValueError: can not merge DataFrame with instance of type <class 'str'>

If I change the keys from strings to integers (by using the plain numbers 'j' instead of the vector 'x'), I get the following output:

---> df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['D'],how='outer'),df)
.
.
.
ValueError: can not merge DataFrame with instance of type <class 'int'>

To make the code work, I tried to find a way to convert the string keys into variable names. But, apparently, that is a sin. Also, according to @AnkitMalik, the 'reduce' method can't be used with dictionaries. How can I merge all these data frames on the column 'D' in a Pythonic way if the keys in the dictionary are strings/integers? Or, how can I make a dynamic list of data frames if their number changes over time depending on the number of csv files in folder 'a'?

  • Please edit your post to include a [**Minimal, Complete, Verifiable Example**](https://stackoverflow.com/help/mcve). – Alex Mar 01 '18 at 00:43
  • Possible duplicate. Check this: https://stackoverflow.com/questions/23668427/pandas-joining-multiple-dataframes-on-columns – YOLO Mar 01 '18 at 01:22
  • @ManishSaraswat The link shows the following list: `dfs = [df0, df1, df2, dfN]`. I don't have a list; what I have is a dictionary that looks like this: `dfs = {'df0', 'df1', 'df2', ..., 'dfN'}`. The keys in the dictionary are strings. In order to make a list of data frames, as suggested by @AnkitMalik, I would have to know the number of data frames, but it changes over time depending on the number of csv files in folder `'a'`. – Alberto_Márquez Mar 01 '18 at 02:13
  • To avoid the error you just had to use `df.values()` instead of `df` as the last argument of `reduce`. But it's better to use `pd.concat` instead, as @the_constant said in [their answer](https://stackoverflow.com/a/49041824/7851470). It's less expensive, and you will also have less pain with column names, as `pd.merge` will give you something like `C_x C_y C_x C_y ...`. – Georgy Apr 25 '19 at 09:25
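(Applied to the question's code, Georgy's fix is a one-line change; a minimal sketch, assuming `df` is the dictionary built in the question:)

# iterating over a dict hands reduce the string keys, which is what triggers
# the ValueError; df.values() hands it the DataFrames themselves
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['D'], how='outer'),
                   df.values())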

3 Answers


Merging or appending DataFrames one at a time is very expensive, so it's important to make as few calls as possible.

What you can do, however, is make the date column of each DataFrame its index, put the DataFrames in a list, and then make one call to pandas.concat() for all of them.

You will of course have to fiddle with the column names and what they represent, as unless you want a specific entry to be a tuple, you'll end up with several columns sharing the same name (a sketch of one way to rename them follows the example below).

Example:

>>> import pandas
>>> df_0 = pandas.DataFrame(
        {
            'a': pandas.date_range('20180101', '20180105'), 
            'b': range(5, 10)
        }, 
        index=range(5)
    )
>>> df_0
           a  b
0 2018-01-01  5
1 2018-01-02  6
2 2018-01-03  7
3 2018-01-04  8
4 2018-01-05  9
>>> df_1 = pandas.DataFrame(
        {
            'a': pandas.date_range('20180103', '20180107'), 
            'b': range(5, 10)
        }, 
        index=range(5)
    )
>>> df_2 = pandas.DataFrame(
        {
            'a': pandas.date_range('20180105', '20180109'), 
            'b': range(5, 10)
        }, 
        index=range(5)
    )
>>> df_0 = df_0.set_index('a')
>>> df_1 = df_1.set_index('a')
>>> df_2 = df_2.set_index('a')
>>> pandas.concat([df_0, df_1, df_2], axis=1)  # this is where the magic happens
              b    b    b
a
2018-01-01  5.0  NaN  NaN
2018-01-02  6.0  NaN  NaN
2018-01-03  7.0  5.0  NaN
2018-01-04  8.0  6.0  NaN
2018-01-05  9.0  7.0  5.0
2018-01-06  NaN  8.0  6.0
2018-01-07  NaN  9.0  7.0
2018-01-08  NaN  NaN  8.0
2018-01-09  NaN  NaN  9.0
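
One way to deal with the duplicate b columns above (a sketch, not part of the original answer; the b0/b1/b2 names are just illustrative) is to rename each value column before concatenating, much as the question renames 'C' to the file name:

>>> renamed = [df.rename(columns={'b': 'b%d' % i})
               for i, df in enumerate([df_0, df_1, df_2])]
>>> pandas.concat(renamed, axis=1)
             b0   b1   b2
a
2018-01-01  5.0  NaN  NaN
2018-01-02  6.0  NaN  NaN
2018-01-03  7.0  5.0  NaN
2018-01-04  8.0  6.0  NaN
2018-01-05  9.0  7.0  5.0
2018-01-06  NaN  8.0  6.0
2018-01-07  NaN  9.0  7.0
2018-01-08  NaN  NaN  8.0
2018-01-09  NaN  NaN  9.0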
the_constant
  • What would you do @NoticeMeSenpai if you have to concatenate a varying number of data frames in order to build the list for `pandas.concat()`? For example: at 11:00 A.M. you have 90 data frames named `[df1, df2, ..., df90]` and at 11:10 A.M. you have 110 data frames named `[df1, df2, ..., df110]` (the number of data frames depends on the number of csv files in a folder that is updated constantly). How would you use Python to create such a list of data frames automatically? – Alberto_Márquez Mar 01 '18 at 03:06
  • If the df's don't change once they exist (like, 1 always stays 1, 2 always stays 2), I'd write a memoize class that caches the results for data frames already seen, then concat the cached results with the new df's that haven't been seen yet. If the df's are constantly changing, you have no choice but to redo the process every time, and your best solution depends on the resources you have (such as parallel-mapping the calculation into pieces and using multiprocessing). – the_constant Mar 01 '18 at 04:20
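
(A minimal sketch of the caching idea from the comment above; the `_cache` dict and `prepare()` helper are hypothetical names, and it assumes a csv file never changes once it exists:)

import os
import pandas as pd

_cache = {}  # maps csv file name -> prepared DataFrame, filled lazily

def prepare(path, name):
    # read, index and rename a csv only the first time it is seen
    if name not in _cache:
        frame = pd.read_csv(os.path.join(path, '%s.csv' % name))
        _cache[name] = frame.set_index('D').rename(columns={'C': name})
    return _cache[name]

# names is whatever list of csv file names exists right now:
# df_concat = pd.concat([prepare(path, n) for n in names], axis=1)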

reduce would work on a list instead of a dictionary.

Try this:

Create a list of data frames (df)

import pandas as pd
import os
from functools import reduce

path = r'C:\Users\ra\Desktop\Px\a'

df = []
x = [#vector that contains the names of the csv files as strings]
for j in x:
    # rename column 'C' to the file name so the merged columns stay distinct
    df.append(pd.read_csv(os.path.join(path, '%s.csv' % j)).rename(columns={'C': j}))

df_merged = reduce(lambda left, right: pd.merge(left, right, how='outer', on=['D']), df)
  • Could you tell me how to do it, @Ankit Malik? I'm sorry if that is too basic, but I'm a newbie. – Alberto_Márquez Mar 01 '18 at 01:10
  • I have edited my answer. Create a list of data frames after reading them all. Accept if it works for you. – Ankit Malik Mar 01 '18 at 01:13
  • `dataframe_list = [df1, df2, df3, .... df100]` This assumes that I know the number of data frames that I have to merge. But, in reality, `dataframe_list` will have a dynamic number of data frames that changes over time depending on the number of files in folder `a`. – Alberto_Márquez Mar 01 '18 at 01:26
  • You can create a list of df just like a dictionary of df. Please see the latest edit. – Ankit Malik Mar 01 '18 at 07:57
  • I tried your code and it lets me use the reduce method, but I got a `MemoryError:` output. In this case I think my only choice is to concatenate the data frames. – Alberto_Márquez Mar 01 '18 at 14:54

First of all, I want to thank everyone who helped me find a solution. I have to say that this is my first time posting a question on Stack Overflow, and the experience has been very nice. I also want to thank @AnkitMalik and @NoticeMeSenpai, because their effort helped me find a very good solution.

My question was about merging the data frames in a dictionary {} by using functools.reduce(). But, as @AnkitMalik pointed out, this only works for lists []. @NoticeMeSenpai recommended the use of pandas.concat() in order to make this work. The code below is the one that works for me:

import pandas as pd
import os

path = r'C:\Users\ra\Desktop\Px\a'

df = []  # list of data frames
x = [#vector that contains the names of the csv files as strings]
for j in x:
    # reads every csv file in folder 'a' as a data frame, sets the date column 'D'
    # as its index and renames the value column 'C' to the name of the csv file
    df.append(pd.read_csv(os.path.join(path, '%s.csv' % j)).set_index('D').rename(columns={'C': j}))

df_concat = pd.concat(df, axis=1)  # concatenates every data frame in the list 'df', aligning rows on the date index
df_concat.to_csv(os.path.join(path, 'xxx.csv'))  # saves the concatenated data frame to the 'xxx.csv' file in folder 'a'
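
(For the dynamic part of the question, the vector x can also be built automatically from whatever csv files are currently in folder 'a'. A minimal sketch, assuming the standard glob module; the code above fills x by hand instead:)

import glob
import os

path = r'C:\Users\ra\Desktop\Px\a'
# one entry per csv file currently in the folder, without the '.csv' extension
x = [os.path.splitext(os.path.basename(f))[0]
     for f in glob.glob(os.path.join(path, '*.csv'))]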