41

I have multiple (more than 100) dataframes. How can I concat all of them?

The problem is that I have too many dataframes to write them into a list by hand, like this:

>>> cluster_1 = pd.DataFrame([['a', 1], ['b', 2]],
...                    columns=['letter', 'number'])


>>> cluster_1
  letter  number
0      a       1
1      b       2


>>> cluster_2 = pd.DataFrame([['c', 3], ['d', 4]],
...                    columns=['letter', 'number'])


>>> cluster_2
  letter  number
0      c       3
1      d       4


>>> pd.concat([cluster_1, cluster_2])
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

The names of my N dataframes are cluster_1, cluster_2, cluster_3,..., cluster_N. The number N can be very high.

How can I concat N dataframes?

PParker
  • `I can not write them manually in a list`. The solution to this has nothing to do with `concat`. You need to fix your upstream process to produce a list rather than 100s of variables. – jpp Dec 21 '18 at 00:38
  • I don't understand how the answer found in another post can help me with my question. I can see how it works for a small number of dataframes, but not for many dataframes, like 100 or more. – PParker Dec 21 '18 at 00:41
  • I've added a second duplicate to help you. You need to restructure your logic to NOT create a variable number of variables. A `dict` or `list` would work fine with `pd.concat`. – jpp Dec 21 '18 at 00:43
  • @jpp I totally agree. I have been trying to do this for the last 2 days, but I failed. – PParker Dec 21 '18 at 00:43
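A minimal sketch of what jpp's suggestion could look like, assuming the frames are produced in a loop. `make_cluster`, `n_clusters` and `clusters` are hypothetical names; `make_cluster` only stands in for whatever step currently creates `cluster_1`, `cluster_2`, and so on.

```python
import pandas as pd


def make_cluster(i):
    """Hypothetical stand-in for the code that builds the i-th dataframe."""
    first = chr(ord("a") + 2 * i)
    second = chr(ord("a") + 2 * i + 1)
    return pd.DataFrame({"letter": [first, second],
                         "number": [2 * i + 1, 2 * i + 2]})


n_clusters = 3  # illustrative; the same code works for 100 or more

# Store each frame under a key in a dict instead of in its own variable.
clusters = {f"cluster_{i + 1}": make_cluster(i) for i in range(n_clusters)}

# Concatenate all stored frames, however many there are.
combined = pd.concat(list(clusters.values()), ignore_index=True)
print(combined)
```

With the frames in a container from the start, there is nothing to collect by name afterwards, and a single frame is still reachable as `clusters["cluster_2"]`.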

3 Answers

87

I think you can just put them into a list and then concat the list. In pandas, reading in chunks (for example with the chunksize option of pd.read_csv) kind of already does this: it hands you the pieces one at a time. I personally concat them this way when I read in chunks.

pdList = [df1, df2, ...]  # List of your dataframes
new_df = pd.concat(pdList)
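As an illustration of the chunked-reading case, here is a small sketch that assumes "the chunk function" means the `chunksize` option of `pd.read_csv`. The CSV text is made-up sample data held in memory so the example runs on its own.

```python
import io

import pandas as pd

# Made-up sample data; in practice this would be a file path.
csv_text = "letter,number\na,1\nb,2\nc,3\nd,4\ne,5\n"

# With chunksize set, read_csv yields DataFrames of up to 2 rows each.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
chunks = [chunk for chunk in reader]

# The list of chunks is concatenated like any other list of dataframes.
new_df = pd.concat(chunks)
print(new_df)
```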

To create pdList automatically, assuming the names of your dataframes always start with "cluster_":

# Collect every variable in the current scope whose name starts with 'cluster_'
pdList = []
pdList.extend(value for name, value in locals().items() if name.startswith('cluster_'))
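A slightly more defensive variant, offered as a sketch rather than as the answer's method: it keeps only real DataFrames and orders them by the number after the prefix. `collect_frames` and `namespace` are hypothetical names. The hand-built dict here stands in for `globals()` or `locals()`, which you would pass in a real script.

```python
import re

import pandas as pd


def collect_frames(namespace, prefix="cluster_"):
    """Return the DataFrames named <prefix><integer>, ordered by that integer."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    found = []
    for name, value in namespace.items():
        match = pattern.fullmatch(name)
        # Skip names without a numeric suffix and values that are not DataFrames.
        if match and isinstance(value, pd.DataFrame):
            found.append((int(match.group(1)), value))
    found.sort(key=lambda pair: pair[0])
    return [frame for _, frame in found]


# Stand-in for globals(): three frames plus one unrelated variable.
namespace = {
    "cluster_10": pd.DataFrame({"number": [10]}),
    "cluster_2": pd.DataFrame({"number": [2]}),
    "cluster_1": pd.DataFrame({"number": [1]}),
    "cluster_names": ["not", "a", "frame"],
}

frames = collect_frames(namespace)
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Sorting by the integer suffix avoids the plain string ordering in which "cluster_10" comes before "cluster_2".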
Rui Nian
  • How can I avoid writing the list pdList manually? It is getting too long assuming more than 100 dataframes. This is my key problem – PParker Dec 21 '18 at 00:31
  • Hi PParker, I updated the answer for you to create the pdList. – Rui Nian Dec 21 '18 at 00:47
  • Thank you very much. This is a nice solution and it works. For other people who want to try it: first initialise the list with pdList=[]. Also make sure that you don't have other dataframes whose names start with "cluster_" but which have different dimensions and should not be included. – PParker Dec 21 '18 at 14:36
  • @RuiNian How do I concatenate if my list holds the dataframe names as strings, i.e. pdList=['df1','df2','df3',.....]? In this case new_df=pd.concat(pdList) throws an error. – user11580242 May 06 '21 at 21:15
  • I don't think you can concatenate it that way, because the dataframes are objects in memory whereas the strings that hold the dataframe names are just strings. Python cannot recognize that they are df names. To overcome this, all you need to do is remove the quotation marks in your list. That way, the list holds the actual dataframes instead of strings. – Rui Nian May 07 '21 at 22:53
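If the names really do arrive as strings, one possible approach is to look each name up in a dict that maps names to frames. This is only a sketch; `frames_by_name` and `wanted` are hypothetical names, and the three small frames are made-up sample data.

```python
import pandas as pd

# Made-up frames registered under their names.
frames_by_name = {
    "df1": pd.DataFrame({"number": [1, 2]}),
    "df2": pd.DataFrame({"number": [3, 4]}),
    "df3": pd.DataFrame({"number": [5, 6]}),
}

# The list of names as strings, as in the comment above.
wanted = ["df1", "df3"]

# Translate each name into its DataFrame before calling concat.
new_df = pd.concat([frames_by_name[name] for name in wanted], ignore_index=True)
print(new_df)
```

A misspelled name raises a `KeyError` at the lookup, which is usually easier to diagnose than an error from inside `pd.concat`.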
10

Generally it goes like:

frames = [df1, df2, df3]
result = pd.concat(frames)

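Note: pd.concat keeps each frame's original index by default, so index values can repeat in the result. Pass ignore_index=True to get a fresh 0..n-1 index. Read more details on different types of merging here.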

For a large number of data frames: If you have hundreds of data frames, you can still create the list ("frames" in the code snippet) with a for loop, whether they are on disk or in memory. If they are on disk, it is easy: save all the df's in one single folder, then read every file from that folder.

If you are generating the df's in memory, maybe try saving them as .pkl files first.
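The folder approach could be sketched as follows. The temporary folder, the `cluster_*.pkl` file names and the sample frames are assumptions made so the example runs on its own; in practice the files would already exist in your own folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    # Write three small sample frames so there is something to read back.
    for i in range(1, 4):
        sample = pd.DataFrame({"cluster": [i, i], "number": [10 * i, 10 * i + 1]})
        sample.to_pickle(folder / f"cluster_{i}.pkl")

    # Read every matching file in the folder into a list, then concat once.
    paths = sorted(folder.glob("cluster_*.pkl"))
    frames = [pd.read_pickle(path) for path in paths]
    result = pd.concat(frames, ignore_index=True)

print(result)
```

Note that `sorted` orders the paths as strings, so with ten or more files `cluster_10.pkl` would come before `cluster_2.pkl` unless the numbers are zero-padded.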

zafrin
  • Can you be more specific about that, please? So you suggest that I export all the dataframes and then read them into a list using a loop? – PParker Dec 21 '18 at 00:35
  • How do you have the data frames saved right now? Where are they saved? Or are they being generated in memory by your code? – zafrin Dec 21 '18 at 00:38
4

Use:

pd.concat(your_list_of_dataframes)

And if you want a regular index:

pd.concat(your_list_of_dataframes, ignore_index=True)
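To make the difference concrete, here is a short sketch that reuses the two frames from the question and compares the resulting index with and without `ignore_index=True`. The names `kept` and `fresh` are only for illustration.

```python
import pandas as pd

cluster_1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
cluster_2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])

# Default: each frame keeps its own index, so labels repeat.
kept = pd.concat([cluster_1, cluster_2])
print(kept.index.tolist())   # [0, 1, 0, 1]

# ignore_index=True: the result gets a new consecutive index.
fresh = pd.concat([cluster_1, cluster_2], ignore_index=True)
print(fresh.index.tolist())  # [0, 1, 2, 3]
```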
U13-Forward