
There are two dataframes:

df1

name foo bar
0 value1 value2 value3
1 value4 value5 value6
2 value7 value8 value9
3 value10 value11 value12
4 value13 value14 value15

df2

name foo bar
0 value10 value20 value30
1 value40 value50 value60
2 value70 value80 value90
3 value100 value110 value120
4 value130 value140 value150

How can I create dictionaries from these two dataframes and finally concatenate them, as in the example below? The real dataframes have 10000 rows and 100 columns each, and they are generated in real time in a loop, so most likely I cannot collect them all in a list at once; I can only add them to each other incrementally, one per iteration. If incremental concatenation is the only option, how can I avoid constantly copying the frames?

name foo bar
0 value1 value2 value3
1 value4 value5 value6
2 value7 value8 value9
3 value10 value11 value12
4 value13 value14 value15
0 value10 value20 value30
1 value40 value50 value60
2 value70 value80 value90
3 value100 value110 value120
4 value130 value140 value150
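
For reference, a minimal sketch that reproduces the toy result above (the two frames are rebuilt here so the snippet is self-contained; the real frames come from elsewhere):

import pandas as pd

df1 = pd.DataFrame({
    "name": ["value1", "value4", "value7", "value10", "value13"],
    "foo": ["value2", "value5", "value8", "value11", "value14"],
    "bar": ["value3", "value6", "value9", "value12", "value15"],
})
df2 = pd.DataFrame({
    "name": ["value10", "value40", "value70", "value100", "value130"],
    "foo": ["value20", "value50", "value80", "value110", "value140"],
    "bar": ["value30", "value60", "value90", "value120", "value150"],
})

# a single concat stacks the frames; the original 0-4 indices are kept on purpose
print(pd.concat([df1, df2]))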

I am basing this on the article "Why does concatenation of DataFrames get exponentially slower?" and on this benchmark: https://perfpy.com/16#/

I have tried the following, without success. The problem is that I do not really understand how to construct the dictionaries from the dataframes correctly in my case.

import pandas as pd

rows = []

df_a = df1
df_a = df_a.to_dict('dict')  # column-oriented dict: {column -> {index -> value}}

rows.append(df_a)

df_a = df2
df_a = df_a.to_dict('dict')

rows.append(df_a)

# this builds a 2-row frame whose cells are nested dicts,
# not the stacked 10-row frame shown above
df = pd.DataFrame(rows)

In the end, I want to run a loop over about 40 dataframes like df1 and df2 (with different values), collect the dictionaries as I go, and finally combine them into a single dataframe.
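
If the dictionary route from the benchmark is kept, a minimal sketch of how it could be applied here might look like the following; generate_frame is a hypothetical stand-in for whatever produces each dataframe in real time:

import pandas as pd

records = []

for i in range(40):
    df_part = generate_frame(i)  # hypothetical: yields one 10000 x 100 dataframe per iteration
    # to_dict('records') gives one plain {column: value} dict per row,
    # so extending keeps a flat list of rows instead of nesting whole frames
    records.extend(df_part.to_dict('records'))

# build the final frame once, from all collected rows
df_final = pd.DataFrame(records)

Note, though, that for frames of this size the comments below suggest a list of dataframes plus a single pd.concat is at least as fast and far lighter on memory than going through per-row dicts.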

  • Why use the dict step? Why don't you want to concatenate the dataframes directly? – 9769953 Jun 30 '23 at 14:24
  • Given the linked question: if this is about the fear of slowing things down (have you tested that this is a real problem for your case?!): if you know the sizes of the dataframes, you can reserve the size of the final dataframe beforehand and copy/move the data into that final dataframe. That may be much faster than any other solution (a sketch of this idea appears after the comment thread). – 9769953 Jun 30 '23 at 14:27
  • @9769953 40 concatenations are planned. The benchmark includes an example using dictionaries, and it says that combining them is much faster. All dataframes are generated in real time in a loop. Can you write an example of your idea of how to do it faster? Speed is what matters to me, not the dictionary step specifically, and I'm afraid of creating many copies of the dataframes. – sergeyvyazov Jun 30 '23 at 14:31
  • The benchmark isn't that appropriate: it deals with 1-row (and 3 column) dataframes and dicts. Small data structures tend to be faster in pure Python anyway; Pandas (and NumPy) will be much more efficient for large data structures. Which is why you need to test this for your own, relevant code. Your example does show small dataframes: 5 rows by 3 columns. Is that the actual dataframe size for your actual problem? – 9769953 Jun 30 '23 at 14:31
  • Don't go through a dictionary. Append your "dataframe parts" to a list, then pd.concat that list of dataframes after looping. – Scott Boston Jun 30 '23 at 14:32
  • @ScottBoston That idea comes from a potentially irrelevant benchmark linked in the question. What is really optimal, depends on the actual problem size; which we don't know. – 9769953 Jun 30 '23 at 14:33
  • @9769953 No. In a real dataframe there are 10000 rows, 100 columns – sergeyvyazov Jun 30 '23 at 14:33
  • Please add that to your question. Also add what the time constraints are: "as fast as possible" is not an appropriate answer, since usually, that would just mean throwing more money at the hardware. – 9769953 Jun 30 '23 at 14:34
  • Can you clarify: do you want to add 40 dataframes together, in a continuous loop, or a loop of 40 iterations where 2 dataframes are added together, or a loop where a dataframe is added each iteration? – 9769953 Jun 30 '23 at 14:36
  • With 10.000 by 100 sized dataframes, I very much doubt using dicts will be faster. But you can test that with randomly generated dataframes for starters. The solution proposed by ScottBoston might be good enough for you, but it depends on the actual problem; see also my comment above. – 9769953 Jun 30 '23 at 14:37
  • @9769953 I want to append the newly generated dataframe 2 to dataframe 1 on every iteration (about 40 iterations) – sergeyvyazov Jun 30 '23 at 14:42
  • @Scott Boston Can you please tell me how to correctly write the code for your proposal? – sergeyvyazov Jun 30 '23 at 14:43
  • @Scott Boston The dataframes are generated in real time in a loop. Most likely, I cannot add them all to the list at once; I can only add them one by one, each iteration – sergeyvyazov Jun 30 '23 at 14:45
  • Then it will become more and more inefficient – mozway Jun 30 '23 at 14:47
  • @mozway How to avoid inefficiency? – sergeyvyazov Jun 30 '23 at 14:53
  • To clarify one last thing: so you want to end up with a dataframe with 400.000 rows, is that correct? – 9769953 Jun 30 '23 at 15:02
  • @9769953 Yes exactly – sergeyvyazov Jun 30 '23 at 15:07
  • Then ScottBoston's answer is likely the most appropriate one. So is the linked duplicate, even though it answers a different question (it's just that the final answer is the same). – 9769953 Jun 30 '23 at 15:09
  • @9769953 Yes, I tried it. And it works well. Thank you very much! – sergeyvyazov Jun 30 '23 at 15:14
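
The preallocation idea from the comments above could be sketched roughly like this, assuming every generated frame has the same 10000 x 100 shape and the same columns (n_frames, n_rows, columns and generate_frame are illustrative placeholders, not part of the question):

import numpy as np
import pandas as pd

n_frames, n_rows = 40, 10000
columns = [f"col{i}" for i in range(100)]  # assumed shared column layout

# reserve the full final block once, then fill it slice by slice,
# so no intermediate concatenations (and no repeated copying) are needed
buffer = np.empty((n_frames * n_rows, len(columns)), dtype=object)

for i in range(n_frames):
    df_part = generate_frame(i)  # hypothetical real-time producer
    buffer[i * n_rows:(i + 1) * n_rows, :] = df_part[columns].to_numpy()

df_final = pd.DataFrame(buffer, columns=columns)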

1 Answer


I would do it like this:

list_dfs = []

# collect the pieces in a plain Python list (cheap, no copying of frame data)
for df in [df1, df2]:
    list_dfs.append(df)

# one concat at the end copies each frame exactly once
df_output = pd.concat(list_dfs)

As opposed to doing something like this (this is bad and memory intensive and slow for large numbers of dataframes):

df_out = pd.DataFrame()
for df in [df1, df2]:
    # every iteration copies everything accumulated so far into a brand-new frame
    df_out = pd.concat([df_out, df])
    # or, equivalently (DataFrame.append is deprecated and removed in pandas 2.0):
    # df_out = df_out.append(df)
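
Applied to the case in the question (about 40 frames generated in real time), the same pattern would look roughly like this; generate_frame is again a hypothetical placeholder for whatever produces each frame:

list_dfs = []

for i in range(40):
    df_part = generate_frame(i)  # hypothetical: one 10000 x 100 frame per iteration
    list_dfs.append(df_part)     # appending to a list is cheap; no frame data is copied yet

# a single concat at the end copies each frame exactly once
# (pass ignore_index=True if a fresh 0..N-1 index is wanted instead of the repeated 0-4 ones)
df_output = pd.concat(list_dfs)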
Scott Boston