
There are two dataframes:

df1

name foo bar
0 value1 value2 value3
1 value4 value5 value6
2 value7 value8 value9
3 value10 value11 value12
4 value13 value14 value15

df2

name foo bar
0 value10 value20 value30
1 value40 value50 value60
2 value70 value80 value90
3 value100 value110 value120
4 value130 value140 value150

How can I create dictionaries from these two dataframes and finally concatenate them, as in the example below? The real dataframes have 10000 rows and 100 columns each, and they are generated in real time in a loop, so most likely I cannot collect them all in a list at once; I can only add them to each other incrementally, one per iteration. If incremental concatenation is the only option, how can I avoid constantly copying the frames?

name foo bar
0 value1 value2 value3
1 value4 value5 value6
2 value7 value8 value9
3 value10 value11 value12
4 value13 value14 value15
0 value10 value20 value30
1 value40 value50 value60
2 value70 value80 value90
3 value100 value110 value120
4 value130 value140 value150
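
For reference, a minimal sketch that reproduces the toy result above (the two frames are rebuilt here so the snippet is self-contained; the real frames come from elsewhere):

import pandas as pd

df1 = pd.DataFrame({
    "name": ["value1", "value4", "value7", "value10", "value13"],
    "foo": ["value2", "value5", "value8", "value11", "value14"],
    "bar": ["value3", "value6", "value9", "value12", "value15"],
})
df2 = pd.DataFrame({
    "name": ["value10", "value40", "value70", "value100", "value130"],
    "foo": ["value20", "value50", "value80", "value110", "value140"],
    "bar": ["value30", "value60", "value90", "value120", "value150"],
})

# a single concat stacks the frames; the original 0-4 indices are kept on purpose
print(pd.concat([df1, df2]))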

I am basing this on the article "Why does concatenation of DataFrames get exponentially slower?" and on this benchmark: https://perfpy.com/16#/

I have tried the following, without success. The problem is that I do not really understand how to construct the dictionaries from the dataframes correctly in my case.

import pandas as pd

rows = []

df_a = df1
df_a = df_a.to_dict('dict')  # column-oriented dict: {column -> {index -> value}}

rows.append(df_a)

df_a = df2
df_a = df_a.to_dict('dict')

rows.append(df_a)

# this builds a 2-row frame whose cells are nested dicts,
# not the stacked 10-row frame shown above
df = pd.DataFrame(rows)

In the end, I want to run a loop over about 40 dataframes like df1 and df2 (with different values), collect the dictionaries as I go, and finally combine them into a single dataframe.
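
If the dictionary route from the benchmark is kept, a minimal sketch of how it could be applied here might look like the following; generate_frame is a hypothetical stand-in for whatever produces each dataframe in real time:

import pandas as pd

records = []

for i in range(40):
    df_part = generate_frame(i)  # hypothetical: yields one 10000 x 100 dataframe per iteration
    # to_dict('records') gives one plain {column: value} dict per row,
    # so extending keeps a flat list of rows instead of nesting whole frames
    records.extend(df_part.to_dict('records'))

# build the final frame once, from all collected rows
df_final = pd.DataFrame(records)

Note, though, that for frames of this size the comments below suggest a list of dataframes plus a single pd.concat is at least as fast and far lighter on memory than going through per-row dicts.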

  • Why use the dict step? Why don't you want to concatenate the dataframes directly? – 9769953 Jun 30 '23 at 14:24
  • Given the linked question: if this is about the fear of slowing things down (have you tested that this is a real problem for your case?!): if you know the sizes of the dataframes, you can reserve the size of the final dataframe beforehand and copy/move the data into that final dataframe. That may be much faster than any other solution (a sketch of this idea appears after the comment thread). – 9769953 Jun 30 '23 at 14:27
  • @9769953 40 concatenations are planned. The benchmark includes an example using dictionaries, and it says that combining them is much faster. All dataframes are generated in real time in a loop. Can you write an example of your idea of how to do it faster? Speed is what matters to me, not the dictionary step specifically, and I'm afraid of creating many copies of the dataframes. – sergeyvyazov Jun 30 '23 at 14:31
  • The benchmark isn't that appropriate: it deals with 1-row (and 3 column) dataframes and dicts. Small data structures tend to be faster in pure Python anyway; Pandas (and NumPy) will be much more efficient for large data structures. Which is why you need to test this for your own, relevant code. Your example does show small dataframes: 5 rows by 3 columns. Is that the actual dataframe size for your actual problem? – 9769953 Jun 30 '23 at 14:31
  • Don't go through a dictionary. Append your "dataframe parts" to a list, then pd.concat that list of dataframes after looping. – Scott Boston Jun 30 '23 at 14:32
  • @ScottBoston That idea comes from a potentially irrelevant benchmark linked in the question. What is really optimal, depends on the actual problem size; which we don't know. – 9769953 Jun 30 '23 at 14:33
  • @9769953 No. In a real dataframe there are 10000 rows, 100 columns – sergeyvyazov Jun 30 '23 at 14:33
  • Please add that to your question. Also add what the time constraints are: "as fast as possible" is not an appropriate answer, since usually, that would just mean throwing more money at the hardware. – 9769953 Jun 30 '23 at 14:34
  • Can you clarify: do you want to add 40 dataframes together, in a continuous loop, or a loop of 40 iterations where 2 dataframes are added together, or a loop where a dataframe is added each iteration? – 9769953 Jun 30 '23 at 14:36
  • With 10.000 by 100 sized dataframes, I very much doubt using dicts will be faster. But you can test that with randomly generated dataframes for starters. The solution proposed by ScottBoston might be good enough for you, but it depends on the actual problem; see also my comment above. – 9769953 Jun 30 '23 at 14:37
  • @9769953 I want to append the newly generated dataframe 2 to dataframe 1 on every iteration (about 40 iterations) – sergeyvyazov Jun 30 '23 at 14:42
  • @Scott Boston Can you please tell me how to correctly write the code for your proposal? – sergeyvyazov Jun 30 '23 at 14:43
  • @Scott Boston The dataframes are generated in real time in a loop. Most likely, I cannot add them all to the list at once; I can only add them one by one, each iteration – sergeyvyazov Jun 30 '23 at 14:45
  • Then it will become more and more inefficient – mozway Jun 30 '23 at 14:47
  • @mozway How to avoid inefficiency? – sergeyvyazov Jun 30 '23 at 14:53
  • To clarify one last thing: so you want to end up with a dataframe with 400.000 rows, is that correct? – 9769953 Jun 30 '23 at 15:02
  • @9769953 Yes exactly – sergeyvyazov Jun 30 '23 at 15:07
  • Then ScottBoston's answer is likely the most appropriate one. So is the linked duplicate, even though it answers a different question (it's just that the final answer is the same). – 9769953 Jun 30 '23 at 15:09
  • @9769953 Yes, I tried it. And it works well. Thank you very much! – sergeyvyazov Jun 30 '23 at 15:14
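
The preallocation idea from the comments above could be sketched roughly like this, assuming every generated frame has the same 10000 x 100 shape and the same columns (n_frames, n_rows, columns and generate_frame are illustrative placeholders, not part of the question):

import numpy as np
import pandas as pd

n_frames, n_rows = 40, 10000
columns = [f"col{i}" for i in range(100)]  # assumed shared column layout

# reserve the full final block once, then fill it slice by slice,
# so no intermediate concatenations (and no repeated copying) are needed
buffer = np.empty((n_frames * n_rows, len(columns)), dtype=object)

for i in range(n_frames):
    df_part = generate_frame(i)  # hypothetical real-time producer
    buffer[i * n_rows:(i + 1) * n_rows, :] = df_part[columns].to_numpy()

df_final = pd.DataFrame(buffer, columns=columns)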

1 Answer


I would do it like this:

list_dfs = []

# collect the pieces in a plain Python list (cheap, no copying of frame data)
for df in [df1, df2]:
    list_dfs.append(df)

# one concat at the end copies each frame exactly once
df_output = pd.concat(list_dfs)

As opposed to doing something like this (this is bad and memory intensive and slow for large numbers of dataframes):

df_out = pd.DataFrame()
for df in [df1, df2]:
    # every iteration copies everything accumulated so far into a brand-new frame
    df_out = pd.concat([df_out, df])
    # or, equivalently (DataFrame.append is deprecated and removed in pandas 2.0):
    # df_out = df_out.append(df)
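
Applied to the case in the question (about 40 frames generated in real time), the same pattern would look roughly like this; generate_frame is again a hypothetical placeholder for whatever produces each frame:

list_dfs = []

for i in range(40):
    df_part = generate_frame(i)  # hypothetical: one 10000 x 100 frame per iteration
    list_dfs.append(df_part)     # appending to a list is cheap; no frame data is copied yet

# a single concat at the end copies each frame exactly once
# (pass ignore_index=True if a fresh 0..N-1 index is wanted instead of the repeated 0-4 ones)
df_output = pd.concat(list_dfs)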
Scott Boston