This problem is more conceptual/theoretical (it has to do with run times for very large datasets), so I apologize for not having a minimal example to show.
I have a bunch of DataFrames from two different sensors that I need to eventually concatenate into two very large DataFrames, one per sensor (`df_snsr1` and `df_snsr2`), and then left join into a single DataFrame. My data is such that I can also join first and then concat, or some combination of the two. I am trying to figure out the most efficient way to do this.
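To be concrete, the join-first ordering I have in mind would look something like the sketch below (assuming, hypothetically, that the pickle files from the two sensors pair up one-to-one):

```python
import pandas as pd

# join each small per-file pair first, then concatenate the joined results
joined_list = []
for f1, f2 in zip(my_pickle_files_snsr1, my_pickle_files_snsr2):
    small1 = pd.read_pickle(f1)
    small2 = pd.read_pickle(f2)
    joined_list.append(small1.join(small2, on=['some', 'columns']))
df_all = pd.concat(joined_list)
```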
From reading this SO answer I know that `pandas.concat` allocates space for the concatenation of all of its dataframes, and that if you do this in a loop it can lead to `O(N**2)` copying and some major slowdowns. Thus I am currently first building a big list of dataframes (loaded from files), concatenating them all at once, and then joining the two big dataframes:
```python
import pandas as pd

df_list = []
for file in my_pickle_files_snsr1:        # O(M) loop over M files
    df_list.append(pd.read_pickle(file))  # O(1) append, M times
df_snsr1 = pd.concat(df_list)             # O(N) copies of N records
# repeat for sensor 2 (df_snsr2)
df_snsr1.join(df_snsr2, on=['some', 'columns'])  # O(dunno, maybe bears?)
```
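For contrast, my understanding of the anti-pattern that the linked answer warns against is growing the frame inside the loop, something like this sketch:

```python
import pandas as pd

# the slow way: each iteration re-copies everything accumulated so far,
# so the total copying is O(N**2) in the number of records
df_snsr1 = pd.DataFrame()
for file in my_pickle_files_snsr1:
    df_snsr1 = pd.concat([df_snsr1, pd.read_pickle(file)])
```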
I am unable to find anything about execution speed in the documentation on `pandas.DataFrame.join`. Is it `O(N)`? `O(N**2)`? My thought is that if it is of similar order to `pandas.concat`, then it really doesn't matter which order I do the two operations in. If it is `O(N**2)`, however, then it will likely be more efficient for me to join many small dataframes and then concatenate them, rather than concat and then join. The overall operation takes long enough that it is worthwhile for me to ask the question here, so "run it and see" isn't going to work.
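For what it's worth, I could probe the empirical scaling of `join` on synthetic frames with something like the sketch below (shapes and sizes are made up), doubling `n` and watching whether the runtime roughly doubles or quadruples, but I would still rather know the expected complexity than extrapolate from toy data:

```python
import time
import numpy as np
import pandas as pd

def time_join(n):
    # two synthetic frames sharing a default RangeIndex, n rows each
    left = pd.DataFrame({'a': np.random.rand(n)})
    right = pd.DataFrame({'b': np.random.rand(n)})
    start = time.perf_counter()
    left.join(right)  # index-on-index join
    return time.perf_counter() - start

# roughly 2x time per doubling suggests O(N); roughly 4x suggests O(N**2)
for n in (250_000, 500_000, 1_000_000, 2_000_000):
    print(n, time_join(n))
```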
Does anybody know what algorithm `join` uses and what its execution big-O order is? Or does anybody have any other suggestions on getting the most efficient combination of `join` and `concat`?