
This problem is more conceptual/theoretical (it concerns run times for very large datasets), so I apologize for not having a minimal example to show.

I have a bunch of DataFrames from two different sensors that I need to eventually concatenate into two very large DataFrames, one per sensor (df_snsr1 and df_snsr2), and then left join into a single DataFrame. My data is such that I can also join first and then concatenate, or some combination of the two. I am trying to figure out the most efficient way to do this.

From reading this SO answer I know that pandas.concat allocates space for the concatenation of all of its DataFrames, so if you concatenate inside a loop it can lead to O(N**2) copying and some major slowdowns. Thus I am currently building a big list of DataFrames (loaded from files), concatenating them all at once, and then joining the two big DataFrames:

import pandas as pd

df_list = []
for file in my_pickle_files_snsr1:  # O(M) loop over M files
    df_list.append(pd.read_pickle(file))  # O(1) append, M times
df_snsr1 = pd.concat(df_list)  # O(N) copies of N records
# repeat for sensor 2 (df_snsr2)
df_snsr1.join(df_snsr2, on=['some', 'columns'])  # O(dunno, maybe bears?)

I am unable to find anything about execution speed in the documentation for pandas.DataFrame.join. Is it O(N)? O(N**2)? My thought is that if it is of similar order to pandas.concat, then it really doesn't matter which order I do the two operations in. If it is O(N**2), however, then it will likely be more efficient for me to join many small DataFrames and then concatenate them (sketched below), rather than concat and then join. The overall operation takes long enough that it is worthwhile for me to ask the question here, so "run it and see" isn't going to work.
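For concreteness, the join-then-concat ordering I have in mind would look roughly like this (the one-to-one pairing of the sensor-1 and sensor-2 files, the file list names, and the rsuffix are just for illustration):

import pandas as pd

# Sketch of the alternative ordering: join each pair of small DataFrames first,
# then concatenate once at the end. Assumes the files from the two sensors can
# be paired one-to-one, which is an illustrative simplification.
joined_list = []
for f1, f2 in zip(my_pickle_files_snsr1, my_pickle_files_snsr2):
    small1 = pd.read_pickle(f1)
    small2 = pd.read_pickle(f2)
    joined_list.append(small1.join(small2, on=['some', 'columns'], rsuffix='_2'))
df_joined = pd.concat(joined_list)  # single O(N) concatenation at the end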

Does anybody know what algorithm join is using and what its execution big-O order is? Or does anybody have any other suggestions on getting the most-efficient combination of join and concat?

Engineero
    While I am also interested in the answer to your question, I recommend taking a look at the [solution](http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv) that dask provides for exactly this problem (i.e. reading lots of files into one DataFrame). It doesn't really support reading lots of pickle files but csv, parquet, hdf and many other file types are really easy to read in this way. `import dask.dataframe as dd; df_snsr1 = dd.read_csv(list_of_csv_files_or_path_regex); df_snsr1 = df_snsr1.compute()` – tobsecret Aug 06 '18 at 21:55
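Since dask.dataframe has no pickle reader, one possible workaround for the pickle files in the question (a sketch, assuming the question's my_pickle_files_snsr1 list and the standard dask.delayed / dask.dataframe.from_delayed API) would be:

import pandas as pd
import dask
import dask.dataframe as dd

# my_pickle_files_snsr1 is the hypothetical list of pickle paths from the question.
# Each pd.read_pickle call is wrapped in dask.delayed, the lazy pieces are assembled
# into one dask DataFrame, and .compute() materializes it as a pandas DataFrame.
parts = [dask.delayed(pd.read_pickle)(f) for f in my_pickle_files_snsr1]
df_snsr1 = dd.from_delayed(parts).compute()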

1 Answer


I think it depends on the options you pass to join (e.g. the type of join and whether to sort).

When using the default how='left', it appears that the result is sorted, at least for a single index (the docs only specify the output order for some of the how methods, and left isn't one of them). In any case, sorting is O(n log n). Each index lookup is O(1) and there are O(n) of them, so in that case the O(n log n) sort dominates.

By contrast, in the how='inner' case, it is specified that the order of the calling DataFrame is kept. In that case, we would expect O(n) (both for a possible set intersection and for the index lookups and insertions).
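A tiny sketch (not part of the benchmark below) to eyeball those orderings yourself:

import pandas as pd

# Deliberately unsorted indices, so any re-ordering done by join is visible.
a = pd.DataFrame({'v': [1, 2, 3]}, index=['c', 'a', 'b'])
b = pd.DataFrame({'w': [10, 30]}, index=['c', 'b'])

print(a.join(b, how='left'))   # check whether the result index comes back sorted
print(a.join(b, how='inner'))  # per the docs, the calling frame's order is kept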

In either case, as the size gets larger, various issues of cache locality (or lack thereof) start creeping up on you, and the actual time spent on random access over a large memory area starts to dominate. The above is only about algorithmic complexity.

As mentioned elsewhere, for larger datasets, Dask (or Spark) is the way to go.


But what do you say we test it (at least the how='left' case)? The code below is a bit more verbose than I would have liked (and the name generation is just plain silly), but it does just that. Essentially, it makes two DataFrames with random, unordered names, with a fraction `1 - replace_fraction` of them in common; then it joins them while measuring the time taken.

import time

import numpy as np
import pandas as pd
from IPython.core.magics.execution import _format_time as walltime

def make_names(n):
    names = [
        f'{x}{y}{z}' for (x, y), z in zip(
            np.random.choice(['foo', 'bar', 'hi'], (n, 2)),
            np.random.randint(0, n, size=n))
    ]
    return names

def work(n, replace_fraction=0.1):
    a_names = make_names(n)
    replace_n = int(n * replace_fraction)
    b_names = make_names(replace_n) + list(np.random.choice(a_names, size=n - replace_n, replace=False))
    np.random.shuffle(b_names)
    a = pd.DataFrame({
        'name': a_names,
        'v': np.random.uniform(size=n),
        'w': np.random.uniform(size=n),
    }).set_index('name')
    b = pd.DataFrame({
        'name': b_names,
        'v': np.random.uniform(size=n),
        'w': np.random.uniform(size=n),
    }).set_index('name')

    t0 = time.time()
    df = a.join(b, rsuffix='_r')
    dt = time.time() - t0
    return a, b, df, dt

Example: try work(4, .5).
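For instance (just a quick smoke test that unpacks the four return values):

a, b, df, dt = work(4, .5)
print(df)            # the joined result, with '_r' suffixes on overlapping columns
print(walltime(dt))  # human-readable join time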

Now, get some time measurements for a geometric series of sizes:

sizes = (2**np.arange(10, 23, .5)).astype(int)
times = []
for n in sizes:
    a, b, df, dt = work(n)
    times.append(dt)
    print(f'{n}: {walltime(dt)}')

# out:
1024: 2.9 ms
1448: 4.78 ms
2048: 4.37 ms
...
2965820: 18.2 s
4194304: 30.2 s
5931641: 44.8 s

Fit for n log n:

import matplotlib.pyplot as plt
from numpy.polynomial.polynomial import polyfit

n = np.array(sizes)
t = np.array(times)
b, m = polyfit(n * np.log(n), t, 1)

plt.plot(n/1e6, t, '.')
plt.plot(n/1e6, b + m * n * np.log(n), '-')
plt.xlabel('size [M]')
plt.ylabel('time [s]')
plt.show()

[Plot: measured join time [s] vs. size [M], data points with the fitted n log n curve]

(side note: `scipy.optimize.nnls` with all terms `n`, `log n`, `n log n`, `1` finds all coefficients 0 except for `n log n`, so the above is fine).
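For completeness, that check would look roughly like this (a sketch reusing the n and t arrays fitted above):

import numpy as np
from scipy.optimize import nnls

# Non-negative least squares over the candidate terms n, log n, n*log n, and a
# constant; n and t are the size/time arrays from the fit above.
A = np.column_stack([n, np.log(n), n * np.log(n), np.ones_like(n)])
coefs, residual = nnls(A, t)
print(coefs)  # only the n*log(n) coefficient should come out significantly non-zero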

Pierre D