
I'm currently looking for a better way to do some calculations that I'm doing in a quite tedious fashion. I'd like to change that, because some of the dataframes that will be processed can be quite large.

I would like to compare each row in one dataframe against each row in another dataframe. By nature this scales quadratically, but I was wondering if there is a good way of doing this?

df1 = pd.DataFrame([
    {"value": 1, "timestamp": "1970-01-01T00:00:00z"},
    {"value": 2, "timestamp": "1970-01-01T00:00:00z"},
    ...
])

df2 = pd.DataFrame([
    {"value": 3, "timestamp": "1970-01-01T00:00:03z"},
    {"value": 4, "timestamp": "1970-01-01T00:00:04z"},
    ...
])

With these two dataframes, I would like to be able to make calculations for the pairs 1-3, 1-4, 2-3 and 2-4.

The calculations that need to take place involve comparing the columns for each 'row-pair'. My first method of doing this was simply to use the product function from the itertools module, i.e.

from itertools import product

for (_, x), (_, y) in product(df1.iterrows(), df2.iterrows()):
    func(x, y)

This of course doesn't scale nicely. In desperation I've come up with a new 'solution', which I can't imagine is any more scalable:

df1.apply(lambda x: df2.apply(lambda y: func(x, y), axis=1), axis=1)

This also doesn't scale well; in fact, in the tests that I've done it's actually slower.
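
For context, a rough timing sketch that compares the two approaches (the frame sizes and func here are made up purely for illustration):

import timeit
from itertools import product

import pandas as pd

df1 = pd.DataFrame({"value": range(200)})
df2 = pd.DataFrame({"value": range(200)})

def func(x, y):
    return x["value"] + y["value"]

def with_product():
    # Pure Python loop over every row pair.
    return [func(x, y) for (_, x), (_, y) in product(df1.iterrows(), df2.iterrows())]

def with_nested_apply():
    # Nested DataFrame.apply over every row pair.
    return df1.apply(lambda x: df2.apply(lambda y: func(x, y), axis=1), axis=1)

print(timeit.timeit(with_product, number=3))
print(timeit.timeit(with_nested_apply, number=3))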

This is the main idea of what I'd like to do, though it's certainly not the optimal way of doing it:

df1 = pd.DataFrame([
    {"value": 1, "timestamp": "1970-01-01T00:00:01z"},
    {"value": 2, "timestamp": "1970-01-01T00:00:02z"},
    ...
])

df2 = pd.DataFrame([
    {"value": 3, "timestamp": "1970-01-01T00:00:03z"},
    {"value": 4, "timestamp": "1970-01-01T00:00:04z"},
    ...
])

def func(x, y, lst):
    # Toy calculation: collect the sum of the two rows' 'value' fields.
    lst.append(x["value"] + y["value"])

results = []

df1.apply(lambda x: df2.apply(lambda y: func(x, y, results), axis=1), axis=1)

print(results) # [4, 5, 5, 6, ...]

Edit: In response to the comment from Panda Kim, I've come up with a new method. Unfortunately, it doesn't seem to be any faster.

'New method'

def apply_func(row, df_dict):
    start = row["ts_min_x"] - row["ts_min_y"]
    end = row["ts_max_x"] - row["ts_max_y"]
    start_ts = max(row["ts_min_x"], row["ts_min_y"])
    end_ts = min(row["ts_max_x"], row["ts_max_y"])
    # With the integer ts_min/ts_max used in the example below; for real
    # timestamps this would be (end_ts - start_ts).total_seconds().
    duration = end_ts - start_ts

    # Nested dict keyed by id_x, then id_y, collecting one tuple per row pair.
    if row["id_x"] not in df_dict:
        df_dict[row["id_x"]] = {}

    if row["id_y"] not in df_dict[row["id_x"]]:
        df_dict[row["id_x"]][row["id_y"]] = []

    df_dict[row["id_x"]][row["id_y"]].append(
        (start, end, start_ts, duration)
    )

df1 = pd.DataFrame([
    {"id": 1, "active": True,  "ts_min": 1,  "ts_max": 10},
    {"id": 2, "active": False, "ts_min": 10, "ts_max": 12},
    {"id": 3, "active": True,  "ts_min": 12, "ts_max": 19},
])

df2 = pd.DataFrame([
    {"id": 4, "active": True,  "ts_min": 4,  "ts_max": 9},
    {"id": 5, "active": True,  "ts_min": 9,  "ts_max": 16},
    {"id": 6, "active": False, "ts_min": 16, "ts_max": 19},
])

df_dict = {}

merged = df1.merge(df2, how="cross")

merged.apply(lambda x: apply_func(x, df_dict), axis=1)

res_df = pd.DataFrame(df_dict)
res_df = res_df.dropna(axis=1, how="all").dropna(axis=0, how="all")

This is the method I came up with. Is it possible to do what's being done here, but in a faster way?
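
For reference, the same per-pair quantities could presumably be computed column-wise on the merged frame instead of row by row. A minimal sketch using the df1/df2 defined above, assuming the integer ts_min/ts_max columns from the example (real timestamps would need .dt.total_seconds() for the duration), and building one flat result frame instead of the nested dict:

merged = df1.merge(df2, how="cross")

# Column-wise versions of the per-row calculations in apply_func.
merged["start"] = merged["ts_min_x"] - merged["ts_min_y"]
merged["end"] = merged["ts_max_x"] - merged["ts_max_y"]
merged["start_ts"] = merged[["ts_min_x", "ts_min_y"]].max(axis=1)
merged["end_ts"] = merged[["ts_max_x", "ts_max_y"]].min(axis=1)
merged["duration"] = merged["end_ts"] - merged["start_ts"]

# One row per (id_x, id_y) pair.
res_df = merged[["id_x", "id_y", "start", "end", "start_ts", "duration"]]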

  • Can you give a more concrete example of the real function? Ideally a reproducible one – mozway Nov 08 '22 at 11:51
  • Outer product of numpy.ufunc can be utilized to speed up calculations. However it also could require a lot of memory because of huge number of rows. E.g. numpy.subtract.outer(df1.values, df2.values) for the subtraction function. https://numpy.org/doc/stable/reference/generated/numpy.ufunc.outer.html – Raibek Nov 08 '22 at 12:00
  • Maybe this helps (possible duplicate): https://stackoverflow.com/questions/17095101/compare-two-dataframes-and-output-their-differences-side-by-side – tturbo Nov 08 '22 at 12:14

2 Answers


merge with how='cross' gives the Cartesian product:

df1.merge(df2, how='cross')

output:

   value_x  value_y
0        1        3
1        1        4
2        2        3
3        2        4

then apply your func:

df1.merge(df2, how='cross').apply(lambda x: func(x['value_x'], x['value_y']), axis=1)



Example

func: f(x, y) = x * y

df1.merge(df2, how='cross').apply(lambda x: x['value_x'] * x['value_y'], axis=1)

output:

0    3
1    4
2    6
3    8
dtype: int64
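
If func boils down to column arithmetic like this, the row-wise apply can be skipped entirely and the columns of the cross-merged frame combined directly, which is typically much faster; a sketch using the question's df1/df2:

merged = df1.merge(df2, how='cross')
result = merged['value_x'] * merged['value_y']  # same products, computed column-wise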
Panda Kim
  • It seems like this is the kind of calculation I would need, though the implementation I've made doesn't seem to be any improvement over the pure row-pair iterations – Martin Lange Nov 09 '22 at 06:32
  • This was what I ended up with, just with some modifications to the lambda function and so on. Thanks! – Martin Lange Nov 10 '22 at 20:32

NumPy's ufunc.outer can be used with any relevant ufunc, such as numpy.add, numpy.multiply, etc.

import numpy as np

np.add.outer(df1["value"].to_numpy(), df2["value"].to_numpy()).flatten()

array([4, 5, 5, 6], dtype=int64)
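
As noted in the comments, the outer product materialises a len(df1) × len(df2) array, so memory can become a concern for large frames. To keep track of which pair produced which result, the same outer result can also be wrapped in a labelled frame; a sketch using the question's value columns:

import numpy as np
import pandas as pd

out = np.add.outer(df1["value"].to_numpy(), df2["value"].to_numpy())
res = pd.DataFrame(out, index=df1["value"], columns=df2["value"])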
Raibek