I'm looking for a better way to do some calculations that I currently handle in a rather tedious fashion; some of the dataframes being processed can be quite large, so I'd like to change this.
I would like to compare each row in one dataframe to each row in another dataframe. This by nature scales as O(n^2), but I was wondering if there is a good way of doing it?
import pandas as pd

df1 = pd.DataFrame([
    {"value": 1, "timestamp": "1970-01-01T00:00:00Z"},
    {"value": 2, "timestamp": "1970-01-01T00:00:00Z"},
    ...
])
df2 = pd.DataFrame([
    {"value": 3, "timestamp": "1970-01-01T00:00:03Z"},
    {"value": 4, "timestamp": "1970-01-01T00:00:04Z"},
    ...
])
With these two dataframes, I would like to be able to run the calculation for the pairs 1-3, 1-4, 2-3 and 2-4.
The calculation involves comparing the columns of each 'row pair'. My first method was simply to use the product function from the itertools module, i.e.
from itertools import product

# X and Y are the (index, row) iterators, i.e. df1.iterrows() and df2.iterrows()
for (_, x), (_, y) in product(df1.iterrows(), df2.iterrows()):
    func(x, y)
This of course doesn't scale nicely. In desperation I came up with a new 'solution', which I can't imagine is any more scalable:
df1.apply(lambda x: df2.apply(lambda y: func(x, y), axis=1), axis=1)
This also doesn't scale well; in fact, in the tests I've done it's actually slower.
This is the main idea of what I'd like to do, though certainly not an optimal way of doing it:
df1 = pd.DataFrame([
    {"value": 1, "timestamp": "1970-01-01T00:00:01Z"},
    {"value": 2, "timestamp": "1970-01-01T00:00:02Z"},
    ...
])
df2 = pd.DataFrame([
    {"value": 3, "timestamp": "1970-01-01T00:00:03Z"},
    {"value": 4, "timestamp": "1970-01-01T00:00:04Z"},
    ...
])
def func(x, y, lst):
    lst.append(x["value"] + y["value"])

results = []
df1.apply(lambda x: df2.apply(lambda y: func(x, y, results), axis=1), axis=1)
print(results)  # [4, 5, 5, 6, ...]
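For reference, these pairwise sums can also be produced without any Python-level row loop, by cross-merging the two frames and adding the resulting columns. This is only a sketch of the vectorised pattern (cross merge requires pandas >= 1.2), not a claim that it is optimal:

```python
import pandas as pd

df1 = pd.DataFrame([{"value": 1}, {"value": 2}])
df2 = pd.DataFrame([{"value": 3}, {"value": 4}])

# every row of df1 paired with every row of df2;
# the shared "value" column gets the default _x/_y suffixes
merged = df1.merge(df2, how="cross")
results = (merged["value_x"] + merged["value_y"]).tolist()
print(results)  # [4, 5, 5, 6]
```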
Edit: In response to Panda Kim's comment, I've come up with a new method. Unfortunately, it doesn't seem to be any faster.
The 'new method':
def apply_func(row, df_dict):
    start = row["ts_min_x"] - row["ts_min_y"]
    end = row["ts_max_x"] - row["ts_max_y"]
    start_ts = max(row["ts_min_x"], row["ts_min_y"])
    end_ts = min(row["ts_max_x"], row["ts_max_y"])
    # with the integer timestamps below this is a plain difference;
    # for datetime columns it would be (end_ts - start_ts).total_seconds()
    duration = end_ts - start_ts
    if row["id_x"] not in df_dict:
        df_dict[row["id_x"]] = {}
    if row["id_y"] not in df_dict[row["id_x"]]:
        df_dict[row["id_x"]][row["id_y"]] = []
    df_dict[row["id_x"]][row["id_y"]].append(
        (start, end, start_ts, duration)
    )
df1 = pd.DataFrame([
    {"id": 1, "active": True, "ts_min": 1, "ts_max": 10},
    {"id": 2, "active": False, "ts_min": 10, "ts_max": 12},
    {"id": 3, "active": True, "ts_min": 12, "ts_max": 19},
])
df2 = pd.DataFrame([
    {"id": 4, "active": True, "ts_min": 4, "ts_max": 9},
    {"id": 5, "active": True, "ts_min": 9, "ts_max": 16},
    {"id": 6, "active": False, "ts_min": 16, "ts_max": 19},
])
df_dict = {}
merged = df1.merge(df2, how="cross")
merged.apply(lambda x: apply_func(x, df_dict), axis=1)
res_df = pd.DataFrame(df_dict)
res_df = res_df.dropna(axis=1, how="all").dropna(axis=0, how="all")
This is the method I came up with. Is it possible to do what's being done here, but in a faster way?
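For comparison, the quantities computed inside apply_func can also be derived column-wise on the cross-merged frame in one shot. This is a sketch using the integer sample data above; the grouping of the results into a nested dict keyed by id_x/id_y is left out:

```python
import pandas as pd

df1 = pd.DataFrame([
    {"id": 1, "active": True, "ts_min": 1, "ts_max": 10},
    {"id": 2, "active": False, "ts_min": 10, "ts_max": 12},
    {"id": 3, "active": True, "ts_min": 12, "ts_max": 19},
])
df2 = pd.DataFrame([
    {"id": 4, "active": True, "ts_min": 4, "ts_max": 9},
    {"id": 5, "active": True, "ts_min": 9, "ts_max": 16},
    {"id": 6, "active": False, "ts_min": 16, "ts_max": 19},
])

merged = df1.merge(df2, how="cross")
# same arithmetic as apply_func, but as whole-column operations
merged["start"] = merged["ts_min_x"] - merged["ts_min_y"]
merged["end"] = merged["ts_max_x"] - merged["ts_max_y"]
merged["start_ts"] = merged[["ts_min_x", "ts_min_y"]].max(axis=1)
merged["end_ts"] = merged[["ts_max_x", "ts_max_y"]].min(axis=1)
merged["duration"] = merged["end_ts"] - merged["start_ts"]
```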