Optimize dataframe filtering on large datasets, pandas

Question

I have a little challenge here and to be honest, I have absolutely no idea how to handle it.

I have this dataframe composed of 660,000 rows and about 50 columns. I need to filter this dataframe very frequently and retrieve the filtered dataframe as fast as possible (goal is to have a processing time <1second). I'd like to be able to run that locally on a laptop, therefore my "processing power" is limited.

I have multiple inputs to filter the dataframe, some are set manually (see input 1) some are retrieved from another script (see input 2, the other script is not included in the code here for simplicity).

I was hoping to simple filter through the dataset using df[(df.column == filtervalue)]. However, it seems that the processing time is way too long.

Therefore, I am wondering whether there are some technics to optimize such processing time or if on the contrary the only way to optimize that is to go with a server that has a good CPU / Memory capacity?

Thanks for the help

import pandas as pd


df = pd.read_csv('xxxxxxxx', sep=";", dtype={"id": str,"dataset1": str,"dataset2":str,"myposition":str,"bet_1_preflop":float,"bet_2_preflop":float,"bet_3_preflop":float,"bet_1_flop":float,"bet_2_flop":float,
                                "bet_3_flop":float,"bet_1_turn":float ,"bet_2_turn":float,"bet_3_turn":float,"bet_1_river":float,"bet_2_river":float, "bet_3_river":float,
                                "myhand":str,"myposition":str,"cards_flop":str,"cards_turn":str,"cards_river":str,"action1_preflop":str," action2_preflop":str, 
                                "action3_preflop":str,"action4_preflop":str, "action1_flop":str, "action2_flop":str, "action3_flop":str,"action4_flop":str,"action1_turn":str,
                                "action2_turn":str, "action3_turn":str, "action4_turn":str, "action1_river":str,"action2_river":str,  "action3_river":str,  "action4_river":str,
                                "action1_preflop_binary":'Int64', "action2_preflop_binary":'Int64', "action3_preflop_binary":'Int64', "action4_preflop_binary":'Int64', 
                                "action1_flop_binary":'Int64',"action2_flop_binary":'Int64',  "action3_flop_binary":'Int64',  "action4_flop_binary":'Int64',  "action1_turn_binary":'Int64',
                                "action2_turn_binary":'Int64',  "action3_turn_binary":'Int64',  "action4_turn_binary":'Int64',"action1_river_binary":'Int64', "action2_river_binary":'Int64',
                                "action3_river_binary":'Int64', "action4_river_binary":'Int64', "tiers":'Int64',"assorties":str,
                                "besthand_flop":str,"checker_flop":float,"handtype_flop":str,"topsuite_flop":'Int64',"topcolor_flop":'Int64',"besthand_turn":str,"checker_turn":float,"handtype_turn":str,
                                "topsuite_turn":'Int64',"topcolor_turn":'Int64',"besthand_river":str,"checker_river":float,"handtype_river":str,"topsuite_river":'Int64',"topcolor_river":'Int64'})
df = df.reset_index()

#Inputs for filters 1
myposition ="sb"

myhand = "ackc"
flop = "ad9d4h"
turn = "8d"
river = "th"

a1_preflop = "r"
a2_preflop = "r"
a3_preflop = "c"
a4_preflop = ""

a1_flop = "r"
a2_flop = "f"
a3_flop = ""
a4_flop = ""

a1_turn = ""
a2_turn = ""
a3_turn = ""
a4_turn = ""

a1_river = ""
a2_river = ""
a3_river = ""
a4_river = ""

#Inputs for filters 2 (from a different script)

tiers
assorties_status
best_allhands_flop[0]
best_allhands_flop[1]
best_allhands_flop[2]
highest_suite_flop
highest_color_flop
best_allhands_turn[0]
best_allhands_turn[1]
best_allhands_turn[2]  
highest_suite_turn
highest_color_turn
best_allhands_river[0]
best_allhands_river[1]
best_allhands_river[2]
highest_suite_river
highest_color_river

#filtre_preflop_a1 = df[(df.myposition == myposition) & (df.tiers == tiers) & (df.assorties == assorties_status) & (df.action1_preflop == a1_preflop)]
#filtre_preflop_a2 = df[(df.myposition == myposition) & (df.tiers == tiers) & (df.assorties == assorties_status) & (df.action1_preflop == a1_preflop) & (df.action2_preflop == a2_preflop)]
#filtre_preflop_a3 = df[(df.myposition == myposition) & (df.tiers == tiers) & (df.assorties == assorties_status) & (df.action1_preflop == a1_preflop) & (df.action2_preflop == a2_preflop) & (df.action3_preflop == a3_preflop)]
#filtre_preflop_a4 = df[(df.myposition == myposition) & (df.tiers == tiers) & (df.assorties == assorties_status) & (df.action1_preflop == a1_preflop) & (df.action2_preflop == a2_preflop) & (df.action3_preflop == a3_preflop) & (df.action4_preflop == a4_preflop)]

You can use isin() method. It will be faster. Look my answer about filtering pandas https://stackoverflow.com/a/69195120/16591526 — padu, Nov 24 '22 at 20:36
@padu I have another question. Is it possible / have you tried to optimize filtering (even when using the "isin" method) by either doing some multi-threading or by forcing multi-core CPU allocation to speed up the task? I am not even sure a filtering task could work with that sort of approach, but I'd rather ask — Raphaël Ambit, Nov 25 '22 at 09:51

Optimize dataframe filtering on large datasets, pandas

0 Answers0

Linked