If I have two tables, I can easily combine them in SQL using something like
SELECT a.*, b.* FROM table_1 a, table_2 b
WHERE (a.id < 1000 OR b.id > 700)
AND a.date < b.date
AND (a.name = b.first_name OR a.name = b.last_name)
AND (a.location = b.origin OR b.destination = 'home')
and there could be many more conditions. Note that this is just an example and the set of conditions may be anything. The two easiest solutions in pandas that support any set of conditions are:
- Compute a cross product of the tables and then filter one condition at a time.
- Loop over one DataFrame (apply, itertuples, ...) and filter the second DataFrame in each iteration. Append the filtered DataFrames from each iteration.
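For concreteness, the two approaches above might look roughly like this; the data and column names are hypothetical, chosen only to mirror the SQL example:

```python
import pandas as pd

# Tiny illustrative tables; columns mirror the SQL example (hypothetical data).
table_1 = pd.DataFrame({
    "id": [1, 2, 800],
    "name": ["ann", "bob", "cat"],
    "date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "location": ["home", "work", "home"],
})
table_2 = pd.DataFrame({
    "id": [701, 5, 900],
    "first_name": ["ann", "dan", "cat"],
    "last_name": ["x", "bob", "y"],
    "date": pd.to_datetime(["2020-06-01", "2020-01-15", "2020-04-01"]),
    "origin": ["home", "away", "away"],
    "destination": ["home", "home", "work"],
})

# Approach 1: cross product (pandas >= 1.2), then filter condition by condition.
# Only the colliding columns (id, date) receive the _a/_b suffixes.
cross = table_1.merge(table_2, how="cross", suffixes=("_a", "_b"))
mask = (
    ((cross["id_a"] < 1000) | (cross["id_b"] > 700))
    & (cross["date_a"] < cross["date_b"])
    & ((cross["name"] == cross["first_name"]) | (cross["name"] == cross["last_name"]))
    & ((cross["location"] == cross["origin"]) | (cross["destination"] == "home"))
)
result_cross = cross[mask]

# Approach 2: loop over one DataFrame, filter the other in each iteration,
# and append the filtered pieces.
pieces = []
for row in table_1.itertuples(index=False):
    matched = table_2[
        ((row.id < 1000) | (table_2["id"] > 700))
        & (row.date < table_2["date"])
        & ((row.name == table_2["first_name"]) | (row.name == table_2["last_name"]))
        & ((row.location == table_2["origin"]) | (table_2["destination"] == "home"))
    ]
    if not matched.empty:
        pieces.append(matched.assign(id_a=row.id, name_a=row.name))
result_loop = pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame()
```

Both variants produce the same matches on this toy data; the point is that approach 1 materializes all len(table_1) * len(table_2) pairs before filtering, while approach 2 trades that memory for a Python-level loop.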
With huge datasets (at least a few million rows per DataFrame), the first solution is infeasible because of the memory it requires, and the second one is considered an anti-pattern (https://stackoverflow.com/a/55557758/2959697). Either way will be rather slow.
What is the idiomatic pandas way to proceed in this general case?
Note that I am not only interested in a solution to this particular problem but in the general concept of how to translate these types of statements. Can I use pandas.eval? Is it possible to perform a "conditional merge"? Etc.
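One partial idea along the "conditional merge" line would be to split an OR of equality conditions into ordinary equi-joins and filter the remaining non-equi conditions afterwards. A sketch, again with the hypothetical column names from the SQL example:

```python
import pandas as pd

# Hypothetical data mirroring the SQL example.
table_1 = pd.DataFrame({
    "id": [1, 800],
    "name": ["ann", "cat"],
    "date": pd.to_datetime(["2020-01-01", "2020-03-01"]),
    "location": ["home", "home"],
})
table_2 = pd.DataFrame({
    "id": [701, 900],
    "first_name": ["ann", "cat"],
    "last_name": ["x", "y"],
    "date": pd.to_datetime(["2020-06-01", "2020-04-01"]),
    "origin": ["home", "away"],
    "destination": ["home", "work"],
})

# Turn "a.name = b.first_name OR a.name = b.last_name" into two equi-joins,
# so pandas can use a hash join instead of a full cross product.
m1 = table_1.merge(table_2, left_on="name", right_on="first_name", suffixes=("_a", "_b"))
m2 = table_1.merge(table_2, left_on="name", right_on="last_name", suffixes=("_a", "_b"))
merged = pd.concat([m1, m2]).drop_duplicates()

# Apply the remaining, non-equi conditions as boolean filters.
result = merged[
    ((merged["id_a"] < 1000) | (merged["id_b"] > 700))
    & (merged["date_a"] < merged["date_b"])
    & ((merged["location"] == merged["origin"]) | (merged["destination"] == "home"))
]
```

This only helps when at least one condition is an equality, though, so it does not answer the fully general case asked about above.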