
If I have two tables, I can easily combine them in SQL using something like

SELECT a.*, b.* FROM table_1 a, table_2 b
WHERE (a.id < 1000 OR b.id > 700)
  AND a.date < b.date
  AND (a.name = b.first_name OR a.name = b.last_name)
  AND (a.location = b.origin OR b.destination = 'home')

and there could be many more conditions; this is just an example, and the set of conditions may be anything. The two easiest solutions in pandas that support an arbitrary set of conditions are:

  1. Compute a cross product of the tables and then filter one condition at a time.
  2. Loop over one DataFrame (apply, itertuples, ...) and filter the second DataFrame in each iteration. Append the filtered DataFrames from each iteration.

For huge datasets (at least a few million rows per DataFrame), the first solution is infeasible because of the memory it requires, and the second is considered an anti-pattern (https://stackoverflow.com/a/55557758/2959697). Either approach will be rather slow; both are sketched below.
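
For reference, a minimal sketch of both approaches, using hypothetical frames df_a and df_b whose columns mirror the SQL example above (the cross merge requires pandas >= 1.2):

import pandas as pd

df_a = pd.DataFrame({
    "id": [1, 2],
    "date": ["2021-01-01", "2021-02-01"],
    "name": ["x", "y"],
    "location": ["home", "work"],
})
df_b = pd.DataFrame({
    "id": [800, 900],
    "date": ["2021-03-01", "2021-04-01"],
    "first_name": ["x", "z"],
    "last_name": ["y", "x"],
    "origin": ["work", "home"],
    "destination": ["home", "work"],
})

# Solution 1: full cross product, then filter one condition at a time.
# Overlapping column names ("id", "date") get the suffixes; the rest keep theirs.
cross = df_a.merge(df_b, how="cross", suffixes=("_a", "_b"))
res1 = cross[
    ((cross["id_a"] < 1000) | (cross["id_b"] > 700))
    & (cross["date_a"] < cross["date_b"])
    & ((cross["name"] == cross["first_name"]) | (cross["name"] == cross["last_name"]))
    & ((cross["location"] == cross["origin"]) | (cross["destination"] == "home"))
]

# Solution 2: loop over one frame and filter the other in each iteration
# (the anti-pattern), then concatenate the pieces.
pieces = []
for row in df_a.itertuples(index=False):
    match = df_b[
        ((row.id < 1000) | (df_b["id"] > 700))
        & (row.date < df_b["date"])
        & ((row.name == df_b["first_name"]) | (row.name == df_b["last_name"]))
        & ((row.location == df_b["origin"]) | (df_b["destination"] == "home"))
    ]
    # attach the left-hand row's columns so the output matches solution 1
    pieces.append(match.rename(columns={"id": "id_b", "date": "date_b"})
                       .assign(id_a=row.id, date_a=row.date,
                               name=row.name, location=row.location))
res2 = pd.concat(pieces)

The mask in solution 1 could also be written as a single string, e.g. cross.query("(id_a < 1000 or id_b > 700) and date_a < date_b and ..."), since DataFrame.query goes through the same expression engine as pandas.eval, but that still requires materializing the cross product first.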

What is the idiomatic pandas way to proceed in this general case?

Note that I am not only interested in a solution to this particular problem but in the general concept of how to translate these types of statements. Can I use pandas.eval? Is it possible to perform a "conditional merge"? Etc.

staxyz
  • always helpful to share sample data, with expected output – sammywemmy Dec 15 '21 at 23:34
  • If you were really dealing with massive data sets, you could start using pyspark and this wouldn't be an issue – Chris Dec 15 '21 at 23:39
  • There are certainly solutions without `pandas` that will work, but in reality we cannot always change the software stack. @sammywemmy In principle yes, but I am looking for a general concept. – staxyz Dec 15 '21 at 23:41
  • Sample data helps with providing possible options ... for this question, a more efficient route would be to do an isin, or a merge for the equality conditions, before filtering on the non-equi relational operators (a rough sketch of this idea follows after the comments). Again, with sample data, solutions can be proffered, and you can then build off the ideas for your use case. Pandas is more of a brick-upon-brick library, compared to SQL, where you tell it what to do and it figures it out for you – sammywemmy Dec 15 '21 at 23:47
  • I guess SQL does the 2nd solution under the hood, which in the case of pandas will be slow. When it comes to bottlenecks and pandas, you must choose between readability of the code and performance: slow and readable with pandas, or a bit ugly and fast by switching to numpy, numba, some multiprocessing, etc. – dankal444 Dec 16 '21 at 21:45
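
Following up on the merge-then-filter idea from the comments, a rough sketch, again using the hypothetical df_a / df_b from the question: each OR branch of the name equality becomes an ordinary equi-merge, the partial results are concatenated and deduplicated, and only then are the non-equi conditions applied to the much smaller candidate set.

import pandas as pd

# Each branch of "a.name = b.first_name OR a.name = b.last_name" is a plain
# equi-merge, so the full cross product is never materialized.
eq_first = df_a.merge(df_b, left_on="name", right_on="first_name",
                      suffixes=("_a", "_b"))
eq_last = df_a.merge(df_b, left_on="name", right_on="last_name",
                     suffixes=("_a", "_b"))

# Pairs matched by both branches appear twice; drop_duplicates removes them
# (this assumes the input frames themselves contain no duplicate rows).
candidates = pd.concat([eq_first, eq_last]).drop_duplicates()

# The remaining, non-equi conditions are applied as an ordinary filter.
result = candidates[
    ((candidates["id_a"] < 1000) | (candidates["id_b"] > 700))
    & (candidates["date_a"] < candidates["date_b"])
    & ((candidates["location"] == candidates["origin"])
       | (candidates["destination"] == "home"))
]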

0 Answers