
I was asked to do a merge between two dataframes. The first contains sales orders (1M rows) and the second contains discounts applied to those sales orders, each established according to a date range.

Both are joined by an ID (A975 ID), but I can't find an efficient way to do that merge without running out of memory.

My idea so far is to do an "outer" merge and, once that's done, filter the "Sales Order Date" column to the values that fall within the date range "DATAB" (start) to "DATBI" (end).

I was able to do it in PowerQuery with "SelectRows", but it takes too long. If I can replicate the same procedure in Python and save a few minutes of processing, that would be more than enough.

I know that an outer join generates tons of duplicated rows with only the date changed, but I don't know what else to do.

[Image: Table to merge]

Ravindra S
  • See if any of the suggestions (in the comment section) here help: https://stackoverflow.com/questions/40860457/improve-pandas-merge-performance – teedak8s Jun 24 '22 at 05:36
  • You can also investigate the dask module, if you are allowed to use one in your procedures. – NoobVB Jun 24 '22 at 08:00
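Following up on the Dask comment above: a minimal sketch of how the merge-then-filter could be run out-of-core with dask.dataframe. It assumes the frame and column names used in the answer below (all_together, A904, ID A904, ERDAT, DATAB, DATBI); adjust to the real schema.

    import dask.dataframe as dd

    # Convert both pandas frames to partitioned Dask frames so the merge
    # is evaluated lazily, partition by partition, instead of all at once.
    orders = dd.from_pandas(all_together, npartitions=16)
    discounts = dd.from_pandas(A904, npartitions=4)

    result = (
        dd.merge(orders, discounts, how="inner", on="ID A904")
          .query("DATAB <= ERDAT <= DATBI")
          .compute()  # materialize back into a regular pandas DataFrame
    )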

1 Answer


The solution was a merge followed by DataFrame.query:

import pandas as pd

# Outer-merge on the shared key, then keep only the rows whose order
# date (ERDAT) falls inside the discount validity range DATAB-DATBI.
df = pd.merge(all_together, A904, how="outer", on="ID A904")\
         .query("DATAB <= ERDAT <= DATBI")
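One memory note: the date filter discards every row where DATAB, ERDAT, or DATBI is missing, so the unmatched rows an outer join adds never survive the query; how="inner" therefore yields the same final result with a much smaller intermediate. If even the inner merge is too large to hold at once, here is a sketch of processing the orders frame in slices (same names as above; the slice count of 20 is an arbitrary assumption):

    import numpy as np
    import pandas as pd

    # Merge and filter the big orders frame one slice at a time so the
    # intermediate merged frame never holds all duplicated rows at once.
    pieces = []
    for part in np.array_split(all_together, 20):  # ~50k rows per slice
        merged = part.merge(A904, how="inner", on="ID A904")
        pieces.append(merged.query("DATAB <= ERDAT <= DATBI"))

    df = pd.concat(pieces, ignore_index=True)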