1

I am trying to use python polars over pandas sql for a large dataframe as I am running into memory errors. There are two where conditions that are utilized in this dataframe but can't get the syntax right.

Here is what the data looks like:

Key Field DateColumn
1234 Plumb 2020-02-01
1234 Plumb 2020-03-01
1234 Pear 2020-04-01
import pandas as pd
import datetime as dt
import pandasql as ps

d = {'Key':
          [1234,    1234,   1234,   1234,   1234,   1234,   1234,   

1234,   1234,   1234,   1234,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   2456,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754,   3754],
         'Field':[
              "Plumb",  "Plumb",    "Pear", "Plumb",    "Orange",   "Pear", "Plumb",    "Plumb",    "Pear", "Apple",    "Plumb",    "Orange",   "Orange",   "Apple",    "Apple",    "Pear", "Apple",    "Plumb",    "Plumb",    "Orange",   "Orange",   "Pear", "Plumb",    "Pear", "Plumb",    "Pear", "Apple",    "Plumb",    "Orange",   "Pear", "Apple",    "Pear", "Apple"],
         'DateColumn':[
              '2020-02-01', '2020-03-01',   '2020-04-01',   '2020-05-01',   '2020-06-01',   '2020-07-01',   '2020-08-01',   '2020-09-01',   '2020-10-01',   '2020-11-01',   '2020-12-01',   '2020-02-01',   '2020-03-01',   '2020-04-01',   '2020-05-01',   '2020-06-01',   '2020-07-01',   '2020-08-01',   '2020-09-01',   '2020-10-01',   '2020-11-01',   '2020-12-01',   '2020-02-01',   '2020-03-01',   '2020-04-01',   '2020-05-01',   '2020-06-01',   '2020-07-01',   '2020-08-01',   '2020-09-01',   '2020-10-01',   '2020-11-01',   '2020-12-01'

     ]}

df = pd.DataFrame(data=d)

df['DateColumn'] = pd.to_datetime(df['DateColumn'])
df['PreviousMonth'] = df['DateColumn'] - pd.DateOffset(months=1)


df_output = ps.sqldf("""
select
a.Key
,a.Field
,b.Field as PreviousField
,a.DateColumn
,b.DateColumn as PreviousDate

from df as a, df as b
where a.Key = b.Key
and b.DateColumn = a.PreviousMonth

""")

print(df_output.head())

    Key   Field                  DateColumn                PreviousDate
0  1234   Plumb  2020-03-01 00:00:00.000000  2020-02-01 00:00:00.000000
1  1234    Pear  2020-04-01 00:00:00.000000  2020-03-01 00:00:00.000000
2  1234   Plumb  2020-05-01 00:00:00.000000  2020-04-01 00:00:00.000000
3  1234  Orange  2020-06-01 00:00:00.000000  2020-05-01 00:00:00.000000
4  1234    Pear  2020-07-01 00:00:00.000000  2020-06-01 00:00:00.000000

I have tried to do

data_output = df.join(df, left_on='Key', right_on='Key')

But unable to find a good example on how to put the two conditions on the join condition.

Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
Drthm1456
  • 409
  • 9
  • 17
  • Look for the section **Merging on multiple columns** in [this answer](https://stackoverflow.com/a/53645883/9274732) – Ben.T Feb 07 '23 at 19:39
  • 1
    Please create a copy/pastable df definition with sample data as well as an example of the result you're looking for. – Dean MacGregor Feb 07 '23 at 19:41

1 Answers1

2

Let's accomplish everything using Polars.

import polars as pl
df = (
    pl.DataFrame(d)
    .with_columns(
        pl.col('DateColumn').str.strptime(pl.Date)
    )
)

(
    df
    .join(
        df
        .with_columns(pl.col('DateColumn').alias('PreviousDate'))
        .rename({'Field': 'PreviousField'}),
        left_on=['Key', 'DateColumn'],
        right_on=['Key', pl.col('DateColumn').dt.offset_by('1mo')],
        how="inner"
    )
)
shape: (30, 5)
┌──────┬────────┬────────────┬───────────────┬──────────────┐
│ Key  ┆ Field  ┆ DateColumn ┆ PreviousField ┆ PreviousDate │
│ ---  ┆ ---    ┆ ---        ┆ ---           ┆ ---          │
│ i64  ┆ str    ┆ date       ┆ str           ┆ date         │
╞══════╪════════╪════════════╪═══════════════╪══════════════╡
│ 1234 ┆ Plumb  ┆ 2020-03-01 ┆ Plumb         ┆ 2020-02-01   │
│ 1234 ┆ Pear   ┆ 2020-04-01 ┆ Plumb         ┆ 2020-03-01   │
│ 1234 ┆ Plumb  ┆ 2020-05-01 ┆ Pear          ┆ 2020-04-01   │
│ 1234 ┆ Orange ┆ 2020-06-01 ┆ Plumb         ┆ 2020-05-01   │
│ 1234 ┆ Pear   ┆ 2020-07-01 ┆ Orange        ┆ 2020-06-01   │
│ 1234 ┆ Plumb  ┆ 2020-08-01 ┆ Pear          ┆ 2020-07-01   │
│ 1234 ┆ Plumb  ┆ 2020-09-01 ┆ Plumb         ┆ 2020-08-01   │
│ 1234 ┆ Pear   ┆ 2020-10-01 ┆ Plumb         ┆ 2020-09-01   │
│ 1234 ┆ Apple  ┆ 2020-11-01 ┆ Pear          ┆ 2020-10-01   │
│ 1234 ┆ Plumb  ┆ 2020-12-01 ┆ Apple         ┆ 2020-11-01   │
│ 2456 ┆ Orange ┆ 2020-03-01 ┆ Orange        ┆ 2020-02-01   │
│ 2456 ┆ Apple  ┆ 2020-04-01 ┆ Orange        ┆ 2020-03-01   │
│ 2456 ┆ Apple  ┆ 2020-05-01 ┆ Apple         ┆ 2020-04-01   │
│ 2456 ┆ Pear   ┆ 2020-06-01 ┆ Apple         ┆ 2020-05-01   │
│ 2456 ┆ Apple  ┆ 2020-07-01 ┆ Pear          ┆ 2020-06-01   │
│ 2456 ┆ Plumb  ┆ 2020-08-01 ┆ Apple         ┆ 2020-07-01   │
│ 2456 ┆ Plumb  ┆ 2020-09-01 ┆ Plumb         ┆ 2020-08-01   │
│ 2456 ┆ Orange ┆ 2020-10-01 ┆ Plumb         ┆ 2020-09-01   │
│ 2456 ┆ Orange ┆ 2020-11-01 ┆ Orange        ┆ 2020-10-01   │
│ 2456 ┆ Pear   ┆ 2020-12-01 ┆ Orange        ┆ 2020-11-01   │
│ 3754 ┆ Pear   ┆ 2020-03-01 ┆ Plumb         ┆ 2020-02-01   │
│ 3754 ┆ Plumb  ┆ 2020-04-01 ┆ Pear          ┆ 2020-03-01   │
│ 3754 ┆ Pear   ┆ 2020-05-01 ┆ Plumb         ┆ 2020-04-01   │
│ 3754 ┆ Apple  ┆ 2020-06-01 ┆ Pear          ┆ 2020-05-01   │
│ 3754 ┆ Plumb  ┆ 2020-07-01 ┆ Apple         ┆ 2020-06-01   │
│ 3754 ┆ Orange ┆ 2020-08-01 ┆ Plumb         ┆ 2020-07-01   │
│ 3754 ┆ Pear   ┆ 2020-09-01 ┆ Orange        ┆ 2020-08-01   │
│ 3754 ┆ Apple  ┆ 2020-10-01 ┆ Pear          ┆ 2020-09-01   │
│ 3754 ┆ Pear   ┆ 2020-11-01 ┆ Apple         ┆ 2020-10-01   │
│ 3754 ┆ Apple  ┆ 2020-12-01 ┆ Pear          ┆ 2020-11-01   │
└──────┴────────┴────────────┴───────────────┴──────────────┘

Note that we use an Expression in the right_on columns to generate our offset date column on-the-fly.