I am trying to use python polars over pandas sql for a large dataframe as I am running into memory errors. There are two where conditions that are utilized in this dataframe but can't get the syntax right.
Here is what the data looks like:
Key | Field | DateColumn |
---|---|---|
1234 | Plumb | 2020-02-01 |
1234 | Plumb | 2020-03-01 |
1234 | Pear | 2020-04-01 |
import pandas as pd
import datetime as dt
import pandasql as ps
d = {'Key':
[1234, 1234, 1234, 1234, 1234, 1234, 1234,
1234, 1234, 1234, 1234, 2456, 2456, 2456, 2456, 2456, 2456, 2456, 2456, 2456, 2456, 2456, 3754, 3754, 3754, 3754, 3754, 3754, 3754, 3754, 3754, 3754, 3754],
'Field':[
"Plumb", "Plumb", "Pear", "Plumb", "Orange", "Pear", "Plumb", "Plumb", "Pear", "Apple", "Plumb", "Orange", "Orange", "Apple", "Apple", "Pear", "Apple", "Plumb", "Plumb", "Orange", "Orange", "Pear", "Plumb", "Pear", "Plumb", "Pear", "Apple", "Plumb", "Orange", "Pear", "Apple", "Pear", "Apple"],
'DateColumn':[
'2020-02-01', '2020-03-01', '2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01', '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01', '2020-02-01', '2020-03-01', '2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01', '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01', '2020-02-01', '2020-03-01', '2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01', '2020-08-01', '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01'
]}
df = pd.DataFrame(data=d)
df['DateColumn'] = pd.to_datetime(df['DateColumn'])
df['PreviousMonth'] = df['DateColumn'] - pd.DateOffset(months=1)
df_output = ps.sqldf("""
select
a.Key
,a.Field
,b.Field as PreviousField
,a.DateColumn
,b.DateColumn as PreviousDate
from df as a, df as b
where a.Key = b.Key
and b.DateColumn = a.PreviousMonth
""")
print(df_output.head())
Key Field DateColumn PreviousDate
0 1234 Plumb 2020-03-01 00:00:00.000000 2020-02-01 00:00:00.000000
1 1234 Pear 2020-04-01 00:00:00.000000 2020-03-01 00:00:00.000000
2 1234 Plumb 2020-05-01 00:00:00.000000 2020-04-01 00:00:00.000000
3 1234 Orange 2020-06-01 00:00:00.000000 2020-05-01 00:00:00.000000
4 1234 Pear 2020-07-01 00:00:00.000000 2020-06-01 00:00:00.000000
I have tried to do
data_output = df.join(df, left_on='Key', right_on='Key')
But unable to find a good example on how to put the two conditions on the join condition.