3

I recently started using python at work instead of SQL. I don't have access to any servers, nor will IT allow me to install a local instance of SQL SERVER EXPRESS. The datasets are too large to use Excel or Access.

I'm using my limited knowledge of python as a workaround. I've successfully analyzed other datasets, but this one presents a new challenge. I have a very large dataset (25 mill+ rows) and I would like to perform the equivalent to this: IN T-SQL:

    SELECT
     A.TransactionDate,
     B.StartDate,
     B.EndDate,
     B.RefValue
    FROM table1 AS A
    LEFT JOIN table2 AS B 
     ON A.TransactionDate BETWEEN B.StartDate AND B.EndDate

OR

ON A.Date1 >= B.StartDate AND A.Date <= B.EndDate

Sample Data:

import pandas as pd
df1 = pd.DataFrame(
    {
        'TranDate': ['2018/01/05', '2018/02/07', '2018/03/24']
        ,'Loc': [4000, 5000, 4200]
    }
)

df2 = pd.DataFrame(
    {
        'StartDate': ['2018/01/01', '2018/02/01', '2018/03/01']
        ,'EndDate': ['2018/01/31','2018/02/28', '2018/03/30']
        ,'PP': ['01', '02', '03']
    }
)

Expected Dataframe:

dfR = pd.DataFrame(
    {
        'TranDate': ['2018/01/05', '2018/02/07', '2018/03/24']
        ,'Loc': [4000, 5000, 4200]
        ,'PP': ['01', '02', '03']
    }
)

I tried using the pandas.merge() statement, but there is no option for an inequality.

How can I recreate the previous statement in python?

Erick 3E
  • 115
  • 2
  • 13
  • For this question it is crucial to include some example data and it would be most optimal if you would also include expected output example. – Erfan Mar 11 '19 at 20:29
  • Thank you, I will update to include an example – Erick 3E Mar 11 '19 at 20:30
  • 1
    Note: with example I mean a small dataframe created which we can copy and run ourselves. See more here: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Erfan Mar 11 '19 at 20:31
  • Thank you! I've updated per your suggestion – Erick 3E Mar 11 '19 at 21:02
  • Thanks to Parfait, who pointed me to a very similar question. There, I found how to use SQL inside python. Might not be the best approach, but I am much more comfortable with SQL, so I used that to solve my problem :) – Erick 3E Mar 12 '19 at 18:23

1 Answers1

0

Well, you could start by trying this:

dfR = df1.copy()
dfR.loc[dfR['TranDate'].between(df2['StartDate'], df2['EndDate']), 'PP'] = df2['PP']

Maybe this will work for you.

Victor Valente
  • 761
  • 9
  • 24