1

I have 2 CSV files, as below.

  1. I want a new column Difference, where...
    • if a mobile number appears within the date range of Book_date...App_date: Difference = difference App_date and Occur_date
    • or NaN if it doesn't occur in that date range.
  2. I also want to filter it based on a unique category and mobile_number

csv_1

Mobile_Number    Book_Date       App_Date

503477334    2018-10-12       2018-10-18
506002884    2018-10-12       2018-10-19
501022162    2018-10-12       2018-10-16
503487338    2018-10-13       2018-10-13
506012887    2018-10-13       2018-10-21
503427339    2018-10-14       2018-10-17

csv_2

Mobile_Number    Occur_Date    

503477334        2018-10-16
506002884        2018-10-21
501022162        2018-10-15
503487338        2018-10-13
501428449        2018-10-18
506012887        2018-10-14

I want a new column in csv_1, where if a mobile number appears within the date range of Book_date and App_date in csv_2, the difference between App_date and the Occur_date or NaN if it doesn't occur in that date range. The output should be

Output

Mobile_Number    Book_Date       App_Date   Difference

503477334    2018-10-12       2018-10-18       2
506002884    2018-10-12       2018-10-19      -2
501022162    2018-10-12       2018-10-16       1
503487338    2018-10-13       2018-10-13       0
506012887    2018-10-13       2018-10-21       7 
503427339    2018-10-14       2018-10-17       NaN

EDIT:

If I want to filter it based on a unique category and mobile_number on the above two csv files. How to do the same?

csv_1

Category     Mobile_Number   Book_Date       App_Date

A              503477334    2018-10-12       2018-10-18
B              503477334    2018-10-07       2018-10-16
C              501022162    2018-10-12       2018-10-16
A              503487338    2018-10-13       2018-10-13
C              506012887    2018-10-13       2018-10-21
E              503427339    2018-10-14       2018-10-17

csv_2

Category     Mobile_Number    Occur_Date    

A              503477334        2018-10-16
B              503477334        2018-10-13
A              501022162        2018-10-15
A              503487338        2018-10-13
F              501428449        2018-10-18
C              506012887        2018-10-14

I want the output to be filtered based on the Mobile_Number and the Category

Output

Category     Mobile_Number    Book_Date       App_Date   Difference

A              503477334    2018-10-12       2018-10-18       2
B              503477334    2018-10-07       2018-10-16       3
C              501022162    2018-10-12       2018-10-16       NaN
A              503487338    2018-10-13       2018-10-13       0
C              506012887    2018-10-13       2018-10-21       7 
E              503427339    2018-10-14       2018-10-17       NaN
smci
  • 32,567
  • 20
  • 113
  • 146
dpacman
  • 3,683
  • 2
  • 20
  • 35
  • Can you post the output of `csv1.to_dict()` and `csv2.to_dict()` so we can copy and paste them? – ignoring_gravity Dec 23 '19 at 14:13
  • 1
    pandas has `Series.between()` operator. See the 1229 existing questions [\[pandas\] between date](https://stackoverflow.com/search?q=%5Bpandas%5D+between+date+is%3Aq). Also, it generally helps to convert your datetime columns into datetimes when you read them in or right after, not much use having them as string. – smci Dec 24 '19 at 10:49
  • Sure, thanks @smci, jezrael has mentioned that in the answer below. But it considers only the mobile_number as the unique identifier. What if I want to filter it based on the mobile_number and a category. – dpacman Dec 24 '19 at 10:55
  • 1
    dPac it's hard to follow your question, it's sprinkled in multiple pieces between chunks of data, could you rewrite to state the question in the first paragraph? Presumably you first join `csv_1, _2` on `Mobile_Number`, then filter `.between('Book_date'...'App_date')`. But where in that sequence do you want to filter by Category? And it's confusing because you say "filter it based on a unique category", but your current output has multiple results for Category==A','C', for different `Book_Date,App_Date` values. Also, what is `Category`, where did it come from?... – smci Dec 24 '19 at 11:03
  • 1
    ...Did you just assign arbitrary `Category` values to your intermediate results (e.g. distinct combinations of `App_Date, Book_Date`??) or did it come from somewhere else? Anyway please edit your question to restate, it's unclear. Hence hard to find existing duplicate/related questions. – smci Dec 24 '19 at 11:05
  • @smci, Initially the join between `csv_1, _2` was made on the `Mobile_Number`, Now I want to join `csv_1, _2` on the `Mobile_Number` and `Category` and then filter `.between('Book_date'...'App_date')`. Sorry for the confusion caused. – dpacman Dec 24 '19 at 11:10
  • But what is `Category`, where did it come from? Not the csv files(?), so from where? – smci Dec 24 '19 at 11:27
  • 1
    I tried to edit it to state the question clearly, and at the top. a) Please try to be consistent whether you mean 'time range', 'date range', 'datetime range'. b) We still don't know where `Category` came from, did it come from another file, or was it just some default assigned to your temporary results? c) It's distracting when you keep referring to *"columns from csv_2" ... "create a new column in csv_1"*. Why don't you just join your data into one dataframe `df` at the start? (You can always write out separate sets of columns to separate CSV files with `to_csv(..., columns)`) – smci Dec 24 '19 at 11:50
  • @smci, thank you for helping. Jezrael added the answer I was looking for. Cheers!! – dpacman Dec 24 '19 at 12:03
  • 1
    ...but please tell us where `Category` comes from?! – smci Dec 24 '19 at 12:09
  • `Category` was present initially on both `csv_1` and `csv_2`, and I didn't want to use it when I first posted the question. But, the problem I was trying to solve needed the inclusion of `Category` for better accuracy, which is why I edited the question to include it. – dpacman Dec 24 '19 at 12:23
  • 1
    Related: [Select DataFrame rows between two dates](https://stackoverflow.com/questions/29370057/select-dataframe-rows-between-two-dates) – smci Dec 24 '19 at 13:02

1 Answers1

2

Use Series.map for new Series matched by Mobile_Number and for test values between columns use Series.between, then assign values by mask with numpy.where:

df1['Book_Date'] = pd.to_datetime(df1['Book_Date'])
df1['App_Date'] = pd.to_datetime(df1['App_Date'])
df2['Occur_Date'] = pd.to_datetime(df2['Occur_Date'])

s1 = df2.drop_duplicates('Mobile_Number').set_index('Mobile_Number')['Occur_Date']
s2 = df1['Mobile_Number'].map(s1)

m = s2.between(df1['Book_Date'], df1['App_Date'])

#solution with no mask
df1['Difference1'] = df1['App_Date'].sub(s2).dt.days
#solution with test between
df1['Difference2'] = np.where(m, df1['App_Date'].sub(s2).dt.days, np.nan)
print (df1)
   Mobile_Number  Book_Date   App_Date Difference  Difference1  Difference2
0      503477334 2018-10-12 2018-10-18 2018-10-16          2.0          2.0
1      506002884 2018-10-12 2018-10-19 2018-10-21         -2.0          NaN
2      501022162 2018-10-12 2018-10-16 2018-10-15          1.0          1.0
3      503487338 2018-10-13 2018-10-13 2018-10-13          0.0          0.0
4      506012887 2018-10-13 2018-10-21 2018-10-14          7.0          7.0
5      503427339 2018-10-14 2018-10-17        NaT          NaN          NaN

EDIT:

You can use merge instead map for join by 2 columns:

df1['Book_Date'] = pd.to_datetime(df1['Book_Date'])
df1['App_Date'] = pd.to_datetime(df1['App_Date'])
df2['Occur_Date'] = pd.to_datetime(df2['Occur_Date'])

df3 = df1.merge(df2, on=['Category','Mobile_Number'], how='left')
print (df3)
  Category  Mobile_Number  Book_Date   App_Date Occur_Date
0        A      503477334 2018-10-12 2018-10-18 2018-10-16
1        B      503477334 2018-10-07 2018-10-16 2018-10-13
2        C      501022162 2018-10-12 2018-10-16        NaT
3        A      503487338 2018-10-13 2018-10-13 2018-10-13
4        C      506012887 2018-10-13 2018-10-21 2018-10-14
5        E      503427339 2018-10-14 2018-10-17        NaT

m = df3['Occur_Date'].between(df3['Book_Date'], df3['App_Date'])
#print (m)

df3['Difference2'] = np.where(m, df3['App_Date'].sub(df3['Occur_Date']).dt.days, np.nan)
print (df3)
  Category  Mobile_Number  Book_Date   App_Date Occur_Date  Difference2
0        A      503477334 2018-10-12 2018-10-18 2018-10-16          2.0
1        B      503477334 2018-10-07 2018-10-16 2018-10-13          3.0
2        C      501022162 2018-10-12 2018-10-16        NaT          NaN
3        A      503487338 2018-10-13 2018-10-13 2018-10-13          0.0
4        C      506012887 2018-10-13 2018-10-21 2018-10-14          7.0
5        E      503427339 2018-10-14 2018-10-17        NaT          NaN
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252