4

I'd like to filter for customer_id's that were not present in the previous data so all new_customer_ids that were new on the 2020-01-10 and not present on the 2020-01-01

Main_df

date          customer_id   amount_spent 
2020-01-01    24            123
2020-01-10    24            145
2020-01-01    58             89
2020-01-10    58             67
2020-01-01    98             34
2020-01-10    99             86
2020-01-10    67            140
2020-01-10    32            321
2020-01-10    75             76

Output_df

new_customer_id  amount_spent 
32           321
75            76
67           140

I have tried to use the shift function in Pandas but this did not work for me

EDIT

df = pd.DataFrame([["2020-01-01",24,123],
["2020-01-10",24,145],
["2020-01-01",58,89],
["2020-01-10",58,67],
["2020-01-01",98,34],
["2020-01-10",98,86],
["2020-01-10",67,140],
["2020-01-10",32,321],
["2020-01-10",75,76]],columns = ["date","customer_id","amount_spent" ])
Let's try
  • 1,044
  • 9
  • 20
user12625679
  • 676
  • 8
  • 23
  • Does this answer your question? [Filtering Pandas DataFrames on dates](https://stackoverflow.com/questions/22898824/filtering-pandas-dataframes-on-dates) – Carlo Zanocco Aug 25 '20 at 07:48
  • No as I only need to filter who the rows that were not present in the previous date – user12625679 Aug 25 '20 at 07:53

5 Answers5

1

IIUC you can get the customer_id present in 2020-01-01, then filter them out:

s = df.loc[df["date"]=="2020-01-01", "customer_id"]

print (df[~df["customer_id"].isin(s)])

         date  customer_id  amount_spent
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76
Henry Yik
  • 22,275
  • 4
  • 18
  • 40
0

Here is another solution,

import numpy as np

mask = df.groupby('customer_id').transform(np.size).eq(1)

    date  amount_spent
0  False         False
1  False         False
2  False         False
3  False         False
4   True          True
5   True          True
6   True          True
7   True          True
8   True          True

df[mask['date'] & df.date.eq('2020-01-10')]

         date  customer_id  amount_spent
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76
sushanth
  • 8,275
  • 3
  • 17
  • 28
0

Assuming you have a typo in your example (99 is 98). You can do the following:

df = pd.DataFrame([["2020-01-01",24,123],
["2020-01-10",24,145],
["2020-01-01",58,89],
["2020-01-10",58,67],
["2020-01-01",98,34],
["2020-01-10",98,86],
["2020-01-10",67,140],
["2020-01-10",32,321],
["2020-01-10",75,76]],columns = ["date","customer_id","amount_spent" ])

df["order"] = df.groupby("customer_id").cumcount()

df[(df["date"] == "2020-01-10") & (df["order_x"]==0)]

Output:

    date        customer_id amount_spent    order_x order_y
6   2020-01-10  67          140             0       0
7   2020-01-10  32          321             0       0
8   2020-01-10  75          76              0       0

This will need to be edited depending on the complexity of your df

Let's try
  • 1,044
  • 9
  • 20
0

This does what you noted. Not sure your sample data and output is as you thought. I changed customer 99 to 98 for 2020-01-10

  1. create a mask which are rows before / after your required date
  2. select rows after and including cutover date, minus customers that are present before cutover date isin()
import datetime as dt
df = pd.read_csv(io.StringIO("""date          customer_id   amount_spent 
2020-01-01    24            123
2020-01-10    24            145
2020-01-01    58             89
2020-01-10    58             67
2020-01-01    98             34
2020-01-10    98             86
2020-01-10    67            140
2020-01-10    32            321
2020-01-10    75             76"""), sep="\s+")

df["date"] = pd.to_datetime(df["date"])
mask = df["date"] < dt.datetime(2020,1,10)
dfnew = df[~mask & ~df["customer_id"].isin(df.loc[mask,"customer_id"])].groupby("customer_id").sum()

print(dfnew.to_string())

output

             amount_spent
customer_id              
32                    321
67                    140
75                     76
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
0

If you are looking for universal method this could be a solution:

df = pd.DataFrame({
    'date':['2020-01-01','2020-01-10','2020-01-01','2020-01-10','2020-01-01','2020-01-10','2020-01-10','2020-01-10','2020-01-10'],
    'customer_id':[24,24,58,58,98,99,67,32,75],
    'amount_spent':[123,145,89,67,34,86,140,321,76]
})
print(df)
         date  customer_id  amount_spent
0  2020-01-01           24           123
1  2020-01-10           24           145
2  2020-01-01           58            89
3  2020-01-10           58            67
4  2020-01-01           98            34
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76

You are looking for two last dates, because your dataset could looks different, and you dont know what date are you looking for. So now you should find two last dates.

df=df.sort_values(by='date')
take_last_dates = df.drop_duplicates(subset='date').sort_values(by='date')
take_last_dates = take_last_dates.date.tolist()
print(take_last_dates)
['2020-01-01', '2020-01-10']

Now you need to create two DF's , for those two dates, to see the differences in customers:

df_prev = df[
    df.date==take_last_dates[0]
]
print(df_prev)
         date  customer_id  amount_spent
0  2020-01-01           24           123
2  2020-01-01           58            89
4  2020-01-01           98            34
df_current = df[
    df.date==take_last_dates[1]
]

print(df_current)
         date  customer_id  amount_spent
1  2020-01-10           24           145
3  2020-01-10           58            67
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76

So at the end you can get your result by using those two df's :

new_customers = df_current[
    ~df_current.customer_id.isin(df_prev.customer_id.tolist())
]

print(new_customers)
         date  customer_id  amount_spent
5  2020-01-10           99            86
6  2020-01-10           67           140
7  2020-01-10           32           321
8  2020-01-10           75            76
sygneto
  • 1,761
  • 1
  • 13
  • 26