python pandas find transacctions of the same amount

Question

I have a pandas dataframe as below.

d = {'emp': ['a', 'a', 'a', 'b', 'b', 'b'], 'vendor': ['x', 'x', 'y', 'z', 'z', 'z'], 'date': [1,1,2,3,3,3], 'amount': [4.9, 4.8, 1, 6, 5.6, 5.4]}
df = pd.DataFrame(data=d)

df["rounds"]=np.ceil(df['amount'])
df 

   amount  date emp vendor  rounds
0     4.9     1   a      x     5.0
1     4.8     1   a      x     5.0
2     1.0     2   a      y     1.0
3     6.0     3   b      z     6.0
4     5.6     3   b      z     6.0
5     5.4     3   b      z     6.0

I want to create the example column which would have a unique number if the same emp has spent the same amount (column rounds) at the same vendor on the same day.

an employee could have multiple transactions matching this criteria or they could have 0 transactions matching this criteria

how could i proceed?

when a number is same in the example column, it indicates that all transactions that fall in one group

update 1

another example

if my dataframe is like below

d = {'emp': ['a', 'a', 'a', 'a', 'b', 'b'], 'vendor': ['x', 'x', 'y', 'y', 'z', 'z'], 'date': [1,1,2,2,3,3], 'amount': [4.9, 4.8, 1, 1, 5.6, 5.4]}

then column example should have values '1,1,2,2,3,3'

Why 1 and 2 instead of 2 and 3 for 2 duplicate transactions and 3 duplicate transactions? — user3483203, Apr 17 '18 at 17:28
1 and 2 are numbers to indicate similar transactions...all the transaction where there is 1 are in 1 group etc... — Ni_Tempe, Apr 17 '18 at 17:29
So even though there are 2 identical transactions by employee a on day 1, you want the number to be 1? — user3483203, Apr 17 '18 at 17:31
@chrisz It can be any number. The main condition is that a number indicates all transactions that fall in one group — Ni_Tempe, Apr 17 '18 at 17:34
@RafaelC i felt that sql might be faster...but that is not a requirement — Ni_Tempe, Apr 17 '18 at 17:34

BENY · Accepted Answer · 2018-04-17T18:54:03.427

4

duplicated call within the transform, duplicated will find how many duplicate item for vendor you have for each group, then we just add them up, should achieve what you need , transform you can check the link

df.groupby(['date','emp','rounds']).vendor.transform(lambda x : x.duplicated().sum())
Out[806]: 
0    1
1    1
2    0
3    2
4    2
5    2
Name: vendor, dtype: int64

After assign it back

df['example']=df.groupby(['date','emp','rounds']).vendor.transform(lambda x : x.duplicated().sum())
#yield
df
Out[808]: 
   amount  date emp vendor  rounds  example
0     4.9     1   a      x     5.0        1
1     4.8     1   a      x     5.0        1
2     1.0     2   a      y     1.0        0
3     6.0     3   b      z     6.0        2
4     5.6     3   b      z     6.0        2
5     5.4     3   b      z     6.0        2

Update :

df['key']=tuple(zip(df.emp,df.vendor,df.date,df.rounds)) 
df[df.duplicated('key',keep=False)].groupby('key').ngroup()+1
Out[831]: 
0    1
1    1
2    2
3    2
4    3
5    3
dtype: int64

edited Apr 17 '18 at 18:54

answered Apr 17 '18 at 17:53

BENY

317,841
20
164
234

1

Only one idea, assign to column :) – jezrael Apr 17 '18 at 17:58
You also have to add `rounds` to your groupby. – floydn Apr 17 '18 at 17:58
1

Best solution by far – rafaelc Apr 17 '18 at 17:58
@wen could you explain your approach? and add `rounds` in your solution groupby list? – Ni_Tempe Apr 17 '18 at 18:02
would this require data to be sorted by columns `emp, vendor, date, amounts, rounds`? – Ni_Tempe Apr 17 '18 at 18:08
it is not working as intended...i have updated my question – Ni_Tempe Apr 17 '18 at 18:19
the first line should be `df['key']=tuple(zip(df.emp,df.vendor,df.date,df.rounds)) ` please update so that i can accept your answer – Ni_Tempe Apr 17 '18 at 18:51
@Ni_Tempe done , haha always forget your rounds column – BENY Apr 17 '18 at 18:54

Bubble Bubble Bubble Gut · Answer 2 · 2018-04-17T17:47:15.323

2

Not sure how to do it in SQL, but using Pandas:

# dp is a list of boolean showing whether the elements have duplicates
# setting keep=False returns all the duplicated elements as True
dp = df.duplicated(['date', 'emp', 'rounds'], keep=False)

# dp.dff().ne() means if the next element is no longer a duplicate
# then it's true, cumsum just increments the id
df['example'] = dp.diff().ne(0).cumsum()

df

   amount  date emp vendor  rounds  example
0     4.9     1   a      x     5.0        1
1     4.8     1   a      x     5.0        1
2     1.0     2   a      y     1.0        2
3     6.0     3   b      z     6.0        3
4     5.6     3   b      z     6.0        3
5     5.4     3   b      z     6.0        3

edited Apr 17 '18 at 17:47

answered Apr 17 '18 at 17:40

Bubble Bubble Bubble Gut

3,280
12
28

is there a way to remove 2 associated with the third row from the example column? There is only transaction where we have 2 and i don't want to flag such transactions – Ni_Tempe Apr 17 '18 at 17:42
and please briefly explain your logic for the entire solution so that i can learn from it..thanks – Ni_Tempe Apr 17 '18 at 17:43
@Ni_Tempe Do you still want to keep the row and set it as NaN? – Bubble Bubble Bubble Gut Apr 17 '18 at 17:43
@Ni_Tempe If you want the row set to NaN, I think RafaelC 's way using groupby is more intuitive. – Bubble Bubble Bubble Gut Apr 17 '18 at 17:49
it is not working as intended...i have updated my question – Ni_Tempe Apr 17 '18 at 18:19

rafaelc · Answer 3 · 2018-04-17T17:59:19.957

1

You could do

df["tmp"] = df.duplicated(subset = ["date", "amp", "rounds"])
df2 = df.set_index(["date", "emp"])
df2["example"]= df.groupby(["date", "emp"])["tmp"].sum()
df2 = df2.drop("tmp", 1).reset_index()

   date emp  amount vendor  rounds  example
0     1   a     4.9      x     5.0      1.0
1     1   a     4.8      x     5.0      1.0
2     2   a     1.0      y     1.0      0.0
3     3   b     6.0      z     6.0      2.0
4     3   b     5.6      z     6.0      2.0
5     3   b     5.4      z     6.0      2.0

We start with your provided df. The first line

df["tmp"] = df.duplicated(subset = ["date", "amp", "rounds"])

creates a temporary column that indicates whether a row is duplicated or not, i.e.

   amount  date emp vendor  rounds    tmp
0     4.9     1   a      x     5.0  False
1     4.8     1   a      x     5.0   True
2     1.0     2   a      y     1.0  False
3     6.0     3   b      z     6.0  False
4     5.6     3   b      z     6.0   True
5     5.4     3   b      z     6.0   True

The second and third lines

df2 = df.set_index(["date", "emp"])
df2["example"]= df.groupby(["date", "emp"])["tmp"].sum()

Just assign the sum of the groups of date and emp to column example. It does this by setting the indexes to be the same.

When two data frames have the same index, you can assign values of one to the other even if the number of rows is different: pandas will infer that values should duplicated as long as indexes are the same.

The last column just drops the tmp.

edited Apr 17 '18 at 17:59

answered Apr 17 '18 at 17:43

rafaelc

57,686
15
58
82

please explain your logic briefly – Ni_Tempe Apr 17 '18 at 17:45
1) in case of the line `df["tmp"] = df.duplicated(subset = ["date", "vendor", "rounds"])` why don't you have `emp` too? 2) would this solution require the data to be sorted by ascending order of columns `emp, vendor, date, amounts, rounds`? – Ni_Tempe Apr 17 '18 at 17:54
Actually, I mean to be `emp` instead of vendor :) I don't think it needs to be sorted. – rafaelc Apr 17 '18 at 17:58
i feel that all 4 columns should be in the first statement! what are your thoughts? – Ni_Tempe Apr 17 '18 at 18:00
With the data you provided, the impact is the same. Both yield the same result. So yeah you could keep both, but it depends on your real dataset – rafaelc Apr 17 '18 at 18:02
it is not working as intended...i have updated my question – Ni_Tempe Apr 17 '18 at 18:19

score 1 · Answer 4 · answered Apr 17 '18 at 18:02

If you want a unique number associated with each grouping that is generated based on the values within that grouping, you can use a hashing function, like md5:

from hashlib import md5

cols = ['emp','rounds','vendor','date']

def get_unique_id(x):
    if len(x) > 1:
        unique_string = ''.join(x.astype(str).values.flatten()).encode('utf8')
        unique_hash = str(int(md5(unique_string).hexdigest(), 16))
        return unique_hash

# use groupby to get unique groupings, then compute group hash id
grouped = df.groupby(cols).apply(get_unique_id).reset_index()

# merge with original df
df.merge(grouped, how="left", on=cols).rename(columns={0:'example'})

   amount  date emp vendor  rounds                                  example
0     4.9     1   a      x     5.0  204218359197769197011002751402112550616
1     4.8     1   a      x     5.0  204218359197769197011002751402112550616
2     1.0     2   a      y     1.0                                     None
3     6.0     3   b      z     6.0   93359096665893457037712279020314596965
4     5.6     3   b      z     6.0   93359096665893457037712279020314596965
5     5.4     3   b      z     6.0   93359096665893457037712279020314596965

python pandas find transacctions of the same amount

4 Answers4