2

I have a pandas dataframe as below.

d = {'emp': ['a', 'a', 'a', 'b', 'b', 'b'], 'vendor': ['x', 'x', 'y', 'z', 'z', 'z'], 'date': [1,1,2,3,3,3], 'amount': [4.9, 4.8, 1, 6, 5.6, 5.4]}
df = pd.DataFrame(data=d)

df["rounds"]=np.ceil(df['amount'])
df 

   amount  date emp vendor  rounds
0     4.9     1   a      x     5.0
1     4.8     1   a      x     5.0
2     1.0     2   a      y     1.0
3     6.0     3   b      z     6.0
4     5.6     3   b      z     6.0
5     5.4     3   b      z     6.0

I want to create the example column which would have a unique number if the same emp has spent the same amount (column rounds) at the same vendor on the same day.

an employee could have multiple transactions matching this criteria or they could have 0 transactions matching this criteria

how could i proceed?

 example
    1
    1
    
    2
    2
    2

when a number is same in the example column, it indicates that all transactions that fall in one group

update 1

another example

if my dataframe is like below

d = {'emp': ['a', 'a', 'a', 'a', 'b', 'b'], 'vendor': ['x', 'x', 'y', 'y', 'z', 'z'], 'date': [1,1,2,2,3,3], 'amount': [4.9, 4.8, 1, 1, 5.6, 5.4]}

then column example should have values '1,1,2,2,3,3'

Community
  • 1
  • 1
Ni_Tempe
  • 307
  • 1
  • 6
  • 20

4 Answers4

4

duplicated call within the transform, duplicated will find how many duplicate item for vendor you have for each group, then we just add them up, should achieve what you need , transform you can check the link

df.groupby(['date','emp','rounds']).vendor.transform(lambda x : x.duplicated().sum())
Out[806]: 
0    1
1    1
2    0
3    2
4    2
5    2
Name: vendor, dtype: int64

After assign it back

df['example']=df.groupby(['date','emp','rounds']).vendor.transform(lambda x : x.duplicated().sum())
#yield
df
Out[808]: 
   amount  date emp vendor  rounds  example
0     4.9     1   a      x     5.0        1
1     4.8     1   a      x     5.0        1
2     1.0     2   a      y     1.0        0
3     6.0     3   b      z     6.0        2
4     5.6     3   b      z     6.0        2
5     5.4     3   b      z     6.0        2

Update :

df['key']=tuple(zip(df.emp,df.vendor,df.date,df.rounds)) 
df[df.duplicated('key',keep=False)].groupby('key').ngroup()+1
Out[831]: 
0    1
1    1
2    2
3    2
4    3
5    3
dtype: int64
BENY
  • 317,841
  • 20
  • 164
  • 234
2

Not sure how to do it in SQL, but using Pandas:

# dp is a list of boolean showing whether the elements have duplicates
# setting keep=False returns all the duplicated elements as True
dp = df.duplicated(['date', 'emp', 'rounds'], keep=False)

# dp.dff().ne() means if the next element is no longer a duplicate
# then it's true, cumsum just increments the id
df['example'] = dp.diff().ne(0).cumsum()

df

   amount  date emp vendor  rounds  example
0     4.9     1   a      x     5.0        1
1     4.8     1   a      x     5.0        1
2     1.0     2   a      y     1.0        2
3     6.0     3   b      z     6.0        3
4     5.6     3   b      z     6.0        3
5     5.4     3   b      z     6.0        3
1

You could do

df["tmp"] = df.duplicated(subset = ["date", "amp", "rounds"])
df2 = df.set_index(["date", "emp"])
df2["example"]= df.groupby(["date", "emp"])["tmp"].sum()
df2 = df2.drop("tmp", 1).reset_index()

   date emp  amount vendor  rounds  example
0     1   a     4.9      x     5.0      1.0
1     1   a     4.8      x     5.0      1.0
2     2   a     1.0      y     1.0      0.0
3     3   b     6.0      z     6.0      2.0
4     3   b     5.6      z     6.0      2.0
5     3   b     5.4      z     6.0      2.0

We start with your provided df. The first line

df["tmp"] = df.duplicated(subset = ["date", "amp", "rounds"])

creates a temporary column that indicates whether a row is duplicated or not, i.e.

   amount  date emp vendor  rounds    tmp
0     4.9     1   a      x     5.0  False
1     4.8     1   a      x     5.0   True
2     1.0     2   a      y     1.0  False
3     6.0     3   b      z     6.0  False
4     5.6     3   b      z     6.0   True
5     5.4     3   b      z     6.0   True

The second and third lines

df2 = df.set_index(["date", "emp"])
df2["example"]= df.groupby(["date", "emp"])["tmp"].sum()

Just assign the sum of the groups of date and emp to column example. It does this by setting the indexes to be the same.

When two data frames have the same index, you can assign values of one to the other even if the number of rows is different: pandas will infer that values should duplicated as long as indexes are the same.

The last column just drops the tmp.

rafaelc
  • 57,686
  • 15
  • 58
  • 82
  • please explain your logic briefly – Ni_Tempe Apr 17 '18 at 17:45
  • 1) in case of the line `df["tmp"] = df.duplicated(subset = ["date", "vendor", "rounds"])` why don't you have `emp` too? 2) would this solution require the data to be sorted by ascending order of columns `emp, vendor, date, amounts, rounds`? – Ni_Tempe Apr 17 '18 at 17:54
  • Actually, I mean to be `emp` instead of vendor :) I don't think it needs to be sorted. – rafaelc Apr 17 '18 at 17:58
  • i feel that all 4 columns should be in the first statement! what are your thoughts? – Ni_Tempe Apr 17 '18 at 18:00
  • With the data you provided, the impact is the same. Both yield the same result. So yeah you could keep both, but it depends on your real dataset – rafaelc Apr 17 '18 at 18:02
  • it is not working as intended...i have updated my question – Ni_Tempe Apr 17 '18 at 18:19
1

If you want a unique number associated with each grouping that is generated based on the values within that grouping, you can use a hashing function, like md5:

from hashlib import md5

cols = ['emp','rounds','vendor','date']

def get_unique_id(x):
    if len(x) > 1:
        unique_string = ''.join(x.astype(str).values.flatten()).encode('utf8')
        unique_hash = str(int(md5(unique_string).hexdigest(), 16))
        return unique_hash

# use groupby to get unique groupings, then compute group hash id
grouped = df.groupby(cols).apply(get_unique_id).reset_index()

# merge with original df
df.merge(grouped, how="left", on=cols).rename(columns={0:'example'})

   amount  date emp vendor  rounds                                  example
0     4.9     1   a      x     5.0  204218359197769197011002751402112550616
1     4.8     1   a      x     5.0  204218359197769197011002751402112550616
2     1.0     2   a      y     1.0                                     None
3     6.0     3   b      z     6.0   93359096665893457037712279020314596965
4     5.6     3   b      z     6.0   93359096665893457037712279020314596965
5     5.4     3   b      z     6.0   93359096665893457037712279020314596965
andrew_reece
  • 20,390
  • 3
  • 33
  • 58