I have a dataframe with a client_id column that I want to anonymize with no possibility of reversing it.

I want to delete client_id and add a new column that holds the same value for every row linked to the same client.

import pandas as pd

df = pd.DataFrame({
    'client_id':[111, 222, 111, 222, 333, 222, 111, 333], 
    'date':['2018-08-20', '2018-08-22', '2018-08-21', '2018-08-21', '2018-08-18', '2018-08-20', '2018-08-18', '2018-08-19'], 
    'action':['test1', 'test2', 'test3', 'test4', 'test5', 'test6', 'test7', 'test8']
    })

My dataframe:

client_id | date         | action |
----------------------------------
111       | '2018-08-20' | test1  |
222       | '2018-08-22' | test2  |
111       | '2018-08-21' | test3  |
222       | '2018-08-21' | test4  |
333       | '2018-08-18' | test5  |
222       | '2018-08-20' | test6  |
111       | '2018-08-18' | test7  |
333       | '2018-08-19' | test8  |

The result expected:

id | date         | action |
---------------------------
1  | '2018-08-20' | test1  |
2  | '2018-08-22' | test2  |
1  | '2018-08-21' | test3  |
2  | '2018-08-21' | test4  |
3  | '2018-08-18' | test5  |
2  | '2018-08-20' | test6  |
1  | '2018-08-18' | test7  |
3  | '2018-08-19' | test8  |

I tried to use pandas.core.groupby.DataFrameGroupBy.rank, but it did not produce the expected result:

df['id'] = df.groupby("client_id")["date"].rank(ascending=True)
Cœur
Omar14
  • `df['id'] = pd.factorize(df['client_id'])[0] + 1` ? – jezrael Aug 22 '18 at 13:43
  • @jezrael ranking implies an ordering. I think we can find another dup target. – piRSquared Aug 22 '18 at 13:48
  • @piRSquared - Check my solution in dupe :) – jezrael Aug 22 '18 at 13:48
  • Actually, your solution is incorrect. `pd.factorize` does not sort and will potentially produce results assigning `0` for a value of `30`, `1` for a value of `50` and `2` for a value `20` if the order in the original array made it so. You'd have to change your answer to include `sort=True` or use `np.unique`. In either case, the question is different. The dup assumes you want to rank an ordered array. These `client_id`s don't have to be ordered. They just want a one-to-one mapping. I know we've answered that before as well (-: – piRSquared Aug 22 '18 at 13:53
  • @piRSquared - OK, changed dupe. – jezrael Aug 22 '18 at 13:55

1 Answer

pandas.factorize

df.assign(client_id=df.client_id.factorize()[0] + 1)

  action  client_id        date
0  test1          1  2018-08-20
1  test2          2  2018-08-22
2  test3          1  2018-08-21
3  test4          2  2018-08-21
4  test5          3  2018-08-18
5  test6          2  2018-08-20
6  test7          1  2018-08-18
7  test8          3  2018-08-19
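
To get exactly the layout asked for in the question (a new `id` column, with `client_id` removed), the same `factorize` call can be combined with `drop`. This is only a minimal sketch on the question's sample data; the variable name `anonymized` is an illustrative choice, not something from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'client_id': [111, 222, 111, 222, 333, 222, 111, 333],
    'date': ['2018-08-20', '2018-08-22', '2018-08-21', '2018-08-21',
             '2018-08-18', '2018-08-20', '2018-08-18', '2018-08-19'],
    'action': ['test1', 'test2', 'test3', 'test4',
               'test5', 'test6', 'test7', 'test8'],
})

# factorize returns (codes, uniques); codes start at 0 and follow
# the order in which each client_id first appears.
codes, uniques = pd.factorize(df['client_id'])

# Drop the original identifier and put the new 1-based id first.
# `anonymized` is a hypothetical name used only for this sketch.
anonymized = df.drop(columns='client_id')
anonymized.insert(0, 'id', codes + 1)

print(anonymized)
```

Note that `uniques` still holds the original ids in code order, so it should be discarded (not stored next to the result) if the mapping must not be recoverable.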

numpy.unique

import numpy as np

df.assign(client_id=np.unique(df.client_id, return_inverse=True)[1] + 1)

  action  client_id        date
0  test1          1  2018-08-20
1  test2          2  2018-08-22
2  test3          1  2018-08-21
3  test4          2  2018-08-21
4  test5          3  2018-08-18
5  test6          2  2018-08-20
6  test7          1  2018-08-18
7  test8          3  2018-08-19
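
The two approaches give the same labels here only because the sample ids happen to first appear in ascending order. As discussed in the comments, `pd.factorize` numbers values by first appearance unless `sort=True` is passed, while `np.unique` always works from the sorted unique values. A small sketch of the difference, using the made-up values 30, 50, 20 mentioned in the comments:

```python
import numpy as np
import pandas as pd

# Illustrative values only; they are not from the question's data.
s = pd.Series([30, 50, 20, 30])

# Codes follow the order of first appearance: 30 -> 0, 50 -> 1, 20 -> 2
codes_appearance, _ = pd.factorize(s)

# Codes follow the sorted unique values: 20 -> 0, 30 -> 1, 50 -> 2
codes_sorted, _ = pd.factorize(s, sort=True)

# np.unique sorts as well, so its inverse indices match sort=True
_, codes_unique = np.unique(s.to_numpy(), return_inverse=True)

print(codes_appearance)  # [0 1 2 0]
print(codes_sorted)      # [1 2 0 1]
print(codes_unique)      # [1 2 0 1]
```

Either way each client gets one consistent label, which is all the question requires; the choice only affects which client receives which number.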
piRSquared