Finding unique combinations of columns from a dataframe

Question

In my below data set, I need to find unique sequences and assign them a serial no ..

DataSet :

user    age maritalstatus   product
A   Young   married 111
B   young   married 222
C   young   Single  111
D   old single  222
E   old married 111
F   teen    married 222
G   teen    married 555
H   adult   single  444
I   adult   single  333

Expected output:

young   married     0
young   single      1
old     single      2
old     married     3
teen    married     4
adult   single      5

After finding the unique values like shown above, if I pass a new user like below,

user age maritalstatus  
X     young  married

it should return me the products as a list .

X : [111, 222]

if there is no sequence, like below

user     age     maritalstatus  
    Y     adult  married

it should return me an empty list

Y : []

`Young` and `Single` is `young` and `single` ? Or need lowercase? — jezrael, Mar 05 '18 at 12:10
look also at wes mckinney's [blog entry](http://wesmckinney.com/blog/filtering-out-duplicate-dataframe-rows/) — manandearth, Mar 05 '18 at 12:10

jezrael · Accepted Answer · 2018-03-05T13:29:41.693

18

First select only columns for output and add drop_duplicates, last add new column by range:

df = df[['age','maritalstatus']].drop_duplicates()
df['no'] = range(len(df.index))
print (df)
     age maritalstatus  no
0  Young       married   0
1  young       married   1
2  young        Single   2
3    old        single   3
4    old       married   4
5   teen       married   5
7  adult        single   6

If want convert all values to lowercase first:

df = df[['age','maritalstatus']].apply(lambda x: x.str.lower()).drop_duplicates()
df['no'] = range(len(df.index))
print (df)
     age maritalstatus  no
0  young       married   0
2  young        single   1
3    old        single   2
4    old       married   3
5   teen       married   4
7  adult        single   5

EDIT:

First convert to lowercase:

df[['age','maritalstatus']] = df[['age','maritalstatus']].apply(lambda x: x.str.lower())
print (df)
  user    age maritalstatus  product
0    A  young       married      111
1    B  young       married      222
2    C  young        single      111
3    D    old        single      222
4    E    old       married      111
5    F   teen       married      222
6    G   teen       married      555
7    H  adult        single      444
8    I  adult        single      333

And then use merge for unique product converted to list:

df2 = pd.DataFrame([{'user':'X', 'age':'young', 'maritalstatus':'married'}])
print (df2)
     age maritalstatus user
0  young       married    X

a = pd.merge(df, df2, on=['age','maritalstatus'])['product'].unique().tolist()
print (a)
[111, 222]

df2 = pd.DataFrame([{'user':'X', 'age':'adult', 'maritalstatus':'married'}])
print (df2)
     age maritalstatus user
0  adult       married    X

a = pd.merge(df, df2, on=['age','maritalstatus'])['product'].unique().tolist()
print (a)
[]

But if need column use transform:

df['prod'] = df.groupby(['age', 'maritalstatus'])['product'].transform('unique')
print (df)
  user    age maritalstatus  product        prod
0    A  young       married      111  [111, 222]
1    B  young       married      222  [111, 222]
2    C  young        single      111       [111]
3    D    old        single      222       [222]
4    E    old       married      111       [111]
5    F   teen       married      222  [222, 555]
6    G   teen       married      555  [222, 555]
7    H  adult        single      444  [444, 333]
8    I  adult        single      333  [444, 333]

EDIT1:

a = (pd.merge(df, df2, on=['age','maritalstatus'])
       .groupby('user_y')['product']
       .apply(lambda x: x.unique().tolist())
       .to_dict())
print (a)
{'X': [111, 222]}

Detail:

print (pd.merge(df, df2, on=['age','maritalstatus']))
  user_x    age maritalstatus  product user_y
0      A  young       married      111      X
1      B  young       married      222      X

edited Mar 05 '18 at 13:29

answered Mar 05 '18 at 12:09

jezrael

822,522
95
1,334
1,252

whats `user_y` ?? – pylearner Mar 05 '18 at 13:25
For me `pd.merge(df, df2, on=['age','maritalstatus'])` return `user_x` and `user_y`, so was used `user_y`. – jezrael Mar 05 '18 at 13:26
@pylearner - I edit answer , there are in putput 2 columns for users `user_x` and `user_y` what is used for build dictionary. – jezrael Mar 05 '18 at 13:30
for my dataframe, in `user_y`, I have multiple users matching ... its throwing me an error – pylearner Mar 05 '18 at 13:32
I have idea, give me sec – jezrael Mar 05 '18 at 13:33
1

@pylearner - Can you check `a = pd.merge(df[['age','maritalstatus', 'product']], df2, on=['age','maritalstatus']).groupby('user')['product'].apply(lambda x: x.unique().tolist()).to_dict()` ? – jezrael Mar 05 '18 at 13:34
can you help on this ....https://stackoverflow.com/questions/49124436/selecting-only-required-keys-from-a-dictionary-using-a-dataframe – pylearner Mar 06 '18 at 06:12
Jez, can we modify the above code ? I have millions of customers and its throwing a memory error – pylearner Mar 06 '18 at 07:59
@pylearner - Unfortunately in pandas it is not easy :( I think problem is your data are really large and small RAM of your PC. Solutions are 3 - add RAM(practically very often problematic), use `dask`, `blaze` or some another technology, or processes data by chunks (e.g. by 500MB), drop duplicates for each chunk and last concat output. – jezrael Mar 06 '18 at 08:07
@pylearner - Aslo you can check [this](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) – jezrael Mar 06 '18 at 08:23
hey if I pass the same user A in my DF2 in the above, it should return me [222] as he has already purchased [111] .. can we achieve this ? – pylearner Mar 06 '18 at 10:09
@pylearner - Not sure if understand, can you create new question? – jezrael Mar 06 '18 at 10:14
https://stackoverflow.com/questions/49128817/grouping-data-using-unique-combinations – pylearner Mar 06 '18 at 10:40
it is almost the same as yesterday – pylearner Mar 06 '18 at 10:40
are you looking into it ? – pylearner Mar 06 '18 at 11:03
@pylearner - I check it, really complicated. I need more time. – jezrael Mar 06 '18 at 11:29
@pylearner - I start working with it. But dont understand `A: [222] - as user A has already purchased 111` - Why? Because in second DataFrame is no information about `product`. Can you explain more? – jezrael Mar 06 '18 at 11:46
If I pass the information of user A again, generally it will output `A: [111,222]`, as per the code you provided above. But what I want is, as A already has product 111 in the dataframe, the age and marital status is matching with `user B`.. so it should return only `[222]` – pylearner Mar 06 '18 at 11:57
New one, if this can be done... we can ignore the earlier one. https://stackoverflow.com/questions/49130520/subtracting-values-of-a-list-in-a-dictionary-using-a-dataframe – pylearner Mar 06 '18 at 12:11
Jez, i had a memory error and was not able to merge... please look into this question. Help me out :( ... https://stackoverflow.com/questions/49137714/assign-the-values-which-are-not-present-already-in-the-column – pylearner Mar 06 '18 at 18:33
I guess you are sleeping ? – pylearner Mar 06 '18 at 19:01
@pylearner - now I check my phone, but I am offline with PC. Allen solution failed? Is neccesary something else? Tomorrow I try help, but this is not easy, unfortu.ately :( – jezrael Mar 06 '18 at 20:24
yes this code `a = pd.merge(df[['age','maritalstatus', 'product']], df2, on=['age','maritalstatus']).groupby('user')['product'].apply(lambda x: x.unique().tolist()).to_dict()` gives me memory error. Please try and help me out. Thanks – pylearner Mar 07 '18 at 02:54

jpp · Answer 2 · 2018-03-05T12:36:08.023

One way is pd.factorize. Note I convert columns to lower case first for results to make sense.

for col in ['user', 'age', 'maritalstatus']:
    df[col] = df[col].str.lower()

df['category'] = list(zip(df.age, df.maritalstatus))
df['category'] = pd.factorize(df['category'])[0]

#   user    age maritalstatus  product  category
# 0    a  young       married      111         0
# 1    b  young       married      222         0
# 2    c  young        single      111         1
# 3    d    old        single      222         2
# 4    e    old       married      111         3
# 5    f   teen       married      222         4
# 6    g   teen       married      555         4
# 7    h  adult        single      444         5
# 8    i  adult        single      333         5

Finally, drop duplicates:

df_cats = df[['age', 'maritalstatus', 'category']].drop_duplicates()

#      age maritalstatus  category
# 0  young       married         0
# 2  young        single         1
# 3    old        single         2
# 4    old       married         3
# 5   teen       married         4
# 7  adult        single         5

To map a list of products, try this:

s = df.groupby(['age', 'maritalstatus'])['product'].apply(list)
df['prod_catwise'] = list(map(s.get, zip(df.age, df.maritalstatus)))

Another option is to use categorical data, which I highly recommend for workflows. You can easily extract codes from a categorical series via pd.Series.cat.codes.

Finding unique combinations of columns from a dataframe

2 Answers2

Linked