I have a dataset with a lot of email addresses and I want to change this:

import pandas as pd

df = pd.DataFrame([('aatest@gmail.com', 0, 3.0), ('aatest@gmail.com', 1, 2.0),
                   ('aatest@gmail.com', 1, 3.0), ('bbtest@gmail.com', 1, 1.0),
                   ('cctest@gmail.com', 2, 5.0)])

df
0  aatest@gmail.com  0  3
1  aatest@gmail.com  1  2
2  aatest@gmail.com  1  3
3  bbtest@gmail.com  1  1
4  cctest@gmail.com  2  5

to this:

df2 = pd.DataFrame(
    [(0, 0, 3.0), (0, 1, 2.0), (0, 1, 3.0), (1, 1, 1.0), (2, 2, 5.0)])

df2
   0  1  2
0  0  0  3
1  0  1  2
2  0  1  3
3  1  1  1
4  2  2  5

i.e., change each email to a number, so that the same email always gets the same number.

How can I do this?

jezrael
Kardu

1 Answer

Use factorize:

df[0] = pd.factorize(df[0])[0]

print(df)

   0  1  2
0  0  0  3
1  0  1  2
2  0  1  3
3  1  1  1
4  2  2  5
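
As a self-contained sketch of the factorize approach (assuming pandas is imported as pd and the email column is labeled 0, as in the question; the names codes and uniques are only illustrative), pd.factorize returns both the integer codes and the unique values, so the original emails can be recovered later:

```python
import pandas as pd

df = pd.DataFrame([('aatest@gmail.com', 0, 3.0), ('aatest@gmail.com', 1, 2.0),
                   ('aatest@gmail.com', 1, 3.0), ('bbtest@gmail.com', 1, 1.0),
                   ('cctest@gmail.com', 2, 5.0)])

# factorize returns (codes, uniques); codes are assigned
# in order of first appearance, starting at 0.
codes, uniques = pd.factorize(df[0])
df[0] = codes

print(df[0].tolist())   # [0, 0, 0, 1, 2]

# uniques[code] gives back the original email for each code.
print(list(uniques))
```

Keeping uniques around is optional, but it is a cheap way to map the numbers back to emails if that is ever needed.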

Or rank:

# rank returns floats, so cast back to int to get integer codes
df[0] = (df[0].rank(method='dense') - 1).astype(int)
print(df)

   0  1  2
0  0  0  3
1  0  1  2
2  0  1  3
3  1  1  1
4  2  2  5
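
The two approaches give the same result here only because the emails already appear in sorted order. A small sketch of the difference, using made-up example.com addresses that are not in the question's data: factorize numbers values by first appearance, while a dense rank numbers them by sorted order (which factorize can also do with sort=True).

```python
import pandas as pd

# Hypothetical emails, deliberately not in alphabetical order.
emails = pd.Series(['cc@example.com', 'aa@example.com',
                    'cc@example.com', 'bb@example.com'])

# Codes in order of first appearance.
by_appearance = pd.factorize(emails)[0].tolist()

# Codes in sorted (alphabetical) order via dense rank.
by_rank = (emails.rank(method='dense') - 1).astype(int).tolist()

# factorize can produce the sorted numbering as well.
by_sorted_factorize = pd.factorize(emails, sort=True)[0].tolist()

print(by_appearance)        # [0, 1, 0, 2]
print(by_rank)              # [2, 0, 2, 1]
print(by_sorted_factorize)  # [2, 0, 2, 1]
```

Either numbering keeps the same email on the same number; which one to use depends only on whether the numbers should follow appearance order or alphabetical order.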
jezrael