I want to delete the substring between a '+' and the '@' symbol, together with the '+' itself, whenever a '+' exists.

d = {'1' : 'dsjlskdgj+fdfsd@test.com', '2' : 'qwioept@test.com', '3' : 'dccnvmxcv+fas@test.com', '4':'dqlt@test.com'}

test_frame = pd.Series(d)

test_frame
Out[6]: 
1    dsjlskdgj+fdfsd@test.com
2            qwioept@test.com
3      dccnvmxcv+fas@test.com
4               dqlt@test.com
dtype: object

So, the result should be:

s = {'1' : 'dsjlskdgj@test.com', '2' : 'qwioept@test.com', '3' : 'dccnvmxcv@test.com', '4':'dqlt@test.com'}

test_frame_result = pd.Series(s)

test_frame_result
Out[10]: 
1    dsjlskdgj@test.com
2      qwioept@test.com
3    dccnvmxcv@test.com
4         dqlt@test.com
dtype: object

I tried it with split, but it fails because only some lines contain a '+'.

Is there an elegant solution that avoids looping through all the lines? The original dataset has quite a lot of them.

Thanks!

maxtenzin
  • If you don't "loop through all the lines" how can you process all of them? – user202729 Feb 06 '18 at 15:24
  • Does [this](https://stackoverflow.com/questions/4444477/how-to-tell-if-a-string-contains-a-certain-character-in-javascript) solve your problem "only some lines contain a +"? – user202729 Feb 06 '18 at 15:24
  • Have to execute this in Pandas. – maxtenzin Feb 06 '18 at 15:26
  • Sorry, wrong language. – user202729 Feb 06 '18 at 15:27
  • Regarding the first comment: if I only wanted the first 5 letters, I could do that without looping: test_frame_result.str[:5] – maxtenzin Feb 06 '18 at 15:27
  • What about [this](https://stackoverflow.com/questions/26577516/pandas-test-if-string-contains-one-of-the-substrings-in-a-list)? Also implicitly the slice operator is (most likely) implemented using loops. Just that a loop in C is (often) faster than a loop in a higher level language. – user202729 Feb 06 '18 at 15:28

2 Answers

Is this sufficient?

import pandas as pd

d = {'1': 'dsjlskdgj+fdfsd@test.com',
     '2': 'qwioept@test.com',
     '3': 'dccnvmxcv+fas@test.com',
     '4': 'dqlt@test.com'}

test_frame = pd.Series(d)
print(test_frame)

# Select only the rows that contain a literal '+'
found = test_frame[test_frame.str.contains(r'\+')]
# Drop everything from the '+' up to (but not including) the '@'
test_frame[found.index] = found.str.replace(r'\+[^@]*', "", regex=True)
print(test_frame)

Output:

(Before)

1    dsjlskdgj+fdfsd@test.com
2            qwioept@test.com
3      dccnvmxcv+fas@test.com
4               dqlt@test.com
dtype: object

(After)

1    dsjlskdgj@test.com
2      qwioept@test.com
3    dccnvmxcv@test.com
4         dqlt@test.com
dtype: object
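
The filtering step above may not be strictly necessary: a regex replacement leaves strings without a match unchanged, so the pattern can be applied to the whole Series in one call. The following is a minimal sketch under that assumption. The names `emails` and `cleaned` are illustrative only, and `regex=True` is passed explicitly because newer pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Same sample data as in the question
emails = pd.Series({
    '1': 'dsjlskdgj+fdfsd@test.com',
    '2': 'qwioept@test.com',
    '3': 'dccnvmxcv+fas@test.com',
    '4': 'dqlt@test.com',
})

# Remove a literal '+' and every following character that is not '@'.
# Rows without a '+' have nothing to match and are returned unchanged.
cleaned = emails.str.replace(r'\+[^@]*', '', regex=True)
print(cleaned)
```

Note that this pattern would also strip a '+' that appears after the '@'. If that can occur in the data, a lookahead variant such as `r'\+[^@]*(?=@)'` restricts the match to a '+' that is followed by an '@'.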

Found a solution - probably not the most elegant though:

import pandas as pd

test_frame = pd.DataFrame({'email':['dsjlskdgj+fdfsd@test.com','qwioept@test.com','dccnvmxcv+fas@test.com','dqlt@test.com']})

test_frame
Out[22]: 
                      email
0  dsjlskdgj+fdfsd@test.com
1          qwioept@test.com
2    dccnvmxcv+fas@test.com
3             dqlt@test.com

# Rows whose address contains a literal '+'
has_plus = test_frame.email.str.contains(r'\+')
plus_rows = test_frame.loc[has_plus, 'email']
# Keep the text before the '+' and the text after the '@', rejoined with '@'
test_frame.loc[has_plus, 'email'] = plus_rows.str.partition('+')[0] + '@' + plus_rows.str.partition('+')[2].str.partition('@')[2]

test_frame
Out[24]: 
                email
0  dsjlskdgj@test.com
1    qwioept@test.com
2  dccnvmxcv@test.com
3       dqlt@test.com
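
The partition idea can possibly be made to work on every row without a mask by splitting on '@' first and only then cutting the local part at '+'. The sketch below assumes every address contains exactly one '@'; a row without one would end up with a trailing '@'. The helper names `parts` and `local` are illustrative and not from the original post.

```python
import pandas as pd

test_frame = pd.DataFrame({'email': ['dsjlskdgj+fdfsd@test.com',
                                     'qwioept@test.com',
                                     'dccnvmxcv+fas@test.com',
                                     'dqlt@test.com']})

# Split each address at the first '@': column 0 is the local part,
# column 1 the separator, column 2 the domain.
parts = test_frame['email'].str.partition('@')

# Cut the local part at the first '+'. When there is no '+',
# partition returns the whole string in column 0, so nothing is lost.
local = parts[0].str.partition('+')[0]

test_frame['email'] = local + '@' + parts[2]
print(test_frame)
```

Because `str.partition` returns the full string when the separator is missing, rows without a '+' pass through unchanged, which is the case the plain split attempt in the question struggled with.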
maxtenzin