I have a dataset where there are column with alphanumeric values. I am able to filter the unmatched records now I want to clean them, that means for example the unmatched record is 123*abc&
then it should remove 123abc
. I have done it but I don't think this is a correct way to do it and also the data is combined after the final results I can use for loop for getting them correctly but that will be a slow process. Hence looking for a easier way (column by column cleaning). Is it possible to do so?
data = ['abc123','abc*123&','Abc123','ABC@*&123',np.nan,'123*Abc']
df=pd.DataFrame(data, columns=['a'])
print(df)
a
0 abc123
1 abc*123&
2 Abc123
3 ABC@*&123
4 NaN
5 123*Abc
Filtering unmatched records:
wrong=df[~df['a'].str.contains(numeric, na=True)]
print(wrong)
a
1 abc*123&
3 ABC@*&123
5 123*Abc
wrong_index = wrong.index
result = ''.join(i for i in wrong['a'] if not i.isalpha())
alphanumeric = [character for character in result if character.isalnum()]
alphanumeric = "".join(alphanumeric)
df['a'].loc[wrong_index]=alphanumeric
print(df)
a
0 abc123
1 abc123ABC123123Abc
2 Abc123
3 abc123ABC123123Abc
4 NaN
5 abc123ABC123123Abc
I know why is this happening, can be resolved by using for or loop through each row but it consumes a lot of time. Is there any way where we can do column by column clean up?
Excepted output:
a
0 abc123
1 abc123
2 Abc123
3 ABC123
4 NaN
5 123Abc