1

I have a dataset where there are column with alphanumeric values. I am able to filter the unmatched records now I want to clean them, that means for example the unmatched record is 123*abc& then it should remove 123abc. I have done it but I don't think this is a correct way to do it and also the data is combined after the final results I can use for loop for getting them correctly but that will be a slow process. Hence looking for a easier way (column by column cleaning). Is it possible to do so?

data = ['abc123','abc*123&','Abc123','ABC@*&123',np.nan,'123*Abc']
df=pd.DataFrame(data, columns=['a'])
print(df)
       a
0     abc123
1   abc*123&
2     Abc123
3  ABC@*&123
4        NaN
5    123*Abc

Filtering unmatched records:

wrong=df[~df['a'].str.contains(numeric, na=True)]
print(wrong)
        a
1   abc*123&
3  ABC@*&123
5    123*Abc


wrong_index = wrong.index
result = ''.join(i for i in wrong['a'] if not i.isalpha())  
alphanumeric = [character for character in result if character.isalnum()]
alphanumeric = "".join(alphanumeric)
df['a'].loc[wrong_index]=alphanumeric
print(df)
     a
0   abc123
1   abc123ABC123123Abc
2   Abc123
3   abc123ABC123123Abc
4   NaN
5   abc123ABC123123Abc

I know why is this happening, can be resolved by using for or loop through each row but it consumes a lot of time. Is there any way where we can do column by column clean up?

Excepted output:

       a
0     abc123
1     abc123
2     Abc123
3     ABC123
4        NaN
5     123Abc
Pysdm
  • 35
  • 6

1 Answers1

0

Using plain vanilla Python using the built-in Regex module re will do. See this demo on IDEone: using regular-expression to replace list elements

import re

data = ['abc123','abc*123&','Abc123','ABC@*&123','123*Abc']
cleaned = [re.sub('\W', '', item) for item in data]
print(cleaned)

The script will output:

['abc123', 'abc123', 'Abc123', 'ABC123', '123Abc']

Explained

  1. The re.sub function substitutes a given string (here: item) like search & replace.
  2. The search is specified by regular-expression: \W all non-word characters (i.e. non numerical digits, non alpha letters).
  3. The replace is specified by an empty string to simply remove the found: ''.
  4. The for-loop is implemented as list-comprehension, a ideomatic or pythonic way for iterating through elements of a list.

Filter parts of above

If you like to filter to only parts, like only alphabetic or only numeric chars, then you need to combine metacharacters of PCRE like in this demo on IDEone:

import re

data = ['abc123','abc*123&','Abc123','ABC@*&123','123*Abc', '123_abc', '123 abc']

# replace non-alphas and non-digits; filter [A-Za-z0-9_]
alphanumeric_underscore = [re.sub('\W', '', item) for item in data]
print('alphanumeric_underscore', alphanumeric_underscore)

# replace also the underscore; filter [A-Za-z0-9]
alphanumeric = [re.sub('[\W_]', '', item) for item in data]
print('alphanumeric', alphanumeric)

# filter only digits
numeric = [re.search(r"\d+", item).group(0) for item in data]
print('numeric', numeric)

# filter only alphas
alpha = [re.search(r"[A-Za-z]+", item).group(0) for item in data]
print('alpha', alpha)

It will output:

alphanumeric_underscore ['abc123', 'abc123', 'Abc123', 'ABC123', '123Abc', '123_abc', '123abc'] alphanumeric ['abc123', 'abc123', 'Abc123', 'ABC123', '123Abc', '123abc', '123abc'] numeric ['123', '123', '123', '123', '123', '123', '123'] alpha ['abc', 'abc', 'Abc', 'ABC', 'Abc', 'abc', 'abc']

It uses the regex-search re.search with regex (prefixed as raw-string) r"\d+" to return all found occurrences .group(0), hence filter.

See also

hc_dev
  • 8,389
  • 1
  • 26
  • 38
  • Out of a doubt. In the same scenario, how do we remove integers – Pysdm Mar 24 '21 at 19:58
  • @Pysdm @Pysdm Whithout punctuation, I understand this as "rhetoric" question. Integers can be equally removed by _substituting_ the regex for _digits_ [`\d+`](https://regex101.com/r/ZbNITh/1) or the opposite: filtering using regex search for [any non-digits](https://en.m.wikibooks.org/wiki/Regular_Expressions/Perl-Compatible_Regular_Expressions), using the matching _PCRE metachar_ (it's analogue to `\w` matching _word-chars_ negated to `\W` .. so `\d` becomes ❔). – hc_dev Mar 24 '21 at 21:35
  • Sorry, that's a digit but when I am trying to remove both integers and special character it isn't working. When I gave just r'\d+' it is removing integers and for removing alphabets r'[a-zA-Z]+' but when I want to remove for int,spcl char it is not removing. I gave a pattern r'[$&+,:;=?@#|<>.^*()%!-][\d+]' it's removing – Pysdm Mar 24 '21 at 21:52
  • 1
    I got it. ([^\d]+?)' for removing integers and ([^a-zA-z]+?)' for alphabets – Pysdm Mar 24 '21 at 22:00
  • @Pysdm You got it ️ Even simpler, without parentheses/capture-group `()`, negation-operator `^` or greedy-[quantifier](https://www.rexegg.com/regex-quantifiers.html#cheat_sheet) `?`. For example to __keep only the special-chars__ can be turned into "remove [digits and alphas](https://regex101.com/r/ubX4jW/1)" by `re.sub('[A-Za-z\d]', '', s)`. – hc_dev Mar 25 '21 at 06:51