Remove all the characters and numbers except comma

Question

I am trying to remove all the special characters and numbers from the strings in a DataFrame column while keeping the commas, but my code removes everything, including the commas.

I know the question has been asked before, but I tried many of the answers and they all remove the commas as well.

df[new_text_field_name] = df[new_text_field_name].apply(lambda elem: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", str(elem)))

sample text:

'100 % polyester, Paperboard (min. 30% recycled), 100% polypropylene',

the required output:

' polyester, Paperboard , polypropylene',
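
The comma is lost because the second alternative of the pattern, [^0-9A-Za-z \t], matches every character that is not a letter, digit, space, or tab, and a comma is one of those. Below is a minimal sketch of one way to reach the required output. It assumes that parenthesised notes should be dropped entirely and that runs of spaces may be collapsed; the helper name strip_except_commas is hypothetical and not part of the original code.

```python
import re

sample = "100 % polyester, Paperboard (min. 30% recycled), 100% polypropylene"


def strip_except_commas(text):
    """Keep only letters, commas and single spaces (hypothetical helper)."""
    # Assumption: parenthesised notes such as "(min. 30% recycled)" go entirely.
    text = re.sub(r"\([^)]*\)", "", text)
    # The comma is listed inside the negated class, so it is not matched and
    # therefore not removed; digits are not listed, so they are removed.
    text = re.sub(r"[^A-Za-z, ]", "", text)
    # Collapse the runs of spaces left behind by the removed characters.
    return re.sub(r" {2,}", " ", text)


print(repr(strip_except_commas(sample)))
# ' polyester, Paperboard , polypropylene'
```

On a DataFrame column this could be applied with df[new_text_field_name].apply(strip_except_commas), in the same way as the lambda in the question.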

Can you point to where you found the question already asked before? — mkrieger1, Mar 27 '22 at 17:31
If you want to remove all characters except commas, can you not just count the commas and replace the entire string with that number of commas? — mkrieger1, Mar 27 '22 at 17:32
That would not be a good idea. I suggest simply checking whether each character is a comma or a letter, and removing it otherwise. — The Myth, Mar 27 '22 at 17:38
@mkrieger1 https://stackoverflow.com/questions/39672094/how-to-remove-all-special-character-in-a-string-except-dot-and-comma https://stackoverflow.com/questions/16326695/how-to-match-all-special-characters-except-a-comma — rooya sh, Mar 27 '22 at 17:53

gremur · Accepted Answer · 2022-04-02T09:40:50.010

2

A possible solution is the following:

# pip install pandas

import re  # needed by cleanup() below

import pandas as pd
pd.set_option('display.max_colwidth', 200)

# set test data and create dataframe
data = {"text": ['100 % polyester, Paperboard (min. 30% recycled), 100% polypropylene','Polypropylene plastic', '100 % polyester, Paperboard (min. 30% recycled), 100% polypropylene', 'Bamboo, Clear nitrocellulose lacquer', 'Willow, Stain, Solid wood, Polypropylene plastic, Stainless steel, Steel, Galvanized, Steel, 100% polypropylene', 'Banana fibres, Clear lacquer', 'Polypropylene plastic (min. 20% recycled)']}
df = pd.DataFrame(data)

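def cleanup(txt):
    # keep only letters, commas, spaces and parentheses (case-insensitive)
    re_pattern = re.compile(r"[^a-z, ()]", re.I)
    # drop everything else, then collapse the double spaces left behind
    return re.sub(re_pattern, "", txt).replace("  ", " ").strip()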

df['text_cleaned'] = df['text'].apply(cleanup)
df

Returns
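
As a quick check that does not need pandas, the same function can be applied to a few of the sample strings directly. This is only a sketch of the cleaned values for three of the inputs, with the pattern compiled once at module level.

```python
import re

# Same pattern as in the answer: keep letters, commas, spaces and parentheses.
re_pattern = re.compile(r"[^a-z, ()]", re.I)


def cleanup(txt):
    return re_pattern.sub("", txt).replace("  ", " ").strip()


samples = [
    "100 % polyester, Paperboard (min. 30% recycled), 100% polypropylene",
    "Polypropylene plastic (min. 20% recycled)",
    "Bamboo, Clear nitrocellulose lacquer",
]
for s in samples:
    print(cleanup(s))
# polyester, Paperboard (min recycled), polypropylene
# Polypropylene plastic (min recycled)
# Bamboo, Clear nitrocellulose lacquer
```

Note that .replace("  ", " ") only collapses pairs of spaces, so a run of three spaces would still leave two; re.sub(r" {2,}", " ", ...) is one way to handle longer runs.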

edited Apr 02 '22 at 09:40

answered Mar 27 '22 at 17:52

gremur

Is there any way to keep the " (min recycled)" part? Also, here are a few more sample strings: 'Polypropylene plastic', '100 % polyester, Paperboard (min. 30% recycled), 100% polypropylene', 'Bamboo, Clear nitrocellulose lacquer', 'Willow, Stain, Solid wood, Polypropylene plastic, Stainless steel, Steel, Galvanized, Steel, 100% polypropylene', 'Banana fibres, Clear lacquer', 'Polypropylene plastic (min. 20% recycled)', – rooya sh Mar 27 '22 at 18:14
@rooyash, please take a look at the updated code. It is a slightly different solution that removes everything except the required data. – gremur Mar 27 '22 at 19:08

score -1 · Answer 2 · answered Mar 27 '22 at 18:10

The Character.isDigit() and Character.isLetter() functions can be used to check whether a character is a digit or a letter.
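
Character.isDigit() and Character.isLetter() are Java methods; the closest Python counterparts are the string methods str.isdigit() and str.isalpha(). A minimal sketch of a character-by-character filter built on that idea follows. The helper name keep_letters_and_commas is hypothetical, and note that str.isalpha() is Unicode-aware, so it also keeps non-ASCII letters.

```python
def keep_letters_and_commas(text):
    """Hypothetical helper: keep letters, commas and spaces only."""
    kept = "".join(ch for ch in text if ch.isalpha() or ch in ", ")
    # split()/join() collapses the runs of spaces left by removed characters
    return " ".join(kept.split())


print(keep_letters_and_commas(
    "100 % polyester, Paperboard (min. 30% recycled), 100% polypropylene"
))
# polyester, Paperboard min recycled, polypropylene
```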

lata singh

This does not resolve the issue. The poster is looking to remove the characters, not simply identify them. – D-S Mar 28 '22 at 03:05
