How to drop and keep only certain non alphanumeric characters?

Question

I have a df that looks like this:

email                                    id
{'email': ['test@test.com']}           {'id': ['123abc_d456_789_fgh']}

When I drop non-alphanumeric characters like so:

df.email = df.email.str.replace('[^a-zA-Z]', '')
df.email = df.email.str.replace('email', '')


df.id = df.id.str.replace('[^a-zA-Z]', '')
df.id = df.id.str.replace('id', '')

The columns look like this:

email                    id
testtestcom              123abcd456789fgh

How do I tell the code to not drop anything in the square brackets but drop all non alpha numeric characters outside the brackets?

The new df should look like this:

email                        id
test@test.com                123abc_d456_789_fgh

score 2 · Accepted Answer · answered Jan 28 '19 at 17:16

This is hardcoded, but works:

df.email = df.email.str.replace(r".+\['|'].+", '', regex=True)
df.id = df.id.str.replace(r".+\['|'].+", '', regex=True)

>>> 'test@test.com'
>>> '123abc_d456_789_fgh'
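
A minimal, self-contained sketch of the approach above, assuming the cells are plain strings shaped like the ones in the question (the DataFrame here is reconstructed for illustration only). The pattern is written as a raw string and `regex=True` is passed explicitly, because newer pandas releases do not treat the pattern as a regular expression by default:

```python
import pandas as pd

# Sample data reconstructed from the question; assumes the cells are strings.
df = pd.DataFrame({
    "email": ["{'email': ['test@test.com']}"],
    "id": ["{'id': ['123abc_d456_789_fgh']}"],
})

# Remove everything up to and including [' and everything from '] onward.
pattern = r".+\['|'].+"

for col in ["email", "id"]:
    df[col] = df[col].str.replace(pattern, "", regex=True)

print(df)
```

This only relies on the `['` and `']` delimiters, so it does not check that the rest of the string has the expected `{'key': [...]}` shape.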

Gianmar

The fourth bird · Answer 2 · 2019-01-28T17:16:13.470

1

Based on the comments, you can capture the text between the square brackets in a capturing group.

In the replacement, use the first capturing group (\1).

\{'[^']+':\s*\['([^][]+)'\]}

That will match

\{ Match { literally
'[^']+' Match ', then any character except ' 1+ times, then '
: Match : literally
\s*\[' Match 0+ whitespace characters, then ['
([^][]+) Capture in group 1 any character except [ or ], 1+ times
'\] Match ']
} Match } literally
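
As a hedged sketch of how the pattern might be used in pandas, assuming string cells like those in the question (the DataFrame and the variable names `cleaned` and `extracted_email` are illustrative, not from the original post). The replacement `r"\1"` refers to the first capturing group. Note that the pattern below is not wrapped in an extra pair of single quotes inside the string; with such quotes it would not match the sample values:

```python
import pandas as pd

# Sample data reconstructed from the question; assumes the cells are strings.
df = pd.DataFrame({
    "email": ["{'email': ['test@test.com']}"],
    "id": ["{'id': ['123abc_d456_789_fgh']}"],
})

# Group 1 captures the value between [' and '].
pattern = r"\{'[^']+':\s*\['([^][]+)'\]}"

# Replace the whole match with the content of group 1.
cleaned = df.copy()
for col in cleaned.columns:
    cleaned[col] = cleaned[col].str.replace(pattern, r"\1", regex=True)

# Alternative: pull group 1 out directly; rows that do not match become NaN.
extracted_email = df["email"].str.extract(pattern, expand=False)

print(cleaned)
print(extracted_email)
```

With `str.replace`, strings that do not match the pattern are left unchanged, whereas `str.extract` returns NaN for them, which can make unexpected rows easier to spot.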

edited Jan 28 '19 at 17:16

answered Jan 28 '19 at 16:27

`df.email = df.email.str.replace('(\[[^][]+\])|[^a-zA-Z]','')` drops everything but the word `email` – RustyShackleford Jan 28 '19 at 16:58
Your question is `How do I tell the code to not drop anything in the square brackets but drop all non alpha numeric characters outside the brackets?` email is inside the square brackets. – The fourth bird Jan 28 '19 at 17:00
email is definitely not in square brackets. It's inside the squiggly bracket, no? – RustyShackleford Jan 28 '19 at 17:01
I thought you wanted to keep email and do that in a separate replace because you used `[^a-zA-Z]`. If you want to remove all, then you might try `(\[[^][]+\])|.` [demo](https://regex101.com/r/YEJcBP/1) In that case, why not just match the values? `\[[^][]+\]` [demo](https://regex101.com/r/YEJcBP/2) – The fourth bird Jan 28 '19 at 17:03
The above line removes everything. I just want to keep the email address and ID as is in the column, that's it. – RustyShackleford Jan 28 '19 at 17:05
So you have all the data and you want to keep the values between the square brackets only like https://regex101.com/r/cqkXwC/1 – The fourth bird Jan 28 '19 at 17:08
that is exactly correct, could you please show me how to use in a replace? – RustyShackleford Jan 28 '19 at 17:08
@RustyShackleford I have updated my answer accordingly. – The fourth bird Jan 28 '19 at 17:14
I am using it in the line like this is it correct? `df.email = df.email.str.replace(r"'\{'[^']+':\s*\['([^][]+)'\]}'",'',regex=True)` currently this is doing nothing to the column, how do I use it in replace properly? – RustyShackleford Jan 28 '19 at 17:16
I think that now you are replacing it with an empty string, try using the first capturing group instead `df.email = df.email.str.replace(r"'\{'[^']+':\s*\['([^][]+)'\]}'",'\\1',regex=True)` – The fourth bird Jan 28 '19 at 17:19
That also did not work, it replaced everything with NaN. See Gianmar's response above – RustyShackleford Jan 28 '19 at 17:21
Yes, that works but does not take the data structure into account. It will also remove just a single `']` and all that precedes and follows. There are examples of the first capturing group in [this page](https://stackoverflow.com/questions/41472951/using-regex-matched-groups-in-pandas-dataframe-replace-function) – The fourth bird Jan 28 '19 at 17:25
