I have a df that looks like this:

email                                    id
{'email': ['test@test.com']}           {'id': ['123abc_d456_789_fgh']}

When I drop non-alphanumeric characters like so:

df.email = df.email.str.replace(r'[^a-zA-Z0-9]', '', regex=True)
df.email = df.email.str.replace('email', '')

df.id = df.id.str.replace(r'[^a-zA-Z0-9]', '', regex=True)
df.id = df.id.str.replace('id', '')

The columns look like this:

email                    id
testtestcom              123abcd456789fgh

How do I tell the code to keep everything inside the square brackets but drop all non-alphanumeric characters outside the brackets?

The new df should look like this:

email                        id
test@test.com                123abc_d456_789_fgh
RustyShackleford

2 Answers

This is hardcoded, but works:

df.email = df.email.str.replace(r".+\['|'].+", '', regex=True)
df.id = df.id.str.replace(r".+\['|'].+", '', regex=True)

>>> 'test@test.com'
>>> '123abc_d456_789_fgh'
Gianmar

Based on the comments, you could capture the text between the square brackets in a capturing group.

In the replacement, use the first capturing group (\1).

\{'[^']+':\s*\['([^][]+)'\]}

That will match

  • \{ Match {
  • '[^']+' Match ', then any character except ' 1+ times, then '
  • : Match : literally
  • \s*\[' Match 0+ whitespace characters, then ['
  • ([^][]+) Capture group 1: match any character except [ or ] 1+ times
  • '\] Match ']
  • } Match } literally
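
As a hedged sketch of how the pattern above could be applied in pandas: this assumes the columns hold plain strings shaped like the sample in the question, and the DataFrame below is rebuilt by hand for illustration. `str.replace` is called with `regex=True` and the `\1` backreference as the replacement.

```python
import pandas as pd

# Hand-built stand-in for the question's data (assumed to be plain strings).
df = pd.DataFrame({
    "email": ["{'email': ['test@test.com']}"],
    "id": ["{'id': ['123abc_d456_789_fgh']}"],
})

# The pattern from this answer; group 1 is the text between [' and '].
pattern = r"\{'[^']+':\s*\['([^][]+)'\]}"

for col in ["email", "id"]:
    # Replace the whole matched structure with the captured value only.
    df[col] = df[col].str.replace(pattern, r"\1", regex=True)

print(df)
```

Note that the pattern has no extra quote characters around the braces. A pattern written as `r"'\{...}'"` would require a literal `'` before `{` and after `}`, which the sample strings do not have, so nothing would match.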

The fourth bird
  • `df.email = df.email.str.replace('(\[[^][]+\])|[^a-zA-Z]','')` drops everything but the word `email` – RustyShackleford Jan 28 '19 at 16:58
  • Your question is `How do I tell the code to not drop anything in the square brackets but drop all non alpha numeric characters outside the brackets?` email is inside the square brackets. – The fourth bird Jan 28 '19 at 17:00
  • email is definitely not in square brackets. It's inside the curly braces, no? – RustyShackleford Jan 28 '19 at 17:01
  • I thought you wanted to keep email and do that in a separate replace because you used `[^a-zA-Z]`. If you want to remove everything else, then you might try `(\[[^][]+\])|.` [demo](https://regex101.com/r/YEJcBP/1) In that case, why not just match the values? `\[[^][]+\]` [demo](https://regex101.com/r/YEJcBP/2) – The fourth bird Jan 28 '19 at 17:03
  • The above line removes everything. I just want to keep the email address and ID as is in the column, that's it. – RustyShackleford Jan 28 '19 at 17:05
  • So you have all the data and you want to keep the values between the square brackets only like https://regex101.com/r/cqkXwC/1 – The fourth bird Jan 28 '19 at 17:08
  • that is exactly correct, could you please show me how to use in a replace? – RustyShackleford Jan 28 '19 at 17:08
  • @RustyShackleford I have updated my answer accordingly. – The fourth bird Jan 28 '19 at 17:14
  • I am using it in a line like this; is it correct? `df.email = df.email.str.replace(r"'\{'[^']+':\s*\['([^][]+)'\]}'",'',regex=True)` Currently this does nothing to the column. How do I use it in replace properly? – RustyShackleford Jan 28 '19 at 17:16
  • I think that now you are replacing it with an empty string, try using the first capturing group instead `df.email = df.email.str.replace(r"'\{'[^']+':\s*\['([^][]+)'\]}'",'\\1',regex=True)` – The fourth bird Jan 28 '19 at 17:19
  • That also did not work; it replaced everything with NaN. See Gianmar's response above. – RustyShackleford Jan 28 '19 at 17:21
  • Yes, that works, but it does not take the data structure into account. It will also remove just a single `']` and all that precedes and follows. There are examples of the first capturing group in [this page](https://stackoverflow.com/questions/41472951/using-regex-matched-groups-in-pandas-dataframe-replace-function) – The fourth bird Jan 28 '19 at 17:25
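
To illustrate the point in the last comment, here is a small sketch using the standard-library `re` module (pandas uses the same regex syntax when `regex=True`). It compares the two patterns from the answers on the question's sample string and on a made-up string with a stray `']`. The names `loose`, `strict`, and `stray` are hypothetical and not part of the original data.

```python
import re

loose = r".+\['|'].+"                      # pattern from the first answer
strict = r"\{'[^']+':\s*\['([^][]+)'\]}"   # pattern from the second answer

well_formed = "{'email': ['test@test.com']}"   # sample from the question
stray = "note'] trailing text"                 # hypothetical malformed value

# On the expected structure both approaches give the same value.
loose_ok = re.sub(loose, "", well_formed)
strict_ok = re.sub(strict, r"\1", well_formed)

# On a value with only a stray '] the loose pattern still deletes text,
# while the structured pattern does not match and leaves it untouched.
loose_stray = re.sub(loose, "", stray)
strict_stray = re.sub(strict, r"\1", stray)

print(loose_ok, strict_ok)   # test@test.com test@test.com
print(loose_stray)           # note
print(strict_stray)          # note'] trailing text
```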