Pandas dataframe: split long regex into multiple lines

Question

I'm using Pandas for some data cleanup, and I have a very long regex which I would like to split into multiple lines. The following works fine in Pandas because it is all on one line:

df['REMARKS'] = df['REMARKS'].replace(to_replace =r'(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)\b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b', value = r'<\g<0>>', regex = True)

However, it is difficult to manage. I've tried the following verbose method which works in regular Python:

df['REMARKS'] = df['REMARKS'].replace(to_replace =r"""(?=[^\])}]*([\[({]|$))
                                                      \b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
                                                      \b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
                                                      (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b""", value = r'<\g<0>>', regex = True)

This does not work in Pandas, though. Any ideas what I'm missing?

Here is some sample text for testing:

GR, MDT, CMR, HLDS, NEXT, NGI @ 25273, COMPTG

FIT 13.72 ON 9-7/8 LNR, LWD[GR,RES,APWD,SONVIS], MDTS (PRESS & SAMP) ROT SWC, TSTG BOP

LWD[GR,RES,APWD,SONVIS], GR, RES, NGI, PPC @ 31937, MDTS (PRESS & SAMP) TKG ROT SWC

LWD[GR,RES] @ 12586, IND, FDC, CNL, GR @ 12586, SWC, RAN CSG, PF 12240-12252, RR (ADDED INFO)

Thanks!

It_is_Chris · Accepted Answer · 2021-01-12T20:10:37.947

1

One option is to create a list of strings and then use join when you call replace

RegEx = [r'(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)',
         r'\b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?',
         r'(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b']

df['REMARKS'] = df['REMARKS'].replace(to_replace=''.join(RegEx), value=r'<\g<0>>', regex=True)

Using re

import re

s = r"""(?=[^\])}]*([\[({]|$))\b(?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL)
         \b(?:\s*(?:,\s*)?(?:(?:or|and)\s+)?
         (?:GR|MDT|CMR|HLDS|NEXT|NGI|MDTS|RES|PPC|IND|FDC|CNL))*\b"""

df['REMARKS'] = df['REMARKS'].replace(to_replace=re.compile(s, re.VERBOSE), value=r'<\g<0>>')

edited Jan 12 '21 at 20:10

answered Jan 12 '21 at 17:09

It_is_Chris

13,504
2
23
41

Thanks for the idea, Chris. It is strange, though, that we'd have to jump through these types of hoops in Pandas. – Heather Jan 12 '21 at 19:22
@Heather it is more of a regex issue. In python, you can use a \ to terminate a line and continue it on the next. However, \ in regex means something differnt so that is not an option. Also using triple quotes `"""` is not an option because each return will insert a `\n` – It_is_Chris Jan 12 '21 at 19:26
If that's the case, what about the recommendation to use the regex verbose method mentioned here https://stackoverflow.com/questions/33211404/python-how-do-i-do-line-continuation-with-a-long-regex? It specifically mentions that the triple quotes denote verbose in regex. – Heather Jan 12 '21 at 19:36
@Heather if you want to use `re.VERBOSE` then you will need to import the `re` package and use `re.compile` on the string I will update my answer to reflect that as an option. Either way, you will still need to "compile" the string whether it is by using `join` or `re.compile` – It_is_Chris Jan 12 '21 at 20:09

Pandas dataframe: split long regex into multiple lines

1 Answers1