Getting a regex function right

Question

I work with the following regex function:

import re

def datesearcher(comment):
    matches = re.findall(
                     """(\d{2}\.Jan.\s\d{4}\sMitarbeiter\s)|(\d{2}\.Feb.\s\d{4}\sMitarbeiter\s)|(\d{2}\.März\s\d{4}\sMitarbeiter\s)
                     |(\d{2}\.Apr.\s\d{4}\sMitarbeiter\s)|(\d{2}\.Mai\s\d{4}\sMitarbeiter\s)|(\d{2}\.Juni\s\d{4}\sMitarbeiter\s)
                     |(\d{2}\.Juli\s\d{4}\sMitarbeiter\s)|(\d{2}\.Aug.\s\d{4}\sMitarbeiter\s)|(\d{2}\.Sep.\s\d{4}\sMitarbeiter\s)
                     |(\d{2}\.Okt.\s\d{4}\sMitarbeiter\s)|(\d{2}\.Nov.\s\d{4}\sMitarbeiter\s)|(\d{2}\.Dez.\s\d{4}\sMitarbeiter\s)""", comment
                     )
    return matches

Basically, I am trying to find dates in a string that are always followed by the same word. An example would be (please excuse the German):

 examplestring = "some text at the beginning 18.Jan 2017 Mitarbeiter some more text following or even more and more and more"

This should return:

[(18.Jan 2017,,,,,,,,,,,)]

Afterwards, I want to apply it to a column of a pandas DataFrame:

df["date"] = df["texts"].apply(datesearcher)

The regex only returns [], even though I tested it with https://regex101.com/. Can anyone help? Thank you!

But you did not use the `re.X` modifier. Add `(?x)` at the pattern start. I am sure you had it on when testing since you say you had matches online. This part of the question is a typo. — Wiktor Stribiżew, Oct 04 '18 at 13:27
:) If you had `(?x)` as you had online, the first comment would make the difference since that solves the real problem. — Wiktor Stribiżew, Oct 04 '18 at 13:37
True, thanks so much :-) Is there any reason why the function would not work with pandas' apply function (the target column is in string format)? It works fine with the test sentence now, but still returns [] when applied to the pandas DataFrame. — cian, Oct 04 '18 at 14:12
Sorry, I did not notice the comment. I hope you made it. I just want to note that the pattern can be contracted to `r'(\d{2}\.(?:Jan|Feb|März|Apr|Mai|Ju[nl]i|Aug|Sep|Okt|Nov|Dez)\.\s\d{4}\sMitarbeiter\s)'` or even `\b(\d{1,2}\.(?:Jan|Feb|März|Apr|Mai|Ju[nl]i|Aug|Sep|Okt|Nov|Dez)\.\s\d{4}\sMitarbeiter)\b`, and you may actually use `df["texts"].str.findall(pattern)`. Also, the dot after months should be escaped if it must match a real, literal dot. — Wiktor Stribiżew, Mar 19 '19 at 17:57
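
To make the two suggestions from the comments concrete, here is a minimal sketch. It is not the asker's final code: `VERBOSE_PATTERN`, `COMPACT_PATTERN`, `MONTHS` and `text` are names made up for illustration, the multi-line pattern is shortened to two months, and the sample text is assumed to have a dot after the month abbreviation (the example string in the question does not). The sketch shows that a pattern spread over several lines only matches when verbose mode is switched on with `(?x)`, and that the contracted pattern from the last comment returns plain strings instead of 12-element tuples.

```python
import re

# Multi-line pattern in the style of the question, shortened to two months.
# The (?x) prefix enables verbose mode, so the line breaks and indentation
# inside the triple-quoted string are ignored instead of matched literally.
VERBOSE_PATTERN = r"""(?x)
    (\d{2}\.Jan\.\s\d{4}\sMitarbeiter\s)
    |(\d{2}\.Feb\.\s\d{4}\sMitarbeiter\s)
"""

# Contracted pattern from the last comment: a single capture group with the
# month names as a non-capturing alternation. As written in that comment,
# it expects a literal dot after every month name.
MONTHS = "Jan|Feb|März|Apr|Mai|Ju[nl]i|Aug|Sep|Okt|Nov|Dez"
COMPACT_PATTERN = r"(\d{2}\.(?:" + MONTHS + r")\.\s\d{4}\sMitarbeiter\s)"


def datesearcher(comment):
    # With exactly one capture group, findall returns a list of strings.
    return re.findall(COMPACT_PATTERN, comment)


# Hypothetical sample text (assumption: dot after "Jan").
text = "some text at the beginning 18.Jan. 2017 Mitarbeiter some more text"

# Same pattern text without (?x): the newline and spaces become part of the
# pattern, so nothing matches in a single-line string.
without_flag = re.findall(VERBOSE_PATTERN.replace("(?x)", ""), text)
with_flag = re.findall(VERBOSE_PATTERN, text)

print(without_flag)        # []
print(with_flag)           # [('18.Jan. 2017 Mitarbeiter ', '')]
print(datesearcher(text))  # ['18.Jan. 2017 Mitarbeiter ']
```

On a DataFrame, `df["texts"].str.findall(COMPACT_PATTERN)` from the last comment should produce the same per-row lists as `df["texts"].apply(datesearcher)`. The thread does not resolve why the real column still returned []. One thing that may be worth checking is whether the real texts actually contain the dot after the month abbreviation, since this pattern requires it.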

0 Answers