I have a list of strings, and I want to extract three patterns from each one and form a DataFrame. Each string has three parts: the first part is s_, t_q_, or NaN; the middle part can be any letters, digits, and _, but cannot end with abc; the last part should be abc or NaN.

import pandas as pd
import re

str_list = ['s_c45abc','s_ab00_a','t_q_de45abc','t_q_123','t_q_c34b7_da','456a','456abc','456b']

pd.Series(str_list).str.extract(r"(s_|t_q_)?(\w+[^(abc)])(abc)?")


    0       1         2
0   s_      c45       abc
1   s_      ab00_     NaN
2   t_q_    de45      abc
3   t_q_    123       NaN
4   t_q_    c34b7_d   NaN
5   NaN     456       NaN
6   NaN     456       abc
7   NaN     456       NaN

However, the second, fifth, sixth, and last rows are incorrect. The expected result is

    0       1             2
0   s_      c45           abc
1   s_      ab00_a        NaN
2   t_q_    de45          abc
3   t_q_    123           NaN
4   t_q_    c34b7_da      NaN
5   NaN     456a          NaN
6   NaN     456           abc
7   NaN     456b          NaN
nnnnnn0000

1 Answer

You've made some mistakes in your regex which imply that you do not know how regular expressions work. You should review the regular expression tutorials to make sure you understand what all of the parts are doing.

(s_|t_q_)?(\w+[^(abc)])(abc)?

First off, [^(abc)] matches a single character that is not (, a, b, c, or ). Refer to https://stackoverflow.com/a/406408/670693 for how to match a string which does not contain a specific substring, not that you need to, but it seems relevant to what you were trying to do.
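To make the character-class point concrete, here is a small sketch using only the standard re module. The variable names (char_class, rejected, accepted) are mine, chosen for illustration; the pattern and the sample string come from the question.

```python
import re

# [^(abc)] is a negated character class: it matches exactly ONE character
# that is not '(', 'a', 'b', 'c', or ')'. The parentheses do not group here.
char_class = re.compile(r"[^(abc)]")

rejected = [ch for ch in "(abc)" if char_class.fullmatch(ch) is None]
accepted = [ch for ch in "d4_" if char_class.fullmatch(ch) is not None]
print(rejected)  # the five characters the class refuses
print(accepted)  # characters it accepts

# In the original pattern the middle group must therefore END with a
# character other than a/b/c, so the trailing 'a' of 's_ab00_a' is dropped.
m = re.search(r"(s_|t_q_)?(\w+[^(abc)])(abc)?", "s_ab00_a")
print(m.groups())
```

This reproduces the truncated ab00_ from the second row of the question's output.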

What you want to do is pretty odd, but possible to a certain extent. Your (s_|t_q_)? seems fine. The (\w+[^(abc)]) is wrong, as I've stated before; it needs to be changed to (\w+?)??, that is, a non-greedy \w+? inside a non-greedy 0-or-1 (??) group. Because the middle group grows one character at a time, the trailing (abc)? gets the first chance to claim a final "abc", which is enough to make sure that (\w+?)?? does not end with "abc", but only if you have exactly one "abc" at the end. If you have ".*abcabc" then your column 1 will end with "abc" and your column 2 will be "abc" as well. In the case of just "abc" you need the ?? so that your output is NaN, NaN, abc.

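Anchoring the pattern with ^ and $, so that the non-greedy middle group is forced to expand until the whole string is consumed, leaves you with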

^(s_|t_q_)?(\w+?)??(abc)?$
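As a quick check, the sketch below runs this anchored pattern over the question's str_list with the standard re module. The names pattern and rows are illustrative only; unmatched groups come back as None here.

```python
import re

# Anchored pattern: optional prefix, lazy optional middle, optional 'abc'.
pattern = re.compile(r"^(s_|t_q_)?(\w+?)??(abc)?$")

str_list = ['s_c45abc', 's_ab00_a', 't_q_de45abc', 't_q_123',
            't_q_c34b7_da', '456a', '456abc', '456b']

# One tuple of (prefix, middle, suffix) per input string.
rows = [pattern.match(s).groups() for s in str_list]
for row in rows:
    print(row)

# Edge cases discussed above: a bare 'abc', and a doubled 'abc' suffix.
print(pattern.match("abc").groups())      # middle group does not participate
print(pattern.match("xabcabc").groups())  # middle group still ends in 'abc'
```

The same pattern string can be passed to pd.Series(str_list).str.extract(...) as in the question, where groups that do not participate show up as NaN instead of None.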

If you want to make sure that column 1 never ends with "abc", i.e. that all(not e.endswith('abc') for e in df[1].dropna()) is True, then you need to modify your regex to:

^(s_|t_q_)?(\w+?)??(abc)*$

The key change here is (abc)*, which consumes as many abc strings at the end as it can. Note that a repeated capture group keeps only its last repetition, so column 2 is still a single abc. The existing ?? for (\w+?)?? should be enough to make sure that part works as expected.
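A short comparison of the two variants, again using only the standard re module. The test string s_xabcabc and the names single, repeated, samples, and middles are made up for illustration.

```python
import re

single = re.compile(r"^(s_|t_q_)?(\w+?)??(abc)?$")
repeated = re.compile(r"^(s_|t_q_)?(\w+?)??(abc)*$")

s = "s_xabcabc"  # hypothetical input with a doubled 'abc' suffix
single_groups = single.match(s).groups()
repeated_groups = repeated.match(s).groups()
print(single_groups)    # middle group ends in 'abc'
print(repeated_groups)  # middle group is just 'x'

# With (abc)*, no middle group in this sample ends with 'abc'.
samples = ['s_c45abc', '456abc', 'abc', 'abcabc', 'xabcabc', s]
middles = [repeated.match(t).group(2) for t in samples]
no_abc_tail = all(not g.endswith("abc") for g in middles if g is not None)
print(no_abc_tail)
```

The trade-off is that any extra abc repetitions are matched but not kept in any column, since the last group only records its final repetition.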

justhecuke