I am having trouble with pandas' Series.str.findall(). I am trying to go through a report and retrieve all the electrical components. Each report is either a single line or a paragraph, and it mentions components in a standardized way. I am using this code:

failedCoFromReason = rlist['report'].str.findall(r'([CULJRQF]([\dV]{2,4}))', flags=re.IGNORECASE)

It returns the components, but it also repeats the numeric part of each one, like this: [('r919', '919'), ('r920', '920')]

I would like it to return just [('r919'), ('r920')], but I am struggling to get it to work. I am pretty new to pandas and regex and am not sure how to write the search. I have tried greedy and non-greedy quantifiers, but that did not help.

Wiktor Stribiżew

1 Answer

See the Series.str.findall reference:

Equivalent to applying re.findall() to all the elements in the Series/Index.

The re.findall reference says that "if one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group."
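To see the rule quoted above in isolation, here is a small sketch using only the standard re module. The sample string is made up for illustration; the two patterns are the ones from this question and answer.

```python
import re

# Made-up report text, used only for illustration
text = "r919 and R920 failed, c12 ok"

# Two capturing groups: findall returns one tuple per match,
# with one item per group (outer group, then inner group)
with_groups = re.findall(r'([CULJRQF]([\dV]{2,4}))', text, flags=re.IGNORECASE)

# No groups: findall returns each whole match as a plain string
without_groups = re.findall(r'[CULJRQF][\dV]{2,4}', text, flags=re.IGNORECASE)

print(with_groups)     # [('r919', '919'), ('R920', '920'), ('c12', '12')]
print(without_groups)  # ['r919', 'R920', 'c12']
```

The second item in each tuple comes from the inner ([\dV]{2,4}) group, which is where the repeated number comes from.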

So, all you need to do is actually remove all capturing parentheses in this case, as all you need is to get the whole match:

rlist['report'].str.findall(r'[CULJRQF][\dV]{2,4}', flags=re.I)

In other cases, when you need to preserve the group (to quantify it, or to use alternatives), you need to change the capturing groups to non-capturing ones:

rlist['report'].str.findall(r'(?:[CULJRQF](?:[\dV]{2,4}))', flags=re.I)

In this case, though, the groups are redundant, since nothing is quantified or alternated as a unit.
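As a quick check, both patterns can be run against a small Series. This is only a sketch: the DataFrame contents below are invented stand-ins for the real reports, and only the rlist and report names come from the question.

```python
import re
import pandas as pd

# Invented sample data standing in for the asker's reports
rlist = pd.DataFrame({
    "report": [
        "r919 and r920 shorted",
        "no parts named",
        "Replaced C12 and L204",
    ]
})

# Pattern with no groups at all
plain = rlist["report"].str.findall(r'[CULJRQF][\dV]{2,4}', flags=re.I)

# Same pattern with non-capturing groups
non_capturing = rlist["report"].str.findall(
    r'(?:[CULJRQF](?:[\dV]{2,4}))', flags=re.I
)

print(plain.tolist())
# [['r919', 'r920'], [], ['C12', 'L204']]
```

Both calls give the same lists of plain strings, and rows with no component simply get an empty list.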

Wiktor Stribiżew