Capturing pattern in a DataFrame and counting its length

Question

I have a dataframe which looks like this:

>>>df
    Str
0   .....(((((................((((.(...).))))(((((....))))).(((((((((...))))))).))))))).................
1   .(((((..(((....(((((((........)))))))..)))..))).)).((((((((((.((..(((....)))....)).)))))))))).......
2   ((((((.(((((.(((...))))))))))....(((.((((.((.(((....))).)).))))..)))))))(..((((...))))..)...........
3   (((((((.((....((.((.((((..((.......(((...))).((((((((...))))))))....))..)))).)).))....))..)))))))...

I want to capture the portion from the first opening bracket to the last closing bracket. I tried the following code for that:

df["stem"] = df["Str"].str.findall('[(][(.)]+[)]')
df["stem"] = df["stem"].astype("str")

The code does capture the blocks, but each one is stored as a list, so after the string conversion it appears wrapped in square brackets and quotes:

['regexblock']

>>>df
        Str stem
0   .....(((((................((((.(...).))))(((((....))))).(((((((((...))))))).))))))).................    ['(((((................((((.(...).))))(((((....))))).(((((((((...))))))).)))))))']
1   .(((((..(((....(((((((........)))))))..)))..))).)).((((((((((.((..(((....)))....)).)))))))))).......    ['(((((..(((....(((((((........)))))))..)))..))).)).((((((((((.((..(((....)))....)).))))))))))']
2   ((((((.(((((.(((...))))))))))....(((.((((.((.(((....))).)).))))..)))))))(..((((...))))..)...........    ['((((((.(((((.(((...))))))))))....(((.((((.((.(((....))).)).))))..)))))))(..((((...))))..)']
3   (((((((.((....((.((.((((..((.......(((...))).((((((((...))))))))....))..)))).)).))....))..)))))))...    ['(((((((.((....((.((.((((..((.......(((...))).((((((((...))))))))....))..)))).)).))....))..)))))))']

I need to find the length of each block, but these extra characters add 4 to every count. Is there any way to get rid of them while handling the regex?
Thanks in advance.

yes but along with the dots inside it. I need to find the total length of the string. — sloth14, Oct 18 '19 at 07:26
`for index, row in df.iterrows(): print(len(row["stem"]) - 4)` does the trick. But I'm asking for a more efficient solution because I may need to use the block in the future — sloth14, Oct 18 '19 at 07:27
`str.findall` returns a list. Use `df["Str"].str.extract('([(][(.)]+[)])')` instead. — Henry Yik, Oct 18 '19 at 07:28

score 0 · Answer 1 · answered Oct 18 '19 at 07:39

This one should do it:

df['Str'].str.extract(r'(\(.*\))')
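As a minimal sketch of how this fits the length question, the snippet below extracts the block and counts its characters with `str.len()`. The two short sample strings and the column names `stem` and `stem_len` are assumptions made for illustration, not the data from the question.

```python
import pandas as pd

# Hypothetical short samples standing in for the structures in the question.
df = pd.DataFrame({"Str": ["..((..))..", ".(.).((.))."]})

# Greedy match from the first "(" to the last ")".
# expand=False returns a Series rather than a one-column DataFrame.
df["stem"] = df["Str"].str.extract(r"(\(.*\))", expand=False)

# Vectorized length of each extracted block, no iterrows() needed.
df["stem_len"] = df["stem"].str.len()

# For comparison: findall yields a list per row, and turning that list into
# a string adds "['" and "']", which is where the 4 extra characters come from.
wrapped = df["Str"].str.findall(r"[(][(.)]+[)]").map(str)
wrapped_len = wrapped.map(len)

print(df)
print(wrapped_len.tolist())  # [10, 13], i.e. 4 more than stem_len
```

Because `stem` holds plain strings rather than lists, the column stays usable for later processing without stripping any characters.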

zipa

That works. Thank you – sloth14 Oct 21 '19 at 16:25