
This is a follow-up to this Stack Overflow question:

Pandas: How to return rows where a column has a line breaks/new line ( \n ) with one of several case-sensitive words coming directly after?

That question gives a solution for returning rows that contain one of several case-sensitive words directly following a newline '\n'.

I would now like to return rows that contain at least a minimum number of these case-sensitive words, each directly following a newline.

In the minimal example below, I am trying to get the rows that contain at least three strings from a particular set.

import pandas as pd

testdf = pd.DataFrame([
    [' generates the final summary. \nRESULTS \nMethods We evaluate the performance of '],
    ['the cat and bat \n\n\nRESULTS\n BACKGROUND teamed up to find some food'],
    ['anthropology with RESULTS \n\n\nMETHODS\n pharmacology and biology'],
    [' generates the final summary. \nMethods \nBACKGROUND We evaluate the performance of '],
    ['the cat and bat \n\n\nMETHODS\n teamed up to find some food'],
    ['anthropology with METHODS pharmacology and biology'],
    [' generates the final summary. \nBACKGROUND We evaluate the performance of '],
    ['the cat and bat \n\n\nBackground\n teamed up to find some food'],
    ['anthropology with \nBACKGROUND with \nRESULTS pharmacology and biology'],
    [' generates the final summary. \nBACKGROUND We \nRESULTS  evaluate \nCONCLUSIONS the performance of '],
])
testdf.columns = ['A']
testdf.head(10)

Returns

A
0   generates the final summary. \nRESULTS \nMethods We evaluate the performance of
1   the cat and bat \n\n\nRESULTS\n BACKGROUND teamed up to find some food
2   anthropology with RESULTS \n\n\nMETHODS\n pharmacology and biology
3   generates the final summary. \nMethods \nBACKGROUND We evaluate the performance of
4   the cat and bat \n\n\nMETHODS\n teamed up to find some food
5   anthropology with METHODS pharmacology and biology
6   generates the final summary. \nBACKGROUND We evaluate the performance of
7   the cat and bat \n\n\nBackground\n teamed up to find some food
8   anthropology with \nBACKGROUND with \nRESULTS pharmacology and biology
9   generates the final summary. \nBACKGROUND We \nRESULTS evaluate \nCONCLUSIONS the performance of

And then

listStrings = { '\nRESULTS',  '\nMETHODS' ,  '\nBACKGROUND' , '\nCONCLUSIONS', '\nEXPERIMENT'}
testdf.loc[testdf.A.apply(lambda x: len(listStrings.intersection(x.split())) >= 3)]

This returns nothing.

The desired result would return only the last row:

9   generates the final summary. \nBACKGROUND We \nRESULTS evaluate \nCONCLUSIONS the performance of

That is the only row containing at least three of the specified case-sensitive words, each directly following a newline.

SantoshGupta7
  • Thanks!!!!! Learned a lot today. I think I'll have to give the check to @wenyoben, who was technically ahead (by 18 seconds) this time and also provided answers to the other questions – SantoshGupta7 Jun 17 '19 at 03:28

2 Answers


Use str.findall with the joined patterns and count the matches in each row:

testdf[testdf.A.str.findall('|'.join(listStrings)).str.len()>=3]
                                                   A
9   generates the final summary. \nBACKGROUND We ...
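
As a hedged illustration of why the attempt in the question returns nothing while the findall approach works: calling split() with no arguments splits on any whitespace, including '\n', so no token can still carry its leading newline. The sketch below uses re.findall directly on one string, which is what Series.str.findall applies to each element; the variable names tokens, overlap and matches are only for illustration.

```python
import re

listStrings = {'\nRESULTS', '\nMETHODS', '\nBACKGROUND', '\nCONCLUSIONS', '\nEXPERIMENT'}

# The last row of the example frame.
text = ' generates the final summary. \nBACKGROUND We \nRESULTS  evaluate \nCONCLUSIONS the performance of '

# split() with no arguments splits on all whitespace, newlines included,
# so the tokens are 'BACKGROUND', 'RESULTS', ... without the leading '\n'.
tokens = text.split()
overlap = listStrings.intersection(tokens)
print(overlap)  # set()

# Searching the raw string keeps the newline as part of each match.
pattern = '|'.join(listStrings)
matches = re.findall(pattern, text)
print(len(matches))  # 3
```

Because the search runs over the unsplit string, the newline stays part of each match, which is what the set members require.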
BENY

Use str.findall:

>>> testdf[testdf['A'].str.findall('|'.join(listStrings)).map(len)>=3]
                                                   A
9   generates the final summary. \nBACKGROUND We ...
>>> 
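
One caveat worth sketching: findall counts every occurrence, so a row that repeats the same word three times would also pass the >= 3 test. If "at least three" is meant as three different words from the set (an assumption about the intent, not something the question states outright), the matches can be deduplicated before counting. The frame demo and the names found, total_hits and distinct_hits below are hypothetical, and re.escape is added only as a precaution in case a search string ever contains regex metacharacters.

```python
import re
import pandas as pd

listStrings = {'\nRESULTS', '\nMETHODS', '\nBACKGROUND', '\nCONCLUSIONS', '\nEXPERIMENT'}

# Hypothetical two-row frame: row 0 repeats one word, row 1 has three different words.
demo = pd.DataFrame({'A': [
    'a \nRESULTS b \nRESULTS c \nRESULTS d',
    'a \nBACKGROUND b \nRESULTS c \nCONCLUSIONS d',
]})

# Escape each string so it is matched literally.
pattern = '|'.join(re.escape(s) for s in listStrings)
found = demo['A'].str.findall(pattern)

total_hits = found.str.len()                               # counts repeats
distinct_hits = found.map(lambda hits: len(set(hits)))     # counts unique words

result = demo[distinct_hits >= 3]
print(result.index.tolist())  # [1]
```

On the example data in the question no row repeats a word, so both counts select the same row 9.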
U13-Forward