This is a follow-up to this Stack Overflow question,
which gives a solution for returning rows that contain one of several case-sensitive words following a newline '\n'.
I would now like to return rows that contain a minimum number of these case-sensitive words following a newline.
In the minimal example below, I am trying to get rows that contain at least three strings from a particular set.
import pandas as pd

testdf = pd.DataFrame([
[ ' generates the final summary. \nRESULTS \nMethods We evaluate the performance of ', ],
[ 'the cat and bat \n\n\nRESULTS\n BACKGROUND teamed up to find some food'],
['anthropology with RESULTS \n\n\nMETHODS\n pharmacology and biology'],
[ ' generates the final summary. \nMethods \nBACKGROUND We evaluate the performance of ', ],
[ 'the cat and bat \n\n\nMETHODS\n teamed up to find some food'],
['anthropology with METHODS pharmacology and biology'],
[ ' generates the final summary. \nBACKGROUND We evaluate the performance of ', ],
[ 'the cat and bat \n\n\nBackground\n teamed up to find some food'],
['anthropology with \nBACKGROUND with \nRESULTS pharmacology and biology'],
[ ' generates the final summary. \nBACKGROUND We \nRESULTS evaluate \nCONCLUSIONS the performance of ', ]
])
testdf.columns = ['A']
testdf.head(10)
Returns
A
0 generates the final summary. \nRESULTS \nMethods We evaluate the performance of
1 the cat and bat \n\n\nRESULTS\n BACKGROUND teamed up to find some food
2 anthropology with RESULTS \n\n\nMETHODS\n pharmacology and biology
3 generates the final summary. \nMethods \nBACKGROUND We evaluate the performance of
4 the cat and bat \n\n\nMETHODS\n teamed up to find some food
5 anthropology with METHODS pharmacology and biology
6 generates the final summary. \nBACKGROUND We evaluate the performance of
7 the cat and bat \n\n\nBackground\n teamed up to find some food
8 anthropology with \nBACKGROUND with \nRESULTS pharmacology and biology
9 generates the final summary. \nBACKGROUND We \nRESULTS evaluate \nCONCLUSIONS the performance of
And then
listStrings = { '\nRESULTS', '\nMETHODS' , '\nBACKGROUND' , '\nCONCLUSIONS', '\nEXPERIMENT'}
testdf.loc[testdf.A.apply(lambda x: len(listStrings.intersection(x.split())) >= 3)]
This returns nothing.
The desired result would return only the last row:
9 generates the final summary. \nBACKGROUND We \nRESULTS evaluate \nCONCLUSIONS the performance of
That is the only row that contains at least 3 of the specified case-sensitive words following a newline.
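For reference, `x.split()` with no arguments splits on all whitespace, including '\n', so none of the resulting tokens can start with a newline and the intersection with the set is always empty. Below is a minimal sketch of one possible way to count the matches with a regular expression instead. It rests on two assumptions that the question does not settle: the keyword must come immediately after the '\n' (no intervening space), and repeated occurrences of the same keyword count once. The names `KEYWORDS`, `PATTERN`, `count_distinct_keywords`, and `sample` are hypothetical, and `sample` is a shortened stand-in for `testdf`.

```python
import re

import pandas as pd

# Keywords from the question, without the leading '\n' (the regex adds it).
KEYWORDS = ["RESULTS", "METHODS", "BACKGROUND", "CONCLUSIONS", "EXPERIMENT"]

# A newline immediately followed by one of the keywords as a whole word.
# Matching is case-sensitive because re.IGNORECASE is not set.
PATTERN = re.compile(r"\n(" + "|".join(map(re.escape, KEYWORDS)) + r")\b")


def count_distinct_keywords(text: str) -> int:
    """Count how many different keywords appear right after a newline."""
    return len(set(PATTERN.findall(text)))


# Shortened stand-in for testdf, covering one, two, and three matches.
sample = pd.DataFrame({"A": [
    "summary. \nRESULTS \nMethods We evaluate",
    "with \nBACKGROUND with \nRESULTS pharmacology",
    "summary. \nBACKGROUND We \nRESULTS evaluate \nCONCLUSIONS the performance of ",
]})

# Keep rows with at least three distinct keywords following a newline.
matches = sample.loc[sample["A"].apply(count_distinct_keywords) >= 3]
print(matches)
```

If every occurrence should count rather than every distinct keyword, dropping the `set(...)` call would give the total number of matches instead.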