Extract relevant rows from pandas dataframe when duplicate column values are present

Question

I have a pandas data frame as follows:

id	left	top	width	height	Text
1	12	34	12	34	commercial
2	99	42	99	42	general
3	1	47	9	4	liability
4	10	69	32	67	commercial
5	99	72	79	88	available

I want to extract specific rows based on the column value Text. So I want to search for certain keyphrases like liability commercial using re.search in the column Text and if I get a match then extract the rows i.e. 3rd and 4th row. So if the input is liability commercial then the output should be the following rows extracted:

id	left	top	width	height	Text
3	1	47	9	4	liability
4	10	69	32	67	commercial

Keep in mind that the column Text may contain duplicate values. So in the above case, there are 2 rows with the word commerial present.

Thanks in advance!

`Keep in mind that the column Text may contain duplicate values. So in the above case, there are 2 rows with the word commerial present.` - What is algo for match? Last match? — jezrael, Jul 06 '23 at 06:50
@jezrael It depends on the keyphrase. If I want to extract `liability commercial`, then the first `commercial` should not be extracted. If I want to extract `commercial general liability` then the second `commercial` should not be extracted. Basically depends on the keyphrase we are trying to extract! — spectre, Jul 06 '23 at 06:55

jezrael · Accepted Answer · 2023-07-06T07:04:07.227

1

Use:

phrase = 'liability commercial'

#match by substrings - splitted values by spaces
m = df['Text'].str.contains(phrase.replace(' ','|'))
#match by splitted values by spaces
m = df['Text'].isin(phrase.split())

#filter rows by mask and get last duplicated values in Text column
df = df[m].drop_duplicates(['Text'], keep='last')
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial

Or if need groups by matched rows by conditions change mask, here position of splitted values ad possible duplicates not counts:

phrase = 'liability commercial'
m = ~df['Text'].str.contains(phrase.replace(' ','|'))
#m = ~df['Text'].isin(phrase.split())

df = df[m.cumsum().duplicated(keep=False) & ~m]
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial

If need match by exactly matched by splitted values is possible modify this solution:

phrase = 'liability commercial'

#https://stackoverflow.com/a/49005205/2901002
pat = np.asarray(phrase.split())
N = len(pat)


def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['Text'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]

d = [i  for x in c for i in range(x, x+N)]
df = df[np.in1d(np.arange(len(arr)), d)]
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial

edited Jul 06 '23 at 07:04

answered Jul 06 '23 at 06:47

jezrael

822,522
95
1,334
1,252

I spent the last 24 hours trying all 3 of your solutions. The first one doesn't work outright because of the `keep = "first"/"last"`. The second solution also does not work in my case as it will extract all the duplicate words and it is impossible to separate the required ones. – spectre Jul 07 '23 at 06:00
@spectre - OK, so third solution working well? – jezrael Jul 07 '23 at 06:01
For the third solution I am testing multiple scenarios and so far it is working. My only issue is that I wanted to use `regex` to search for the phrase as it is more robust. Is there any way the third solution can be modified to use `re.search`? – spectre Jul 07 '23 at 06:02
@spectre - Can you give me some example for using `re.search` ? – jezrael Jul 07 '23 at 06:03
`match = re.search(r".?liability.?\s.?commercial.?", flags = re.DOTALL|re.IGNORECASE)`. Use this as an example. If there is a match then only go ahead with the third solution. – spectre Jul 07 '23 at 06:07
@spectre - OK, so is possible always know what number of rows are tested? E.g. for `match = re.search(r".?liability.?\s.?commercial.?")` are tested 2 consecutive rows? – jezrael Jul 07 '23 at 06:09
The words in `phrase` will always be consecutive. The number of rows tested is the total number of rows. So we will search all the rows of the dataframe for the phrase `liability commercial` using `re.search` and if there is a match then go ahead with the 3rd solution. You can use `match.group()` to get the string of the match term if you want – spectre Jul 07 '23 at 06:13
@spectre - yes, but I think `match = re.search(r".?liability.?\s.?commercial.?", flags = re.DOTALL|re.IGNORECASE)` - here need test 2 consecutive values - 1.+2.nd row, 2.+3rd row... – jezrael Jul 07 '23 at 06:23
@spectre - because if test 1. value, 1.+2.nd, then 1.+2.+3.rd, ... it is possible, but slow. How many rows is in DataFrame? – jezrael Jul 07 '23 at 06:26
`phrase = r".?liability.?\s.?commercial.?"` and then `match = re.search(phrase, ' '.join([i for i in d['text'].tolist() if i is not np.nan]), flags = re.DOTALL|re.IGNORECASE)`. You can use this to search for the phrase in the text colum and if match then do 3rd solution – spectre Jul 07 '23 at 06:31
@spectre Hmmm. If general solution it means is possible multiple words per value? Like instead `liability` is `any liability` ? – jezrael Jul 07 '23 at 06:39
@spectre - And for my comments above - is not possible distinguiush from phrase number of rows matched? E.g. `phrase = r".?liability.?\s.?commercial.?\s.?available.?"` should be in one value, in 2 values or in 3 values? – jezrael Jul 07 '23 at 06:41
If the phrase contains the words `any liability` then yes we have to search for the same exact words. – spectre Jul 07 '23 at 06:41
The number of words per row will most probably be one. So each row will contain 1 word – spectre Jul 07 '23 at 06:42
Also do not use `re.IGNORECASE`. Just use `re.DOTALL`! – spectre Jul 07 '23 at 06:53
OK, so is possible use distinguish number of consecutive words by `phrase = r".?liability.?\s.?commercial.?" N = len(phrase.split())` ? – jezrael Jul 07 '23 at 07:08
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254395/discussion-between-spectre-and-jezrael). – spectre Jul 07 '23 at 07:09

Extract relevant rows from pandas dataframe when duplicate column values are present

1 Answers1