1

I have a pandas data frame as follows:

id left top width height Text
1 12 34 12 34 commercial
2 99 42 99 42 general
3 1 47 9 4 liability
4 10 69 32 67 commercial
5 99 72 79 88 available

I want to extract specific rows based on the column value Text. So I want to search for certain keyphrases like liability commercial using re.search in the column Text and if I get a match then extract the rows i.e. 3rd and 4th row. So if the input is liability commercial then the output should be the following rows extracted:

id left top width height Text
3 1 47 9 4 liability
4 10 69 32 67 commercial

Keep in mind that the column Text may contain duplicate values. So in the above case, there are 2 rows with the word commerial present.

Thanks in advance!

spectre
  • 717
  • 7
  • 21
  • `Keep in mind that the column Text may contain duplicate values. So in the above case, there are 2 rows with the word commerial present.` - What is algo for match? Last match? – jezrael Jul 06 '23 at 06:50
  • @jezrael It depends on the keyphrase. If I want to extract `liability commercial`, then the first `commercial` should not be extracted. If I want to extract `commercial general liability` then the second `commercial` should not be extracted. Basically depends on the keyphrase we are trying to extract! – spectre Jul 06 '23 at 06:55

1 Answers1

1

Use:

phrase = 'liability commercial'

#match by substrings - splitted values by spaces
m = df['Text'].str.contains(phrase.replace(' ','|'))
#match by splitted values by spaces
m = df['Text'].isin(phrase.split())

#filter rows by mask and get last duplicated values in Text column
df = df[m].drop_duplicates(['Text'], keep='last')
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial

Or if need groups by matched rows by conditions change mask, here position of splitted values ad possible duplicates not counts:

phrase = 'liability commercial'
m = ~df['Text'].str.contains(phrase.replace(' ','|'))
#m = ~df['Text'].isin(phrase.split())

df = df[m.cumsum().duplicated(keep=False) & ~m]
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial

If need match by exactly matched by splitted values is possible modify this solution:

phrase = 'liability commercial'

#https://stackoverflow.com/a/49005205/2901002
pat = np.asarray(phrase.split())
N = len(pat)


def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['Text'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]

d = [i  for x in c for i in range(x, x+N)]
df = df[np.in1d(np.arange(len(arr)), d)]
print (df)
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I spent the last 24 hours trying all 3 of your solutions. The first one doesn't work outright because of the `keep = "first"/"last"`. The second solution also does not work in my case as it will extract all the duplicate words and it is impossible to separate the required ones. – spectre Jul 07 '23 at 06:00
  • @spectre - OK, so third solution working well? – jezrael Jul 07 '23 at 06:01
  • For the third solution I am testing multiple scenarios and so far it is working. My only issue is that I wanted to use `regex` to search for the phrase as it is more robust. Is there any way the third solution can be modified to use `re.search`? – spectre Jul 07 '23 at 06:02
  • @spectre - Can you give me some example for using `re.search` ? – jezrael Jul 07 '23 at 06:03
  • `match = re.search(r".?liability.?\s.?commercial.?", flags = re.DOTALL|re.IGNORECASE)`. Use this as an example. If there is a match then only go ahead with the third solution. – spectre Jul 07 '23 at 06:07
  • @spectre - OK, so is possible always know what number of rows are tested? E.g. for `match = re.search(r".?liability.?\s.?commercial.?")` are tested 2 consecutive rows? – jezrael Jul 07 '23 at 06:09
  • The words in `phrase` will always be consecutive. The number of rows tested is the total number of rows. So we will search all the rows of the dataframe for the phrase `liability commercial` using `re.search` and if there is a match then go ahead with the 3rd solution. You can use `match.group()` to get the string of the match term if you want – spectre Jul 07 '23 at 06:13
  • @spectre - yes, but I think `match = re.search(r".?liability.?\s.?commercial.?", flags = re.DOTALL|re.IGNORECASE)` - here need test 2 consecutive values - 1.+2.nd row, 2.+3rd row... – jezrael Jul 07 '23 at 06:23
  • @spectre - because if test 1. value, 1.+2.nd, then 1.+2.+3.rd, ... it is possible, but slow. How many rows is in DataFrame? – jezrael Jul 07 '23 at 06:26
  • `phrase = r".?liability.?\s.?commercial.?"` and then `match = re.search(phrase, ' '.join([i for i in d['text'].tolist() if i is not np.nan]), flags = re.DOTALL|re.IGNORECASE)`. You can use this to search for the phrase in the text colum and if match then do 3rd solution – spectre Jul 07 '23 at 06:31
  • @spectre Hmmm. If general solution it means is possible multiple words per value? Like instead `liability` is `any liability` ? – jezrael Jul 07 '23 at 06:39
  • @spectre - And for my comments above - is not possible distinguiush from phrase number of rows matched? E.g. `phrase = r".?liability.?\s.?commercial.?\s.?available.?"` should be in one value, in 2 values or in 3 values? – jezrael Jul 07 '23 at 06:41
  • If the phrase contains the words `any liability` then yes we have to search for the same exact words. – spectre Jul 07 '23 at 06:41
  • The number of words per row will most probably be one. So each row will contain 1 word – spectre Jul 07 '23 at 06:42
  • Also do not use `re.IGNORECASE`. Just use `re.DOTALL`! – spectre Jul 07 '23 at 06:53
  • OK, so is possible use distinguish number of consecutive words by `phrase = r".?liability.?\s.?commercial.?" N = len(phrase.split())` ? – jezrael Jul 07 '23 at 07:08
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254395/discussion-between-spectre-and-jezrael). – spectre Jul 07 '23 at 07:09