
I want to extract the substring between apple and each in a string. However, if each is followed by box, I want the result to be an empty string.

In detail, this means:

1) apple costs 5 dollars each -> costs 5 dollars

2) apple costs 5 dollars each box -> '' (empty string)

I tried re.findall(r'(?<=apple)(.*?)(?=each)', s), where s is the input string.

It handles case 1) but not case 2).

How can I solve this?

Thanks.

The fourth bird
Chan

2 Answers


You could add a negative lookahead after each, asserting that it is not directly followed by a space and box. If you only need the match, you can omit the capturing group.

(?<=apple).*?(?=each(?! box))
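
To see what the nested lookahead changes, here is a small sketch that runs the pattern from the question and the pattern above against both inputs. The names `original`, `fixed`, `samples`, and `results` are only illustrative.

```python
import re

# Pattern from the question: stops at the first "each", whatever follows it.
original = r'(?<=apple)(.*?)(?=each)'
# Pattern with the nested negative lookahead: "each" must not be followed by " box".
fixed = r'(?<=apple).*?(?=each(?! box))'

samples = ["apple costs 5 dollars each", "apple costs 5 dollars each box"]

# Map each input to (matches with the original pattern, matches with the fixed one).
results = {s: (re.findall(original, s), re.findall(fixed, s)) for s in samples}

for s, (before, after) in results.items():
    print(repr(s), before, after)
```

With the fixed pattern the second input produces no match at all, so `re.findall` returns an empty list rather than an empty string. Both patterns also keep the spaces around the extracted text.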


If you don't want to match the leading and trailing spaces, you could add them to the lookarounds:

import re
s = "apple costs 5 dollars each"
print(re.findall(r'(?<=apple ).*?(?= each(?! box))', s))

Output

['costs 5 dollars']
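
Since the question asks for an empty string in the box case, and `re.findall` gives an empty list when nothing matches, one possible way to wrap the pattern is sketched below. The helper name `extract` and the constant `PATTERN` are hypothetical, not part of the original answer.

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r'(?<=apple ).*?(?= each(?! box))')

def extract(text):
    """Return the text between 'apple ' and ' each', or '' if there is no valid match."""
    match = PATTERN.search(text)
    return match.group(0) if match else ""

print(repr(extract("apple costs 5 dollars each")))
print(repr(extract("apple costs 5 dollars each box")))
```

The first call returns `'costs 5 dollars'` and the second returns `''`.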

You can also use a capturing group instead of the positive lookbehind and lookahead, keeping only the negative lookahead. The value is then in the first capturing group.

You could make use of word boundaries \b to prevent apple and each from matching as part of a larger word.

\bapple\b(.*?)\beach\b(?! box)
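
A minimal sketch of how this variant could be used, assuming you want the captured value without the surrounding spaces. The names `WORD_PATTERN` and `extract_words` are made up for illustration.

```python
import re

# Capturing-group variant with word boundaries and only the negative lookahead.
WORD_PATTERN = re.compile(r'\bapple\b(.*?)\beach\b(?! box)')

def extract_words(text):
    """Return group 1 stripped of spaces, or '' when the pattern does not match."""
    match = WORD_PATTERN.search(text)
    return match.group(1).strip() if match else ""

print(repr(extract_words("apple costs 5 dollars each")))
print(repr(extract_words("apple costs 5 dollars each box")))
# "pineapple" does not count as "apple" because of the leading \b.
print(repr(extract_words("pineapple costs 5 dollars each")))
# "peach" does not end the match early because "each" needs a \b in front of it.
print(repr(extract_words("apple peach pie each")))
```

Note that `(?! box)` also rejects words that merely start with box, such as boxes. If that matters, `(?! box\b)` is one way to tighten it.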


The fourth bird

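Try this approach without regex:

myString = "apple costs 5 dollars each box"
myList = myString.split(" ")
storeString = []

for i, x in enumerate(myList):
    if x == "apple":
        continue
    elif x == "each":
        # Discard what was collected when "each" is followed by "box"
        if myList[i + 1:i + 2] == ["box"]:
            storeString = []
        break
    else:
        storeString.append(x)

# Join the collected words back into a single string
listToStr = ' '.join(storeString)
print(listToStr)

Output:

For this input the output is an empty line, because each is followed by box. With "apple costs 5 dollars each" it prints costs 5 dollars.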
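
Another regex-free option is `str.partition`, sketched here as a reusable function. The function name `between_without_regex` and its parameters are hypothetical. Unlike the word-by-word loop, this version matches plain substrings, so pineapple would count as containing apple.

```python
def between_without_regex(text, start="apple", end="each", blocked="box"):
    """Return the text between start and end, or '' if end is followed by blocked."""
    # Split once around the start marker; found_start is '' if it is missing.
    _, found_start, rest = text.partition(start)
    if not found_start:
        return ""
    # Split the remainder around the end marker.
    middle, found_end, tail = rest.partition(end)
    # Reject when the end marker is missing or the next word is the blocked one.
    if not found_end or tail.split()[:1] == [blocked]:
        return ""
    return middle.strip()

print(repr(between_without_regex("apple costs 5 dollars each")))
print(repr(between_without_regex("apple costs 5 dollars each box")))
```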

Alok Mishra