I have a string that looks like:
phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
I want to return a new string with certain words removed, only if they are not preceded by certain other words.
For example, the words I want to remove are:
c_out = ["avon", "powys", "somerset","hampshire"]
Only if they do not follow:
c_except = [r"on\s", r"dinas\s"]
Note: There could be multiple instances of words within c_out, and multiple instances of words within c_except.
First, I tried 'on\s' on its own:
import re
phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
regexp1 = re.compile(r'(?<!on\s)(avon|powys|somerset|hampshire)')
print("1st Result: ", regexp1.sub('', phrase))
1st Result: '5 road bradford on avon avon dinas north'
This correctly keeps the first 'avon', as it is preceded by 'on\s', and it correctly removes the third 'avon', but it fails to remove the second 'avon'.
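To narrow this down, here is a small diagnostic sketch (the variable name preceding is just something I made up for illustration). It prints the three characters directly before each 'avon', which is the window that (?<!on\s) looks at:

```python
import re

phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'

# For every 'avon' in the phrase, collect the three characters directly
# before it, i.e. the text that the lookbehind (?<!on\s) is compared against.
preceding = [phrase[max(m.start() - 3, 0):m.start()]
             for m in re.finditer(r'avon', phrase)]
print(preceding)  # ['on ', 'on ', 're ']
```

So the second 'avon' is preceded by 'avon ', which itself ends in 'on '. That would seem to explain why the lookbehind treats it the same as the first one.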
In the same way, for 'dinas\s':
phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
regexp2 = re.compile(r'(?<!dinas\s)(avon|powys|somerset|hampshire)')
print("2nd Result: ", regexp2.sub('', phrase))
2nd Result: '5 road bradford on dinas powys north '
This correctly keeps the first 'powys' and removes the second (note the double space in '... powys north').
I tried to combine the two expressions by doing:
regexp3 = re.compile(r'((?!on\s)|(?!dinas\s))(avon|powys|somerset|hampshire)')
print("3rd Result: ", regexp3.sub('', phrase))
3rd Result: 5 road bradford on dinas north
This incorrectly removed every word, and completely ignored 'on\s' or 'dinas\s'.
Then I tried:
regexp4 = re.compile(r'(?<!on\s|dinas\s)(avon|powys|somerset|hampshire)')
print("4th Result: ", regexp4.sub('', phrase))
And got:
error: look-behind requires fixed-width pattern
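As far as I can tell, the error comes from the two alternatives inside the lookbehind having different widths ('on\s' is 3 characters, 'dinas\s' is 6). Below is a minimal sketch to check that assumption with the standard re module; the helper compiles and the 'ab|cd' pattern are made up purely for illustration:

```python
import re

def compiles(pattern):
    """Return True if `pattern` compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Alternatives of different widths inside one lookbehind (3 vs 6 characters).
mixed_width = compiles(r'(?<!on\s|dinas\s)avon')
# Alternatives that all have the same width (2 characters each).
same_width = compiles(r'(?<!ab|cd)avon')
# One lookbehind per alternative, each with its own fixed width.
chained = compiles(r'(?<!on\s)(?<!dinas\s)avon')

print(mixed_width, same_width, chained)  # False True True
```

If that reading is right, only the mixed-width alternation is rejected, and chaining one lookbehind per exception should at least compile.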
I want to end up with:
Result: '5 road bradford on avon dinas powys north '
I have had a look at:
Why is this not a fixed width pattern?
Python Regex Engine - "look-behind requires fixed-width pattern" Error
regex: string with optional parts
But to no avail.
What am I doing wrong?
From comments:
regexp5 = re.compile(r'(?<!on\s)(?<!dinas\s)(avon|powys|somerset|hampshire)')
print("5th Result: ", regexp5.sub('', phrase))
5th Result: 5 road bradford on avon avon dinas powys north
Again this misses the second avon.
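One variant that might be worth testing, assuming the problem is that 'avon ' ends in 'on ', is to put a word boundary (\b) in front of each exception so that only the whole word 'on' counts. This is only a sketch: regexp6, lookbehinds, words and cleaned are names I introduced here, and the final whitespace clean-up is my own addition:

```python
import re

phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
c_out = ["avon", "powys", "somerset", "hampshire"]
c_except = [r"on\s", r"dinas\s"]

# One negative lookbehind per exception; the leading \b requires the
# exception to start at a word boundary, so 'avon ' no longer counts as 'on '.
lookbehinds = ''.join(rf'(?<!\b{e})' for e in c_except)
# Whole-word alternation of the words to remove.
words = '|'.join(re.escape(w) for w in c_out)
regexp6 = re.compile(rf'{lookbehinds}\b(?:{words})\b')

result = regexp6.sub('', phrase)
# Collapse the runs of spaces left behind by the removed words.
cleaned = ' '.join(result.split())
print("6th Result: ", cleaned)  # 6th Result:  5 road bradford on avon dinas powys north
```

Building the pattern from the two lists keeps each lookbehind fixed-width on its own, so it should scale to more entries in c_except without hitting the fixed-width error.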