I'm trying to extract the type of fee attached to certain buildings ("EMBARGO", "HIPOTECA" or "USUFRUCTO") and the beneficiary of each fee (a bank, written as "Banco ___").
My regular expression is the following one:
(EMBARGO|HIPOTECA\W|USUFRUCTO).+?FAVOR DE[L ](.+?)[,.]
It captures the word "EMBARGO", "HIPOTECA" or "USUFRUCTO" and then looks for the phrase "FAVOR DE" in order to capture the bank that benefits from the fee, whose name ends at the next comma or period. Most of the time this works perfectly, but I've found problems when two of the three words ("EMBARGO", "HIPOTECA", "USUFRUCTO") appear in the same sentence: the regex takes the first keyword it finds, when it should take the one closest to "FAVOR DE".
To solve this, I tried using a negative lookahead:
(EMBARGO|HIPOTECA\W|USUFRUCTO)(?!.*(EMBARGO|HIPOTECA\W|USUFRUCTO)).+?FAVOR DE[L ](.+?)[,.]
This worked well when the different fees were separated by newlines (\n), but in most cases all the data is in a single paragraph, so it does not work.
Any ideas to solve this issue?
Text:
HIPOTECA DE 1000 EUROS A FAVOR DE BANCO XYZ, TAMBIÉN DISPONEMOS DE UN EMBARGO QUE EJECUTO SU USUFRUCTO A FAVOR DEL BANCO ABC, ...
What I got without using negative lookahead:
1) HIPOTECA, BANCO XYZ
2) EMBARGO, BANCO ABC
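For reference, here is a minimal way to reproduce this result. It assumes Python's re module (the regex flavor is my assumption here); the names TEXT, PATTERN and pairs are only illustrative.

```python
import re

# Sample text from above, kept on a single line
TEXT = ("HIPOTECA DE 1000 EUROS A FAVOR DE BANCO XYZ, TAMBIÉN DISPONEMOS DE UN "
        "EMBARGO QUE EJECUTO SU USUFRUCTO A FAVOR DEL BANCO ABC, ...")

# Original pattern, without any lookahead
PATTERN = re.compile(r"(EMBARGO|HIPOTECA\W|USUFRUCTO).+?FAVOR DE[L ](.+?)[,.]")

# strip() removes the character captured by \W after HIPOTECA and the
# space left in front of the bank name when the text says "DEL BANCO"
pairs = [(fee.strip(), bank.strip()) for fee, bank in PATTERN.findall(TEXT)]
print(pairs)
```

The second pair starts at EMBARGO because the lazy .+? is free to run across USUFRUCTO on its way to "FAVOR DE".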
What I got using negative lookahead:
1)
2) USUFRUCTO, BANCO ABC
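The lookahead behavior can be reproduced with the sketch below, again assuming Python's re module; KEYWORDS, extract, single_line and multi_line are illustrative names. Since . does not match \n by default, (?!.*keyword) only inspects the rest of the current line, which would explain why the pattern works when each fee is on its own line and drops the earlier fees when everything is in one paragraph.

```python
import re

TEXT = ("HIPOTECA DE 1000 EUROS A FAVOR DE BANCO XYZ, TAMBIÉN DISPONEMOS DE UN "
        "EMBARGO QUE EJECUTO SU USUFRUCTO A FAVOR DEL BANCO ABC, ...")

KEYWORDS = r"EMBARGO|HIPOTECA\W|USUFRUCTO"

# Same pattern as above: reject a keyword if another keyword follows on the line
PATTERN = re.compile(rf"({KEYWORDS})(?!.*({KEYWORDS})).+?FAVOR DE[L ](.+?)[,.]")

def extract(text):
    # Group 2 sits inside the negative lookahead and never captures anything,
    # so the bank name is group 3
    return [(m.group(1).strip(), m.group(3).strip()) for m in PATTERN.finditer(text)]

# Everything in one paragraph: only the last keyword passes the lookahead
single_line = extract(TEXT)

# Same text with the two fees on separate lines: both are found
multi_line = extract(TEXT.replace(", TAMBIÉN", ",\nTAMBIÉN"))

print(single_line)
print(multi_line)
```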
What I want:
1) HIPOTECA, BANCO XYZ
2) USUFRUCTO, BANCO ABC
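One possible direction, sketched under the assumption of Python's re module and checked only against the sample text above: instead of looking ahead to the end of the line, forbid any keyword from appearing in the gap between the captured keyword and "FAVOR DE". This technique is sometimes called a tempered dot; it is not part of my original attempts, and the names below are illustrative.

```python
import re

TEXT = ("HIPOTECA DE 1000 EUROS A FAVOR DE BANCO XYZ, TAMBIÉN DISPONEMOS DE UN "
        "EMBARGO QUE EJECUTO SU USUFRUCTO A FAVOR DEL BANCO ABC, ...")

KEYWORDS = r"EMBARGO|HIPOTECA\W|USUFRUCTO"

# (?:(?!keyword).)+? consumes one character at a time, but only while that
# position is not the start of another keyword, so a match starting at
# EMBARGO cannot run across USUFRUCTO to reach "FAVOR DE"
PATTERN = re.compile(rf"({KEYWORDS})(?:(?!{KEYWORDS}).)+?FAVOR DE[L ](.+?)[,.]")

# strip() removes the character captured by \W and the space after "DEL"
pairs = [(fee.strip(), bank.strip()) for fee, bank in PATTERN.findall(TEXT)]
print(pairs)
```

On this sample the attempt that starts at EMBARGO fails and the engine retries from USUFRUCTO, which is the keyword closest to "FAVOR DE". I have not tested it on other paragraphs.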
Thank you in advance.