
I'm trying to extract the type of fee attached to certain buildings ("EMBARGO", "HIPOTECA" or "USUFRUCTO") and the beneficiary of each fee (different banks, written as "Banco ___").

My regular expression is the following one:

(EMBARGO|HIPOTECA\W|USUFRUCTO).+?FAVOR DE[L ](.+?)[,.]

It matches the word "EMBARGO", "HIPOTECA" or "USUFRUCTO" and then looks for the phrase "FAVOR DE" in order to capture the bank that is the beneficiary of the fee, whose name ends at the first comma or period. Most of the time this works perfectly, but it fails when two of the three words ("EMBARGO", "HIPOTECA", "USUFRUCTO") appear in the same sentence: it captures the first keyword it finds, when it should capture the one closest to "FAVOR DE".

To solve this, I tried using a negative lookahead: `(EMBARGO|HIPOTECA\W|USUFRUCTO)(?!.*(EMBARGO|HIPOTECA\W|USUFRUCTO)).+?FAVOR DE[L ](.+?)[,.]`. This worked well when the different fees were separated by newlines (`\n`), but in most cases all the data is in a single paragraph, so it does not work.

Any ideas to solve this issue?

Text:

HIPOTECA DE 1000 EUROS A FAVOR DE BANCO XYZ, TAMBIÉN DISPONEMOS DE UN EMBARGO QUE EJECUTO SU USUFRUCTO A FAVOR DEL BANCO ABC, ...

What I got without using negative lookahead:

1) HIPOTECA, BANCO XYZ

2) EMBARGO, BANCO ABC

What I got using negative lookahead:

1)

2) USUFRUCTO, BANCO ABC

What I want:

1) HIPOTECA, BANCO XYZ

2) USUFRUCTO, BANCO ABC

Thank you in advance.

Pedro LC
  • Negate the left-hand delimiter, `(EMBARGO|HIPOTECA\W|USUFRUCTO)(?:(?!EMBARGO|HIPOTECA\W|USUFRUCTO).)+?FAVOR DE[L ](.+?)[,.]`. You need not just a simple lookahead, you need a [tempered greedy token](https://stackoverflow.com/a/37343088/3832970). – Wiktor Stribiżew Jun 08 '20 at 09:06
  • It's probably easier and more readable to first split the string on commas, and then match the `EMBARGO|HIPOTECA|USUFRUCTO` and `FAVOR DE[L ]` parts separately. – Thomas Jun 08 '20 at 09:07
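
The tempered greedy token from Wiktor Stribiżew's comment can be tried against the sample text from the question. This is only a minimal sketch: the question does not say which language is used, so Python's `re` module is an assumption, and the names `KEYWORDS`, `PATTERN`, `extract_fees` and `sample` are hypothetical. The regex itself is the one from the comment, with the keyword alternation factored out.

```python
import re

# Keyword alternation copied from the question's regex.
KEYWORDS = r"EMBARGO|HIPOTECA\W|USUFRUCTO"

# Tempered greedy token: between the keyword and "FAVOR DE", each character
# is consumed only if no other keyword starts at that position.
PATTERN = re.compile(
    rf"({KEYWORDS})(?:(?!{KEYWORDS}).)+?FAVOR DE[L ](.+?)[,.]"
)


def extract_fees(text):
    """Return (fee type, beneficiary) pairs found in text."""
    pairs = []
    for fee, bank in PATTERN.findall(text):
        # "HIPOTECA\W" also captures one trailing non-word character, and
        # "DEL" leaves a leading space on the bank name, so tidy both up.
        pairs.append((re.sub(r"\W", "", fee), bank.strip()))
    return pairs


sample = (
    "HIPOTECA DE 1000 EUROS A FAVOR DE BANCO XYZ, TAMBIÉN DISPONEMOS DE UN "
    "EMBARGO QUE EJECUTO SU USUFRUCTO A FAVOR DEL BANCO ABC, ..."
)

print(extract_fees(sample))
# [('HIPOTECA', 'BANCO XYZ'), ('USUFRUCTO', 'BANCO ABC')]
```

Starting from "EMBARGO", the token stops as soon as "USUFRUCTO" begins, so that attempt fails and the engine moves on to match from "USUFRUCTO" instead, which is the keyword closest to "FAVOR DE".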

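Thomas's suggestion of splitting first and matching the two parts separately could look roughly like the sketch below. Python is again an assumption, and `KEYWORD`, `BENEFICIARY` and `extract_fees_by_clause` are hypothetical names. It splits on both commas and periods, because the original regex ends a bank name at either one, and it keeps the last keyword that appears before "FAVOR DE" in each clause.

```python
import re

KEYWORD = re.compile(r"EMBARGO|HIPOTECA|USUFRUCTO")
BENEFICIARY = re.compile(r"FAVOR DE[L ]")


def extract_fees_by_clause(text):
    """Return (fee type, beneficiary) pairs, one per clause at most."""
    pairs = []
    # A clause ends at a comma or a period, as in the original regex.
    for clause in re.split(r"[,.]", text):
        marker = BENEFICIARY.search(clause)
        if marker is None:
            continue  # no beneficiary in this clause
        # Only keywords that occur before "FAVOR DE" are candidates.
        keywords = KEYWORD.findall(clause[:marker.start()])
        if keywords:
            # The last candidate is the one closest to "FAVOR DE".
            pairs.append((keywords[-1], clause[marker.end():].strip()))
    return pairs


sample = (
    "HIPOTECA DE 1000 EUROS A FAVOR DE BANCO XYZ, TAMBIÉN DISPONEMOS DE UN "
    "EMBARGO QUE EJECUTO SU USUFRUCTO A FAVOR DEL BANCO ABC, ..."
)

print(extract_fees_by_clause(sample))
# [('HIPOTECA', 'BANCO XYZ'), ('USUFRUCTO', 'BANCO ABC')]
```

One caveat with this approach: a period or comma inside an amount (for example "1.000 EUROS") would split a clause in two and separate the keyword from its beneficiary, so the split pattern may need adjusting for real data.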