import re

input_text = "Ellos son grandes amigos, pronto ellos se convirtieron en mejores amigos. ellos se vieron en el parque antes de llevar ((PERS)los viejos gabinetes), ya que ellos eran aun útiles para la compañía. Ellos son algo peores que los nuevos modelos."

output_text = re.sub(r"(?<!\(\(PERS\)\s*los\s*)ellos", r"((PERS)ellos NO DATA)", input_text, flags=re.IGNORECASE)

print(repr(output_text)) # --> output

I need to replace every remaining occurrence of the substring "ellos" in the input string with ((PERS)ellos NO DATA), that is, only in those cases where no ((PERS)\s*los ) sequence appears before it.

Given the input string above, the program should convert it into this other string:

"((PERS)ellos NO DATA) son grandes amigos, pronto ((PERS)ellos NO DATA) se convirtieron en mejores amigos. ((PERS)ellos NO DATA) se vieron en el parque antes de llevar ((PERS)los viejos gabinetes), ya que ellos eran aun útiles para la compañía. Ellos son algo peores que los nuevos modelos."

The problem is that this code, which uses a look-behind, fails with this error: re.error: look-behind requires fixed-width pattern

This error happens because Python's re engine requires fixed-width look-behind patterns: the sub-pattern inside the look-behind must match a fixed number of characters. Since I use *, that condition is no longer met, and the error appears. What alternative could I use in this case to avoid this error and obtain the correct result?
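One possible workaround that stays within the built-in re module is to avoid the look-behind entirely: locate the first ((PERS)los marker, replace "ellos" only in the text before it, and append the rest unchanged. This is only a sketch. It assumes, based on the expected output above, that a marker anywhere earlier in the string blocks the replacement. The names `marker`, `cut`, `head` and `tail` are illustrative, and the `\b` word boundaries are my own addition so that words such as "aquellos" are not touched.

```python
import re

input_text = "Ellos son grandes amigos, pronto ellos se convirtieron en mejores amigos. ellos se vieron en el parque antes de llevar ((PERS)los viejos gabinetes), ya que ellos eran aun útiles para la compañía. Ellos son algo peores que los nuevos modelos."

# Assumption: any "((PERS)los" marker earlier in the string blocks the replacement,
# so only the text before the first marker is eligible.
marker = re.search(r"\(\(PERS\)\s*los\b", input_text, flags=re.IGNORECASE)
cut = marker.start() if marker else len(input_text)

# Split at the marker; no look-behind is needed after that.
head, tail = input_text[:cut], input_text[cut:]

# Replace whole-word "ellos" (any case) in the eligible part only.
output_text = re.sub(r"\bellos\b", "((PERS)ellos NO DATA)", head, flags=re.IGNORECASE) + tail

print(repr(output_text))
```

If no marker is present, `cut` falls back to the end of the string, so every "ellos" is replaced.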

And if I use this pattern instead

re.sub(r"ellos(?!((?<=\()\(PERS\)los ))", r"((PERS)ellos NO DATA)", input_text, flags=re.IGNORECASE)

it replaces absolutely all occurrences of the string "ellos", ignoring the restriction that it should replace only when there is no ((PERS) ) before it

getting this wrong output:

'Creo que ((PERS)los viejos gabinetes) estan en desuso, hay que hacer algo con ellos, ya que ellos son importantes. ((PERS)viejos gabinetes) quedaron en el deposito de ((PERS)viejos gabinetes). ((PERS)viejos gabinetes) ((PERS)los cojines) son acolchonados, ellos estan sobre el sofá. creo que ((PERS)cojines) estan sobre el sofá'
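A likely reason the second pattern matches everything: the negative look-ahead is evaluated right after "ellos", and it contains `(?<=\()`, which requires the previous character to be "(". At that position the previous character is always "s", so the inner pattern can never match and the negative look-ahead always succeeds. The short check below is a sketch of that reasoning; `sample` is a made-up string, not the real input.

```python
import re

# Hypothetical sample text, shortened from the question's input.
sample = "antes de llevar ((PERS)los viejos gabinetes), ya que ellos eran aun útiles. Ellos son peores."

# The pattern from the question: the look-ahead starts right after "ellos",
# where the preceding character is "s", so (?<=\() can never be satisfied.
pattern = r"ellos(?!((?<=\()\(PERS\)los ))"

restricted = [m.start() for m in re.finditer(pattern, sample, flags=re.IGNORECASE)]
unrestricted = [m.start() for m in re.finditer(r"ellos", sample, flags=re.IGNORECASE)]

# Both lists hold the same positions: the look-ahead filters nothing out.
print(restricted == unrestricted)
```

A look-ahead only inspects what follows the match, so it cannot express a condition about text that appears before "ellos".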
Matt095
  • In another question about variable-length lookbehinds, I think someone suggested using the `python-regex` library in place of the built-in `re`. – Barmar Feb 17 '23 at 22:42
  • @Barmar Yes, the `regex` module has an escape sequence `\K` which in theory works for these variable-width cases. However, I asked because it is easy to apply in theory, but implementing it in a real example leaves me with quite a few doubts – Matt095 Feb 17 '23 at 22:44
  • You use `regex`, the problems go away. – Wiktor Stribiżew Feb 17 '23 at 23:05
  • @WiktorStribiżew Do you know what pattern I should use for these cases? The one in the question you linked to doesn't seem to work for me – Matt095 Feb 17 '23 at 23:09
  • Doesn't your regex work? See [the demo](https://tio.run/##TZC/TsMwEIf3PMWRyUZRFlYqVEGEWFrUMkZCV3oEV45tzk7UPhQDK2sfLFwaqe3i85/PP3@@cEhf3t0Ng2mD5wRMDe0Bo0yyzLjQpfdE@wQzyCtrfYToHTSMbksRsDWNjwUE9i55oAkg@PCuN5wMyT6Qg5Z2ns98eQH7M0IWAvJ3R4AuCbolsJZ6ZFDqtVqt9XhD8J2UBjfGkVC6gAPCeGlKJPEC7Bwc/5KxkiKRCBZFqA14/D3@YAmXb6BtPAQ6qY0h477rqJfS@i3JssyzzHfpqgtMZew2inP1cH9Tq/okV@s63gouoz6Z5AUIMYlPaoslPM3f5lpOLl0t4NNiE2cS@vK8WK6qx/m60lkW2LikmAKrq9e1HoZ/). – Wiktor Stribiżew Feb 18 '23 at 13:13
  • This regex doesn't work well – Matt095 Feb 18 '23 at 17:33
