import re
txt = '<li>one. URL : <a href="http://local.ru">http://local.ru</a> (10.02.2022).</li><li>Two</li><li>Three. URL : <a href="https://local.ru">https://local.ru</a> (15.11.2021).</li>'
re.findall(r'(<li>.*?)\s?URL\s?:\s?(<a.*?>).*?(</a>.*?</li>)', txt)

I need to get this output:

[('<li>one.', '<a href="http://local.ru">', '</a> (10.02.2022).</li>'),
 ('<li>Three.', '<a href="https://local.ru">', '</a> (15.11.2021).</li>')]

It works if I remove the first capture group, but then the leading text is not returned.

Vadim Nva
  • Does this answer your question? [RegEx match open tags except XHTML self-contained tags](/q/1732348/90527) – outis Aug 11 '22 at 18:13

1 Answer


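Your regex is too permissive with .*? in the first group: it is lazy, but it can still run across </li><li>, so the match for the third item starts at <li>Two. If you restrict that group to non-tag characters with [^<>], you get the expected output.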

import re

txt = (
    '<li>one. URL : <a href="http://local.ru">http://local.ru</a> (10.02.2022).</li>'
    '<li>Two</li>'
    '<li>Three. URL : <a href="https://local.ru">https://local.ru</a> (15.11.2021).</li>'
    )

re.findall(r"(<li>[^<>]*?)\s?URL\s?:\s?(<a[^>]*?>).*?(</a>.*?</li>)", txt)

gives

[('<li>one.', '<a href="http://local.ru">', '</a> (10.02.2022).</li>'),
 ('<li>Three.', '<a href="https://local.ru">', '</a> (15.11.2021).</li>')]
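
To see the difference, here is a small side-by-side sketch that runs both patterns on the same input and compares only the first capture group. The variable names (loose, strict, loose_first_groups, strict_first_groups) are made up for this illustration.

```python
import re

txt = (
    '<li>one. URL : <a href="http://local.ru">http://local.ru</a> (10.02.2022).</li>'
    '<li>Two</li>'
    '<li>Three. URL : <a href="https://local.ru">https://local.ru</a> (15.11.2021).</li>'
)

# Pattern from the question: the lazy .*? in group 1 may still cross tag boundaries.
loose = re.compile(r'(<li>.*?)\s?URL\s?:\s?(<a.*?>).*?(</a>.*?</li>)')
# Pattern from this answer: [^<>] keeps group 1 inside a single <li>.
strict = re.compile(r'(<li>[^<>]*?)\s?URL\s?:\s?(<a[^>]*?>).*?(</a>.*?</li>)')

# findall returns one tuple per match; index 0 is the first capture group.
loose_first_groups = [groups[0] for groups in loose.findall(txt)]
strict_first_groups = [groups[0] for groups in strict.findall(txt)]

print(loose_first_groups)   # ['<li>one.', '<li>Two</li><li>Three.']
print(strict_first_groups)  # ['<li>one.', '<li>Three.']
```

The loose pattern swallows the whole <li>Two</li> item because nothing stops .*? from consuming < and >. This fix only holds for flat markup like the sample; for nested or irregular HTML a real parser is the safer choice.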
ljmc