I'm trying to extract the contents of an HTML list using awk. Some list entries are multi-line.
Example input list:
<ul>
<li>
<b>2021-07-21:</b> Lorem ipsum
</li>
<li>
<b>2021-07-19:</b> Lorem ipsum
</li>
<li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>
Command I'm using:
awk -v RS="" '{match($0, /<li>(.+)<\/li>/, entry); print entry[1]}' file.html
Current output:
<b>2021-07-21:</b> Lorem ipsum
</li>
<li>
<b>2021-07-19:</b> Lorem ipsum
</li>
<li><b>2021-07-10:</b> Lorem ipsum
Desired output:
<b>2021-07-21:</b> Lorem ipsum
<b>2021-07-19:</b> Lorem ipsum
<b>2021-07-10:</b> Lorem ipsum
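I know the issue is that the list entries are not separated by empty lines, so with RS="" the whole file is read as a single record and the greedy .+ matches from the first <li> to the last </li>. I thought of using non-greedy matching, but apparently awk doesn't support it. Is there a workaround?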
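One possible workaround, sketched here under a few assumptions, is to skip regex capture entirely: read the whole input into a buffer, then repeatedly locate the next <li> with match() and the first </li> after it with index(). Taking the first closing tag gives the effect of a non-greedy match. The function name extract_li and the variable names are made up for illustration. The sketch assumes plain <li> tags without attributes and no nested lists, and it uses only standard awk functions rather than the gawk-specific three-argument match().

```shell
# Hypothetical helper: reads HTML on stdin (or from file arguments)
# and prints the body of each <li>...</li> element.
extract_li() {
    awk '
    # Collect every input line into one buffer, keeping the newlines.
    { buf = buf $0 "\n" }

    END {
        # Find the next opening tag, then the first closing tag after it.
        while (match(buf, /<li>/)) {
            buf  = substr(buf, RSTART + RLENGTH)   # drop text up to and including <li>
            stop = index(buf, "</li>")             # first closing tag = shortest match
            if (stop == 0)
                break                              # unterminated entry: give up
            item = substr(buf, 1, stop - 1)
            gsub(/^[ \t\n]+|[ \t\n]+$/, "", item)  # trim surrounding whitespace and newlines
            print item
            buf = substr(buf, stop + 5)            # skip past </li> (5 characters)
        }
    }' "$@"
}

# Sample input taken from the question.
extract_li <<'EOF'
<ul>
<li>
<b>2021-07-21:</b> Lorem ipsum
</li>
<li>
<b>2021-07-19:</b> Lorem ipsum
</li>
<li><b>2021-07-10:</b> Lorem ipsum</li>
</ul>
EOF
```

With the sample list this should print the three entries from the desired output, one per line. To run it against the real file, call extract_li file.html. An entry whose body itself spans several lines keeps its internal newlines; only the leading and trailing whitespace is trimmed. For anything beyond simple, well-formed lists, a real HTML parser is likely the safer choice.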