I have a changes.html file and want to parse it in a bash script to get the most recent (topmost) list of changes:
. . .
<h1> Changes </h1>
<h2>
<a href="3/changes">#3</a>
</h2>
<ol>
<li>Recent Text line 1</li>
<li>Recent Text line 2</li>
</ol>
<h2>
<a href="2/changes">#2</a>
</h2>
<ol>
<li>Text line 1</li>
<li>Text line 2</li>
<li>Text line 3</li>
</ol>
<h2>
<a href="1/changes">#1</a>
</h2>
<ol>
<li>Text line 1</li>
<li>Text line 2</li>
</ol>
. . .
Expected output:
Recent Text line 1
Recent Text line 2
How can I do this in a bash script?
I've been trying a bash regexp, but I'm definitely doing something wrong:
changes_regex='(<ol><li>.*</li></ol>)?'
changes_list=$(< ~/Documents/outfile.html)
if [[ $changes_list =~ $changes_regex ]]; then
echo 'match'
n=${#BASH_REMATCH[*]}
while [[ $i -lt $n ]]; do
echo " capture[$i]: ${BASH_REMATCH[$i]}"
let i++
done
else
echo 'no match'
fi
The above script returns only:
match
capture[]:
capture[1]:
If I remove the parentheses and the trailing ? from the regexp (changes_regex='<ol><li>.*</li></ol>'), I get a greedy match.
How do I correctly build a regular expression that lazily matches only the contents of the first list?
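For reference, here is one possible direction, offered as a sketch under assumptions rather than a definitive fix. Bash's =~ operator uses POSIX extended regular expressions, which have no lazy quantifiers, and the pattern above also expects <ol><li> with no newline between the tags, so a line-oriented tool may be simpler than a single regexp. The sketch uses sed and assumes each <li>...</li> sits on its own line, as in the sample; first_changes is a hypothetical helper name, not something from the original script.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the text of every <li> in the FIRST <ol> of a file.
# Assumption: each <li>...</li> is on its own line, as in the sample above.
first_changes() {
  # -n suppresses automatic printing; only lines hit by the "p" flag are shown.
  # The address range covers the first <ol> ... </ol> block.
  sed -n '/<ol>/,/<\/ol>/{
    s/.*<li>\(.*\)<\/li>.*/\1/p
    /<\/ol>/q
  }' "$1"
}

# Example call (path taken from the script above):
#   first_changes ~/Documents/outfile.html
```

The q command stops sed at the first closing </ol>, so later lists are never read. For HTML that is not laid out one item per line, a real HTML parser would likely be more robust than this sketch.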