Bash regex - how to lazily parse a list from HTML file

Question

I have a changes.html file and want to parse it in a bash script to get the most recent (topmost) list of changes:

    . . .
<h1>        Changes        </h1>
<h2>
  <a href="3/changes">#3</a>
</h2>
<ol>
  <li>Recent Text line 1</li>
  <li>Recent Text line 2</li>
</ol>
<h2>
  <a href="2/changes">#2</a>
</h2>
<ol>
  <li>Text line 1</li>
  <li>Text line 2</li>
  <li>Text line 3</li>
</ol>
<h2>
  <a href="1/changes">#1</a>
</h2>
<ol>
  <li>Text line 1</li>
  <li>Text line 2</li>
</ol>
. . .

Expected output:

Recent Text line 1
Recent Text line 2

How can I do this in a bash script?

I've been trying a bash regexp, but I'm definitely doing something wrong:

changes_regex='(<ol><li>.*</li></ol>)?'
changes_list=$(< ~/Documents/outfile.html)

if [[ $changes_list =~ $changes_regex ]]; then
  echo 'match'
  n=${#BASH_REMATCH[*]}
  while [[ $i -lt $n ]]; do
      echo "  capture[$i]: ${BASH_REMATCH[$i]}"
      let i++
  done
else
  echo 'no match'
fi

The above script returns only:

match
  capture[]:
  capture[1]:

If I remove the parentheses from the regexp (changes_regex='<ol><li>.*</li></ol>'), I get a greedy match.

How do I build a regular expression that lazily matches only the contents of the first list?

Don't parse HTML with regex. Use syntax-aware tools. Provide the complete HTML file with proper syntax – Inian, Jan 16 '18 at 11:57
[Parsing HTML or XML via RegEx](https://stackoverflow.com/a/1732454/952747)... I'm not sure it's a good idea. Use an XPath utility. – masoud, Jan 16 '18 at 11:57

score 1 · Accepted Answer · answered Jan 16 '18 at 12:05


sed -n '/<ol>/,/<\/ol>/p; /<\/ol>/q' changes.html | sed -r 's/<li>(.*)<\/li>/\1/g;s/<.*//g'

Output (6th, 7th line):

  Recent Text line 1
  Recent Text line 2

Have I understood you correctly?
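The pipeline above can also be written as a single commented sed call. This is only a sketch, not part of the original answer: it assumes, as in the sample from the question, that each `<li>` item sits on its own line, and the temporary file is a hypothetical stand-in for changes.html.

```bash
#!/usr/bin/env bash
# Sample file mirroring the structure from the question (hypothetical temp file).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<h1>        Changes        </h1>
<h2>
  <a href="3/changes">#3</a>
</h2>
<ol>
  <li>Recent Text line 1</li>
  <li>Recent Text line 2</li>
</ol>
<h2>
  <a href="2/changes">#2</a>
</h2>
<ol>
  <li>Text line 1</li>
</ol>
EOF

# -n              : print nothing unless a command asks for it.
# /<ol>/,/<\/ol>/ : act only on lines from the first <ol> to its </ol>.
# s/.../\1/p      : strip indentation and the <li> wrapper, print the item.
# /<\/ol>/q       : quit at the first </ol>, so later lists are never read.
recent=$(sed -n '/<ol>/,/<\/ol>/{
  s/^[[:space:]]*<li>\(.*\)<\/li>.*$/\1/p
  /<\/ol>/q
}' "$sample")

printf '%s\n' "$recent"
rm -f "$sample"
```

Quitting at the first `</ol>` is what makes the extraction "lazy" here: sed stops reading before it ever reaches the older lists. Unlike the two-stage pipeline, this variant also drops the leading indentation and the empty lines left by the `<ol>` and `</ol>` tags.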


Viktor Khilin


This works, thanks a lot! I'll use it as a template. However, I'll probably check out other tools not to use regex for parsing html as suggested – Alexander Zhak Jan 16 '18 at 12:17

score 1 · Answer 2 · answered Jan 16 '18 at 18:19


Using xmllint and XPath to parse the HTML:

xmllint --html --xpath '//h2[a[@href="3/changes"]]/following-sibling::ol[1]/li' first.html | sed -re 's/<li>([a-zA-Z0-9 ]+)<\/li>/\1\n/g'
Recent Text line 1
Recent Text line 2
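If neither xmllint nor sed is an option, the literal question (a lazy match with bash's own `=~`) can be approximated, though not with a lazy quantifier: bash uses POSIX extended regular expressions, which have none. The sketch below is one possible workaround and not part of the original answer. It cuts the input down to the first list with parameter expansion, then matches one item at a time with a negated character class. It assumes the item text contains no nested tags, and the `html` string is a shortened stand-in for the real file contents.

```bash
#!/usr/bin/env bash
# Shortened sample input mirroring the structure from the question.
html='<h1>Changes</h1>
<h2>
  <a href="3/changes">#3</a>
</h2>
<ol>
  <li>Recent Text line 1</li>
  <li>Recent Text line 2</li>
</ol>
<h2>
  <a href="2/changes">#2</a>
</h2>
<ol>
  <li>Text line 1</li>
</ol>'

# Keep only the text between the first <ol> and the first </ol> after it.
rest=${html#*<ol>}          # '#' removes the shortest prefix ending in <ol>
first_list=${rest%%</ol>*}  # '%%' removes the longest suffix starting at </ol>

# [^<]* cannot cross a tag boundary, so each match is exactly one item.
item_regex='<li>([^<]*)</li>'
items=()
while [[ $first_list =~ $item_regex ]]; do
  items+=("${BASH_REMATCH[1]}")
  # Drop everything up to and including the item just matched.
  first_list=${first_list#*"${BASH_REMATCH[0]}"}
done

printf '%s\n' "${items[@]}"
```

This also hints at why the attempt in the question printed empty captures: a group followed by `?` is allowed to match the empty string, and the sample has a newline and indentation between `<ol>` and `<li>` that the pattern `<ol><li>` does not account for.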


LMC
