I have a changes.html file and want to parse it in a bash script to get the most recent (topmost) list of changes:

. . .
<h1>        Changes        </h1>
<h2>
  <a href="3/changes">#3</a>
</h2>
<ol>
  <li>Recent Text line 1</li>
  <li>Recent Text line 2</li>
</ol>
<h2>
  <a href="2/changes">#2</a>
</h2>
<ol>
  <li>Text line 1</li>
  <li>Text line 2</li>
  <li>Text line 3</li>
</ol>
<h2>
  <a href="1/changes">#1</a>
</h2>
<ol>
  <li>Text line 1</li>
  <li>Text line 2</li>
</ol>
. . .

Expected output:

Recent Text line 1
Recent Text line 2

How can I do this in a bash script?

I've been trying a bash regexp, but I'm definitely doing something wrong:

changes_regex='(<ol><li>.*</li></ol>)?'
changes_list=$(< ~/Documents/outfile.html)

if [[ $changes_list =~ $changes_regex ]]; then
  echo 'match'
  n=${#BASH_REMATCH[*]}
  while [[ $i -lt $n ]]; do
      echo "  capture[$i]: ${BASH_REMATCH[$i]}"
      let i++
  done
else
  echo 'no match'
fi

The above script returns only:

match
  capture[]:
  capture[1]:

If I remove the parentheses and the trailing ? from the regexp (changes_regex='<ol><li>.*</li></ol>'), I get a greedy match.

How do I correctly build a regular expression that lazily matches only the contents of the first list?
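
A note on the lazy-match part: `[[ ... =~ ... ]]` uses POSIX extended regular expressions, which have no non-greedy quantifiers, and an optional group such as `(...)?` is allowed to match the empty string, which is consistent with the empty captures shown above. One possible workaround, as a minimal sketch only, is to cut the text at the first `</ol>` with parameter expansion instead of a regex. The function name `first_list_items` and the inline `sample` are made up for illustration, and the sketch assumes plain `<ol>`, `<li>` tags without attributes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the text of each <li> found in the first
# <ol>...</ol> block of the string passed as $1.
# The body runs in a subshell ( ... ) so its variables do not leak out.
first_list_items() (
  html=$1
  ol_open='<ol>'
  ol_close='</ol>'
  li_open='<li>'
  li_close='</li>'

  html=${html%%"$ol_close"*}   # drop everything from the FIRST </ol> onward
  html=${html#*"$ol_open"}     # drop everything through the first <ol>

  while :; do
    case $html in
      *"$li_open"*"$li_close"*) ;;   # a complete <li>...</li> remains
      *) break ;;
    esac
    rest=${html#*"$li_open"}         # text after the next <li>
    item=${rest%%"$li_close"*}       # ...up to its closing </li>
    printf '%s\n' "$item"
    html=${rest#*"$li_close"}        # continue after that </li>
  done
)

# Inline copy of part of the sample, so the sketch runs without changes.html.
sample='<h1>        Changes        </h1>
<h2>
  <a href="3/changes">#3</a>
</h2>
<ol>
  <li>Recent Text line 1</li>
  <li>Recent Text line 2</li>
</ol>
<h2>
  <a href="2/changes">#2</a>
</h2>
<ol>
  <li>Text line 1</li>
</ol>'

first_list_items "$sample"
```

With a real file, the call would be something like `first_list_items "$(cat changes.html)"`. The `%%` form removes the longest matching suffix, which is what makes the cut land on the first `</ol>` rather than the last one.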

Alexander Zhak

2 Answers

sed -n '/<ol>/,/<\/ol>/p; /<\/ol>/q' changes.html | sed -r 's/<li>(.*)<\/li>/\1/g;s/<.*//g'

Output (6th, 7th line):

  Recent Text line 1
  Recent Text line 2

Have I understood you correctly?
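
The two-stage pipeline above leaves an empty line where the `<ol>` and `</ol>` tags were and keeps each item's leading indentation. A single-pass variant is sketched below, assuming each `<li>...</li>` sits on its own line as in the sample; the function name `first_changes` and the inline `sample` are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read HTML on stdin and print the text of every
# <li> inside the first <ol>...</ol> block, without indentation.
# Assumes one <li>...</li> per line, as in the sample.
first_changes() {
  sed -n '/<ol>/,/<\/ol>/{
    s/^[[:space:]]*<li>\(.*\)<\/li>[[:space:]]*$/\1/p
    /<\/ol>/q
  }'
}

# Inline copy of part of the sample, so the sketch runs without changes.html.
sample='<h2>
  <a href="3/changes">#3</a>
</h2>
<ol>
  <li>Recent Text line 1</li>
  <li>Recent Text line 2</li>
</ol>
<h2>
  <a href="2/changes">#2</a>
</h2>
<ol>
  <li>Text line 1</li>
</ol>'

printf '%s\n' "$sample" | first_changes
```

With a real file this would be `first_changes < changes.html`. The `p` flag prints only lines where the substitution succeeded, and `q` stops reading at the first `</ol>`, so later lists are never seen.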

Viktor Khilin
  • This works, thanks a lot! I'll use it as a template. However, I'll probably look into other tools so that I don't use regex for parsing HTML, as suggested. – Alexander Zhak Jan 16 '18 at 12:17

Using xmllint and XPath to parse the HTML:

xmllint --html --xpath '//h2[a[@href="3/changes"]]/following-sibling::ol[1]/li' first.html | sed -re 's/<li>([a-zA-Z0-9 ]+)<\/li>/\1\n/g'
Recent Text line 1
Recent Text line 2
LMC