Read and Modify html using shell

Question

How can I read an HTML file and modify a tag in it?

For example, /var/www/html/test.html has the following content:

<h2>
   test1
</h2>
<h2>
   test2
</h2>
<h2>
   test3
</h2>

I need to iterate over the <h2> tags and add a name attribute to each one, numbered in order.

Requested result:

<h2 name="1">
  test1
</h2>
<h2 name="2">
  test2
</h2>
<h2 name="3">
  test3
</h2>

I tried:

file=/var/www/html/test.html
awk -v source_str="<h2>" -v repl_str="<h2 name=\"$count\">" '{
        gsub(source_str,repl_str)
          print
        }' $file > '/tmp/test1'
 mv '/tmp/test1' $file

Sadly, you asked for a regex to parse your HTML. [**Never** parse HTML or XML with a regex](https://stackoverflow.com/a/1732454/8344060) you might meet the pony. — kvantour, Jan 30 '19 at 14:03
Possible duplicate of [Get content between a pair of HTML tags using Bash](https://stackoverflow.com/q/21015587/608639) — jww, Jan 31 '19 at 00:36

score -1 · Answer 1 · answered Jan 30 '19 at 18:24

Using ruby with nokogiri to modify the document:

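ruby -rnokogiri -e '
  # Parse the file named by the first argument
  document = Nokogiri::HTML.parse(File.read(ARGV.shift))
  # Number the h2 elements in document order, starting at 1
  document.css("h2").each.with_index(1) do |h2, h2num|
    h2["name"] = h2num.to_s
  end
  puts document.to_html
' test.html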

That takes the HTML snippet and wraps it in HTML and BODY tags to make a document:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<h2 name="1">
   test1
</h2>
<h2 name="2">
   test2
</h2>
<h2 name="3">
   test3
</h2>
</body></html>

The unwanted wrapper lines (the DOCTYPE, the opening <html><body> line and the closing </body></html> line) can be removed by appending | sed '1,2d; $d' to the command.
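
For comparison, the awk attempt in the question fails because $count is never set or incremented: the shell expands it once, before awk starts, so every tag gets the same empty value. Below is a minimal sketch that keeps the counter inside awk instead. It is still regex-based, so it assumes every opening tag is written exactly as <h2> (lowercase, no attributes), as in the sample; for arbitrary HTML a real parser is the safer choice. The function name number_h2 is made up for illustration.

```shell
#!/bin/sh
# Number each <h2> opening tag: <h2> becomes <h2 name="1">, <h2 name="2">, ...
# Assumption: every opening tag is written exactly as "<h2>".
number_h2() {
    awk '{
        # sub() replaces only the first match on the line, so loop
        # until no plain <h2> is left; n counts the tags seen so far.
        while (sub(/<h2>/, "<h2 name=\"" (n + 1) "\">")) n++
        print
    }' "$1"
}

# Demo on a throwaway file holding part of the sample from the question.
tmp=$(mktemp)
printf '<h2>\n   test1\n</h2>\n<h2>\n   test2\n</h2>\n' > "$tmp"
number_h2 "$tmp"
rm -f "$tmp"
```

To update the file itself, redirect the output to a temporary file and mv it over the original, as the question's attempt already does.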
