An HTML document can have multiple tags like the ones below:

<h2>
   <a id="id1" name="name1"></a>
   test1
</h2>
<h2>
   <a id="id2" name="name2"></a>
   test2
</h2>

I am iterating over all the <h2> tags in the document with awk to get the inner HTML of each <h2>, like this:

file='/var/www/html/test.html'

# RS="^$" makes awk read the whole file as one record (a GNU awk idiom)
awk -F" *</?h2> *\n?" -v RS="^$" '{
    for (i = 2; i <= NF; i += 2) {
        printf "%s", $i
        # TODO: parse $i to get the id and the text,
        # then store them, e.g. arr[id] = text
    }
}' "$file"

and I get output like this:

<a id="id1" name="name1"></a>
 test1
<a id="id2" name="name2"></a>
 test2

Now I want to parse the anchor inside the awk loop to get the id as the key and the description (for example, test1) as the value.

So that if I access the array as ${arr[@]} outside awk, I get something like the output below:

{'id1':'test1','id2':'test2'}
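
An awk array cannot be read directly from the shell, so one possible approach is to have awk print one id/text pair per line and let bash fill the array. The following is only a sketch under stated assumptions: bash 4+ (for `declare -A`), tags laid out on separate lines as in the sample above, and a temporary file standing in for /var/www/html/test.html. It reads the file line by line instead of using the whole-file record from the attempt above, so it does not depend on GNU awk.

```shell
#!/usr/bin/env bash
# Sketch: build a bash associative array (id -> heading text) from <h2> blocks.

# Hypothetical sample file standing in for /var/www/html/test.html.
file=$(mktemp)
cat > "$file" <<'EOF'
<h2>
   <a id="id1" name="name1"></a>
   test1
</h2>
<h2>
   <a id="id2" name="name2"></a>
   test2
</h2>
EOF

declare -A arr

# awk prints one "id<TAB>text" line per <h2> block. The while loop is fed by
# process substitution, so it runs in the current shell and arr survives it.
while IFS=$'\t' read -r id text; do
    arr["$id"]=$text
done < <(
    awk '
    /<h2>/   { in_h2 = 1; id = ""; text = ""; next }
    /<\/h2>/ { if (id != "") printf "%s\t%s\n", id, text; in_h2 = 0; next }
    in_h2 {
        line = $0
        # Assumption: the anchor sits on its own line, as in the sample.
        if (match(line, /id="[^"]*"/)) {
            id = substr(line, RSTART + 4, RLENGTH - 5)   # strip id=" and "
            next
        }
        sub(/^[ \t]+/, "", line)
        sub(/[ \t]+$/, "", line)
        if (line != "") text = (text == "" ? line : text " " line)
    }' "$file"
)

rm -f "$file"

# Print every key/value pair (associative arrays have no guaranteed order).
for id in "${!arr[@]}"; do
    printf '%s -> %s\n' "$id" "${arr[$id]}"
done
```

After the loop, `${arr[id1]}` expands to the text of the first heading and `${!arr[@]}` lists the ids. Piping awk into `while` instead of using process substitution would run the loop in a subshell, and the array would be empty afterwards.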
  • Sorry, it's not clear: should the output be in exactly the form `{'id1':'test1','id2':'test2'}`, or should it be a bash array? – RavinderSingh13 Dec 04 '18 at 07:06
  • Yes, a bash array is fine. – Vishnu Sharma Dec 04 '18 at 07:08
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Dec 04 '18 at 07:14
  • Cross posting is generally not recommended. You've just posted this on Unix.SE [How to extract html between tags?](https://unix.stackexchange.com/q/485828/112235) – Inian Dec 04 '18 at 07:27
