Parse HTML and store it in an associative array in awk

Asked Dec 04 '18 at 06:59

Active Dec 04 '18 at 07:53

Viewed 45 times

An HTML document can contain multiple <h2> tags like the ones below:

<h2>
   <a id="id1" name="name1"></a>
   test1
</h2>
<h2>
   <a id="id2" name="name2"></a>
   test2
</h2>

I am iterating over all the <h2> tags in the document to get the inner HTML of each <h2>, using awk as shown below:

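file='/var/www/html/test.html'

# RS="^$" makes gawk read the whole file as a single record;
# the field separator then splits that record on <h2> and </h2> tags.
awk -F" *</?h2> *\n?" -v RS="^$" '{
for(i=2;i<=NF;i+=2)
{
   printf "%s", $i
   # TODO: parse $i to get the id and the text
   # arr[id]=value   (need to do something here)
}
}' "$file"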

and I am getting output like this:

<a id="id1" name="name1"></a>
 test1
<a id="id2" name="name2"></a>
 test2

Now I want to parse the anchor inside the awk loop to get the id as the key and the description (for example, test1) as the value.

That way, if I access the array as ${arr[@]} outside awk, I should get output something like this:

{'id1':'test1','id2':'test2'}
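
An awk array lives only inside the awk process, so it cannot be read as ${arr[@]} from the shell afterwards. One possible workaround, sketched here under assumptions, is to have awk print one id/text pair per <h2> block and let bash load those pairs into its own associative array (bash 4 or later). The sketch assumes the tags sit on their own lines as in the sample above, that each <h2> holds exactly one <a id="..."> followed by plain text, and that neither value contains a tab. The variable name html_sample is hypothetical and stands in for the real file; the awk part uses a line-oriented scan rather than the RS="^$" approach from the question.

```shell
#!/usr/bin/env bash
# Sketch: build a bash associative array (bash 4+) from <h2> blocks.
# Assumes <h2>, <a ...> and the text each sit on their own line.

# Sample input from the question; html_sample is a hypothetical name.
html_sample='<h2>
   <a id="id1" name="name1"></a>
   test1
</h2>
<h2>
   <a id="id2" name="name2"></a>
   test2
</h2>'

declare -A arr

# awk prints one "id<TAB>text" line per <h2> block; the loop loads them.
while IFS=$'\t' read -r id text; do
    arr["$id"]=$text
done < <(printf '%s\n' "$html_sample" | awk '
    /<h2>/   { inside = 1; id = ""; text = ""; next }
    /<\/h2>/ { if (id != "") print id "\t" text; inside = 0; next }
    inside {
        # Pull the id attribute value, if this line has one.
        if (match($0, /id="[^"]*"/)) {
            id = substr($0, RSTART + 4, RLENGTH - 5)
            next
        }
        # Otherwise treat the trimmed line as the description text.
        line = $0
        gsub(/^[ \t]+|[ \t]+$/, "", line)
        if (line != "") text = (text == "" ? line : text " " line)
    }
')

# Print every key/value pair; iteration order is not guaranteed.
for key in "${!arr[@]}"; do
    printf '%s=%s\n' "$key" "${arr[$key]}"
done
```

To run this against the real page, the printf pipeline could be replaced with awk reading "$file" directly. Matching attributes with a regular expression is fragile on arbitrary HTML, so this is only reasonable when the markup is as regular as the sample.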

edited Dec 04 '18 at 07:53

RavinderSingh13


asked Dec 04 '18 at 06:59

Vishnu Sharma

Sorry, it's not clear: should the output be in exactly the form `{'id1':'test1','id2':'test2'}`, or should it be in bash array form? – RavinderSingh13 Dec 04 '18 at 07:06
Yes, a bash array is fine. – Vishnu Sharma Dec 04 '18 at 07:08
[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Dec 04 '18 at 07:14
Cross posting is generally not recommended. You've just posted this on Unix.SE [How to extract html between tags?](https://unix.stackexchange.com/q/485828/112235) – Inian Dec 04 '18 at 07:27
