Match tag inside tag using bash

Question

I have this HTML:

<article class="article column large-12 small-12 article--nyheter">
    <a class="article__link" href="/nyheter/14343208/">
            
        <div class="article__content">
            <h2 class="article__title t54 tm24">Person har falt ned bratt terreng - luftambulanse er på vei</h2>
        </div>
    </a>
    
</article>
<article class="article column large-6 small-6 article--nyheter">
    <a class="article__link" href="/nyheter/14341466/">
            <figure class="image image__responsive" style="padding-bottom:42.075%;">

<img class="image__img lazyload" itemprop="image" title="" alt="" src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7" />

</figure>
        <div class="article__content">
            <h2 class="article__title t34 tm24">Vil styrke innsatsen mot vold i nære relasjoner</h2>
        </div>
    </a>
    
</article>

I want to get only those HTML tags, in this case article tags, that have a child img tag inside them.

I have this sed command

sed -n  '/<article class.*article--nyheter/,/<\/article>/p' onlyArticlesWithOutSpace.html > test.html

What I am trying to achieve now is to get only those article tags that have an img tag inside them.

The output I want is:

<article class="article column large-6 small-6 article--nyheter">
    <a class="article__link" href="/nyheter/14341466/">
            <figure class="image image__responsive" style="padding-bottom:42.075%;">

<img class="image__img lazyload" itemprop="image" title="" alt="" src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7" />

</figure>
        <div class="article__content">
            <h2 class="article__title t34 tm24">Vil styrke innsatsen mot vold i nære relasjoner</h2>
        </div>
    </a>

</article>

I cannot use any XML/HTML parser. I am just looking to use sed, grep, awk, etc.

`I want to get only those html tags, in this case article tags, which has a child img tag inside them`: `sed` is not the right tool for this. Use an HTML parser in `perl, python or php`. – anubhava, Nov 07 '21 at 17:00
Yes, I know that, but as I mentioned, I cannot use anything other than sed, awk, grep. – mohsinali1317, Nov 07 '21 at 17:26

F. Hauri - Give Up GitHub · Answer 1 · 2021-11-08T06:32:37.483

1

Beware: parsing XML with sed looks like a good idea, but it is not!

Thanks to Cyrus's comment for pointing to a good reference.

Anyway, you could try this:

sed -ne '/<article/{ :a; N; /<\/article/ ! ba ; /<img/p ; }'
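To make the logic of the one-liner easier to follow, here is a sketch of the same commands written one per line and wrapped in a shell function. The function name `print_articles_with_img` is hypothetical, and the sketch assumes that article tags are not nested and that each opening and closing tag sits on its own line, as in the sample.

```shell
# Hypothetical wrapper: same sed commands as the one-liner, one per line.
print_articles_with_img() {
  sed -n '/<article/{
# Label "a": start of the collecting loop.
:a
# Append the next input line to the pattern space.
N
# Loop back to "a" until the closing tag has been collected.
/<\/article/!ba
# Print the collected block only if it contains an img tag.
/<img/p
}' "$@"
}
```

It can be called with the file from the question, for example `print_articles_with_img onlyArticlesWithOutSpace.html > test.html`.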

edited Nov 08 '21 at 06:32

answered Nov 07 '21 at 17:13


Got this error sed: 2: "/
– mohsinali1317 Nov 07 '21 at 17:22

@mohsinali1317: With your example shown, there is no mistake. This is exactly why you don't use `sed`. Please [Don't Parse XML/HTML With Regex](https://stackoverflow.com/a/1732454/3776858). I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Nov 07 '21 at 18:31
@mohsinali1317 Take care to use **single** quotes around the script. This uses GNU `sed`. – F. Hauri - Give Up GitHub Nov 08 '21 at 06:33
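Where GNU `sed` is not available, the same block-collecting idea could be sketched in `awk`, which the question also allows. This is only a rough equivalent, not part of the original answer: the function name `articles_with_img` is hypothetical, and it assumes article tags are not nested and that a plain text match on `<img` is good enough.

```shell
# Hypothetical helper: print every <article>...</article> block containing "<img".
articles_with_img() {
  awk '
    # An opening tag starts a fresh buffer.
    /<article/ { buf = ""; inside = 1 }
    # While inside a block, append each line to the buffer.
    inside { buf = buf $0 "\n" }
    # At the closing tag, print the buffer only if it holds an img tag.
    inside && /<\/article>/ {
      if (buf ~ /<img/) printf "%s", buf
      inside = 0
    }
  ' "$@"
}
```

Usage would look like `articles_with_img onlyArticlesWithOutSpace.html > test.html`. Like the `sed` version, this matches text rather than parsing HTML, so it can break on markup that differs from the sample.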
