
I'm having trouble using a working PCRE regex in sed. I found a related topic, but unfortunately its solution doesn't work for me (I'm on Linux, not Mac OS X, in case that matters). I have some HTML from which I need to extract a specific part, which is not enclosed by a single pair of matching tags. The regex works according to some regex testing sites (like regex101 or regexr.com); however, when I use it with sed, I get the whole file instead of the wanted part.

My regex is:

/((<div id="main-content" class="wiki-content">)([\w\d\s\S]*))(<\/rdf:RDF>\n-->)/g

It matches everything from the specific div up to and including the RDF part that follows it.

The text I'm working with looks as follows (only the interesting part, including the blank lines; I cut the rest due to length, and this part appears only once in the file):

[...]
a href="#page-metadata-start" class="assistive">Go to start of metadata</a>
<div id="page-metadata-end" class="assistive"></div>



        <div id="main-content" class="wiki-content">

        <p><br/></p><p><br/></p><div class="panel conf-macro output-block" data-hasbody="true" data-macro-name="panel" style="border-color: #004237;border-width: 1px;"><div class="panelHeader" style="border-bottom-width: 1px;border-bottom-color: #004237;background-color: #004237;color: white;"><b>Inhalt</b></div><div class="panelContent">
<p> </p><div class="toc-macro client-side-toc-macro  conf-macro output-block" data-hasbody="false" data-headerelements="H1,H2,H3,H4,H5,H6,H7" data-macro-name="toc"> </div><p> </p>
</div></div><h1 id="id-01-Dokumentation-1EinstiegindiePlanung">1 Einstieg in die Planung</h1><p><br/></p><h2 id="id-01-Dokumentation-1.1Startseite">1.1 Startseite</h2><p>Nach der Anmeldung im System findet sich der User auf der Startseite wieder. Von hier aus gelangt er zur &quot;Planning Map&quot;.</p><p>Durch das Umschalten der Company auf die Counter Company kann der Planer die zuvor eingetragenen Werte kontrollieren. Diese erscheinen nach dem Umschalten in der zweiten<br/>Tabelle als negativer Wert auf dem gemappten IC Account.</p><p><br/></p><div class="table-wrap"><table class="wrapped confluenceTable"><colgroup><col/><col/><col/></colgroup><tbody><tr><th class="confluenceTh">Button</th><th class="confluenceTh">Aktion</th><th class="confluenceTh">Beschreibung</th></tr><tr><p>Alle Eintragungen werden auf der untersten Ebene (weißer Hintergrund) ausgeführt. Monate, die nicht mehr beplant werden können sind farblich hinterlegt. Ebenfalls farblich hinterlegt sind<br/>die Accounts, die hier nicht beplant werden können (IC / Capex und Rule belegte Accounts).</p><p><br/></p><div class="table-wrap"><table class="wrapped confluenceTable"><colgroup><col/><col/><col/></colgroup><tbody><tr><th class="confluenceTh">Button</th><th class="confluenceTh">Aktion</th><th class="confluenceTh">Beschreibung</th></tr><tr><td colspan="1" class="confluenceTd">Back</td><td colspan="1" class="confluenceTd">Sheet Wechsel</td><td colspan="1" class="confluenceTd">Zurück zum Sheet Planning Map</td></tr><tr><td colspan="1" class="confluenceTd">Refresh</td><td colspan="1" class="confluenceTd">Prozess ausführen</td><td colspan="1" class="confluenceTd">Sheet wird nochmals neu aufgebaut</td></tr></tbody></table></div>




        </div>

        <!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
         <rdf:Description
    rdf:about="https://confluence.example.org/confluence/display/KUN/01+-+Dokumentation"
    dc:identifier="https://confluence.example.org/confluence/display/KUN/01+-+Dokumentation"
    dc:title="01 - Dokumentation"
    trackback:ping="https://confluence.example.org/confluence/rpc/trackback/47022143"/>
</rdf:RDF>
-->
[...]

So when I try this regex on the mentioned test websites, it marks the part I need (between <div id="main-content" class="wiki-content"> and </rdf:RDF>\n-->).

But when using sed -r '/((<div id="main-content" class="wiki-content">)([\w\d\s\S]*))(<\/rdf:RDF>\n-->)/g' testfile.txt it shows me the complete file content instead of only the part I'm looking for (sed -E... produces the same).

I can't work out where my problem is, so any help would be very much appreciated. Also, I'm not a professional regex user...

1 Answer


I'm not sure how well this will work in your case, as I would probably need some more data, but this could work for you:

sed -ne '/<div id="main-content" class="wiki-content">/,/-->/{p}' file.html
Maxim Norin
  • That's it - hits it right on the spot. Thank you very much! – qualisartifex Sep 25 '17 at 19:47
  • Sorry, but I was fiddling around with other RegEx, too, so I didn't see the comment. Sure, it would be nice if you could explain to me, how it is working :-) – qualisartifex Oct 07 '17 at 14:54
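  • The `-n` flag tells sed not to print every line automatically, so only the lines printed explicitly appear in the output. `/<div id="main-content" class="wiki-content">/,/-->/` selects all lines between `<div id="main-content" class="wiki-content">` and `-->`, including these lines. It's like finding the first line, then finding the second line, and then taking everything in between, including these two lines. And `{p}` just prints all these lines.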
    – Maxim Norin Oct 08 '17 at 17:43
  • Thank you for the explanation - really appreciate it. – qualisartifex Oct 10 '17 at 07:33
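
For anyone who wants to try the range address from the answer on something small, here is a minimal sketch. The sample file content and the variable names `sample` and `result` are made up for illustration; only the sed command itself comes from the answer (written with `p` instead of `{p}`, which behaves the same here).

```shell
# Build a small, made-up sample file (a stand-in for the real HTML page).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div id="page-metadata-end" class="assistive"></div>
        <div id="main-content" class="wiki-content">
        <p>wanted content</p>
        </div>
        <!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
</rdf:RDF>
-->
<p>footer, not wanted</p>
EOF

# -n turns off automatic printing. The range /start/,/end/ selects the lines
# from the first line matching the start pattern through the next line
# matching the end pattern, and p prints only those lines.
result=$(sed -n '/<div id="main-content" class="wiki-content">/,/-->/p' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Note that sed works line by line, so both patterns are matched against single lines. The range ends at the first line containing `-->` after the start line, so this probably only gives the wanted result if no other HTML comment closes inside the main content.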