I'm having trouble using a working PCRE regex in sed. I found a related topic, but its solution doesn't work for me (I'm on Linux, not Mac OS X, in case that matters). I have some HTML from which I need to extract one specific part, which is not delimited by a matching pair of tags. The regex works according to regex testing sites (such as regex101 or regexr.com); however, when I use it with sed, I get the whole file instead of the wanted part.
My regex is:
/((<div id="main-content" class="wiki-content">)([\w\d\s\S]*))(<\/rdf:RDF>\n-->)/g
It matches the part starting at that specific div and collects everything up to and including the following RDF part.
The text I'm working with looks as follows (only the interesting part, including the blank lines; I cut the rest due to length, and this part appears only once in the file):
[...]
a href="#page-metadata-start" class="assistive">Go to start of metadata</a>
<div id="page-metadata-end" class="assistive"></div>
<div id="main-content" class="wiki-content">
<p><br/></p><p><br/></p><div class="panel conf-macro output-block" data-hasbody="true" data-macro-name="panel" style="border-color: #004237;border-width: 1px;"><div class="panelHeader" style="border-bottom-width: 1px;border-bottom-color: #004237;background-color: #004237;color: white;"><b>Inhalt</b></div><div class="panelContent">
<p> </p><div class="toc-macro client-side-toc-macro conf-macro output-block" data-hasbody="false" data-headerelements="H1,H2,H3,H4,H5,H6,H7" data-macro-name="toc"> </div><p> </p>
</div></div><h1 id="id-01-Dokumentation-1EinstiegindiePlanung">1 Einstieg in die Planung</h1><p><br/></p><h2 id="id-01-Dokumentation-1.1Startseite">1.1 Startseite</h2><p>Nach der Anmeldung im System findet sich der User auf der Startseite wieder. Von hier aus gelangt er zur "Planning Map".</p><p>Durch das Umschalten der Company auf die Counter Company kann der Planer die zuvor eingetragenen Werte kontrollieren. Diese erscheinen nach dem Umschalten in der zweiten<br/>Tabelle als negativer Wert auf dem gemappten IC Account.</p><p><br/></p><div class="table-wrap"><table class="wrapped confluenceTable"><colgroup><col/><col/><col/></colgroup><tbody><tr><th class="confluenceTh">Button</th><th class="confluenceTh">Aktion</th><th class="confluenceTh">Beschreibung</th></tr><tr><p>Alle Eintragungen werden auf der untersten Ebene (weißer Hintergrund) ausgeführt. Monate, die nicht mehr beplant werden können sind farblich hinterlegt. Ebenfalls farblich hinterlegt sind<br/>die Accounts, die hier nicht beplant werden können (IC / Capex und Rule belegte Accounts).</p><p><br/></p><div class="table-wrap"><table class="wrapped confluenceTable"><colgroup><col/><col/><col/></colgroup><tbody><tr><th class="confluenceTh">Button</th><th class="confluenceTh">Aktion</th><th class="confluenceTh">Beschreibung</th></tr><tr><td colspan="1" class="confluenceTd">Back</td><td colspan="1" class="confluenceTd">Sheet Wechsel</td><td colspan="1" class="confluenceTd">Zurück zum Sheet Planning Map</td></tr><tr><td colspan="1" class="confluenceTd">Refresh</td><td colspan="1" class="confluenceTd">Prozess ausführen</td><td colspan="1" class="confluenceTd">Sheet wird nochmals neu aufgebaut</td></tr></tbody></table></div>
</div>
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
<rdf:Description
rdf:about="https://confluence.example.org/confluence/display/KUN/01+-+Dokumentation"
dc:identifier="https://confluence.example.org/confluence/display/KUN/01+-+Dokumentation"
dc:title="01 - Dokumentation"
trackback:ping="https://confluence.example.org/confluence/rpc/trackback/47022143"/>
</rdf:RDF>
-->
[...]
When I try this regex on the test websites mentioned above, it marks the part I need (from <div id="main-content" class="wiki-content"> up to and including </rdf:RDF>\n-->).
But when I run
sed -r '/((<div id="main-content" class="wiki-content">)([\w\d\s\S]*))(<\/rdf:RDF>\n-->)/g' testfile.txt
it shows me the complete file content instead of only the part I'm looking for (sed -E... produces the same).
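A minimal reproduction of the sed behaviour, assuming GNU sed and a made-up sample in place of my real file (the sample lines and variable names are only for illustration):

```shell
# Made-up stand-in for testfile.txt (not my real data)
sample_text=$(printf '%s\n' \
  'before' \
  '<div id="main-content" class="wiki-content">' \
  '<p>content</p>' \
  '</div>' \
  '<!--' \
  '</rdf:RDF>' \
  '-->' \
  'after')
sample_file=$(mktemp)
printf '%s\n' "$sample_text" > "$sample_file"

# Exactly the sed command from above, run on the sample
sed_out=$(sed -r '/((<div id="main-content" class="wiki-content">)([\w\d\s\S]*))(<\/rdf:RDF>\n-->)/g' "$sample_file")
rm -f "$sample_file"

printf '%s\n' "$sed_out"
```

With this sample, every line comes back unchanged, including "before" and "after", which is the same thing I see with the real file.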
I can't work out where the problem is, so any help would be very much appreciated. Also, I'm not an experienced regex user.