I've crafted a series of regex statements using sed in bash to parse HTML. I'm aware this isn't recommended, but to be honest, this is a temporary fix and I'm not looking to do anything (too) complicated.
Anytime this pattern is matched:
<div class="section1-header">
<div class="section1-number">GROUP 1 ARBITRARY CONTENT</div>
<div class="section1-title">GROUP 2 ARBITRARY CONTENT</div></div>
It should be replaced with:
<h2>GROUP 1 ARBITRARY CONTENT - GROUP 2 ARBITRARY CONTENT</h2>
And this is repeated for section[1-3]-header, with h[2-4] tags.
sed -Ei 's/[<]div class=\"section1-header\"[>][<]div class=\"section1-number\"[>](.*?)[<]\/div[>][<]div class=\"section1-title\"[>](.*?)[<]\/div[>][<]\/div[>]/<h2>\1 - \2<\/h2>/g' ${1}
sed -Ei 's/[<]div class=\"section2-header\"[>][<]div class=\"section2-number\"[>](.*?)[<]\/div[>][<]div class=\"section2-title\"[>](.*?)[<]\/div[>][<]\/div[>]/<h3>\1 - \2<\/h3>/g' ${1}
sed -Ei 's/[<]div class=\"section3-header\"[>][<]div class=\"section3-number\"[>](.*?)[<]\/div[>][<]div class=\"section3-title\"[>](.*?)[<]\/div[>][<]\/div[>]/<h4>\1 - \2<\/h4>/g' ${1}
I tested my regex on several online regex testers, and every instance I need is matched correctly, with no additional content captured. When I actually run it through sed, however, it seemingly at random grabs more than necessary (even though the regex tester matched the correct sequence of characters, lazy-style).
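A minimal reproduction of the difference, for what it's worth. This is only a sketch: it assumes GNU sed, and the one-line sample and variable names are made up for illustration. As far as I understand, most online testers use PCRE-style engines where `.*?` is lazy, whereas `sed -E` uses POSIX extended regular expressions, which have no lazy quantifiers; GNU sed accepts `.*?` but appears to treat it as an optional greedy `.*`.

```shell
# Made-up one-line sample; assumes GNU sed.
sample='<b>one</b> and <b>two</b>'

# `.*?` is not lazy in POSIX ERE: GNU sed reads it as an optional greedy `.*`,
# so the match runs from the first <b> to the last </b> on the line.
ere_result=$(printf '%s\n' "$sample" | sed -E 's/<b>(.*?)<\/b>/[\1]/')

# A negated character class stops at the next `<` instead, which mimics the
# lazy result here only because the captured text contains no tags.
class_result=$(printf '%s\n' "$sample" | sed -E 's/<b>([^<]*)<\/b>/[\1]/')

printf '%s\n' "$ere_result" "$class_result"
```

With GNU sed, the first line printed should be `[one</b> and <b>two]` (greedy) and the second `[one] and <b>two</b>`, which is the same kind of overreach seen in the real output.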
Before with sample content:
<div class="section1-title">Archive Get Command (<span class="id">archive-get</span>)</div></div><div class="section-intro">WAL segments are required for restoring a <span class="postgres">PostgreSQL</span> cluster or maintaining a replica.</div><div class="section-body"><div class="section2"><a id="command-archive-get/category-command"></a><div class="section2-header"><div class="section2-number">2.1</div><div class="section2-title">Command Options</div></div><div class="section-body"><div class="section3"><a id="command-archive-get/category-command/option-archive-async"></a><div class="section3-header"><div class="section3-number">2.1.1</div><div class="section3-title">Asynchronous Archiving Option (<span class="id">--archive-async</span>)</div>
After with sample content:
<h2>2 - Archive Get Command (<span class="id">archive-get</span>)</div></div><div class="section-intro">WAL segments are required for restoring a <span class="postgres">PostgreSQL</span> cluster or maintaining a replica.</div><div class="section-body"><div class="section2"><a id="command-archive-get/category-command"></a><div class="section2-header"><div class="section2-number">2.1</div><div class="section2-title">Command Options</div></div><div class="section-body"><div class="section3"><a id="command-archive-get/category-command/option-archive-async"></a><div class="section3-header"><div class="section3-number">2.1.1</div><div class="section3-title">Asynchronous Archiving Option (<span class="id">--archive-async</span>)</h2>
If you look carefully, you'll notice that <h2> is substituted correctly before 2 - Archive Get Command, but the </div></div> that follows is not replaced with </h2>. Instead, the </h2> is thrown in after Asynchronous Archiving Option (<span class="id">--archive-async</span>).
At this point I'm thinking this might be some kind of multi-line processing issue with sed, but am stuck in the troubleshooting stage and am unsure where to go from here.
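One direction that avoids relying on lazy matching, sketched under several assumptions: GNU sed (for `\n` in patterns, replacements, and bracket expressions), section numbers that never contain tags, and titles that never contain `</div></div>` themselves. The function name `convert_headers` and the shortened sample are hypothetical, not taken from the real file. The idea is to swap every `</div></div>` for a single marker character (a newline, which cannot otherwise occur inside one input line), match each title with a negated character class up to that marker, and then restore the markers that were not consumed.

```shell
# Hypothetical helper: reads HTML on stdin, writes converted HTML to stdout.
# Assumes GNU sed.
convert_headers() {
  sed -E '
    # Stand in a newline for every "</div></div>".
    s|</div></div>|\n|g
    # Numbers are assumed tag-free ([^<]*); titles run up to the next marker.
    s|<div class="section1-header"><div class="section1-number">([^<]*)</div><div class="section1-title">([^\n]*)\n|<h2>\1 - \2</h2>|g
    s|<div class="section2-header"><div class="section2-number">([^<]*)</div><div class="section2-title">([^\n]*)\n|<h3>\1 - \2</h3>|g
    s|<div class="section3-header"><div class="section3-number">([^<]*)</div><div class="section3-title">([^\n]*)\n|<h4>\1 - \2</h4>|g
    # Put back any "</div></div>" that was not part of a header.
    s|\n|</div></div>|g
  '
}

# Shortened, made-up sample in the shape described above.
sample='<div class="section1-header"><div class="section1-number">2</div><div class="section1-title">Archive Get Command (<span class="id">archive-get</span>)</div></div><div class="section-intro">WAL</div><div class="section2-header"><div class="section2-number">2.1</div><div class="section2-title">Command Options</div></div>'

converted=$(printf '%s\n' "$sample" | convert_headers)
printf '%s\n' "$converted"
```

This writes to stdout rather than editing in place, so the result can be compared against the original before anything is overwritten.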
\1 - \2<\/h2>/g'
– Hossein.Kiani Nov 02 '18 at 07:54