I'm struggling to capture nested groups with a regular expression, or even to describe the problem. For example, I need a regular expression to parse markup similar to:
<section id="foo">
<title>Code about Bears</title>
<para>Words</para>
<para><emphasis>Python Code</emphasis></para>
<program language="py">import bears</program>
<para><emphasis>JavaScript Code</emphasis></para>
<program language="js">var bear = require('bears');</program>
<section id="bar">
<title>Code about Bear Cubs</title>
<para>Words</para>
<para><emphasis>Python Code</emphasis></para>
<program language="py">import cubs</program>
<para><emphasis>JavaScript Code</emphasis></para>
<program language="js">var cub = require('cubs');</program>
</section>
</section>
Ultimately I'd like to extract the code for a particular language, together with the title and id of the section it belongs to, so for Python:
Code about Bears: id=foo
import bears
Code about Bear Cubs: id=bar
import cubs
The difficulty is keeping <section id="bar"/> intact, as I always end up merging its contents into <section id="foo"/>. Imagine it containing a lot more nested sections and markup than this simple example.
I've made two separate attempts.
My first attempt was to extract only the code, and it works (FWIW, these patterns are used with PHP's preg_match_all):
/<emphasis>(.*) Code<\/emphasis>\s*<\/para>\s*<program ?(language="(.*)")?>\s*(.*)<\/program>/msU
But that simply extracts all code and loses the section context, both in terms of section title and id.
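To make this first attempt easier to reproduce, here is a minimal sketch ported to Python's re module. The port is an assumption of this example, not the original PHP: Python has no equivalent of PCRE's U modifier, so every quantifier is written in its lazy form by hand, and the names MARKUP, CODE_RE and python_code are made up for illustration.

```python
import re

# Sample markup from the question.
MARKUP = """<section id="foo">
<title>Code about Bears</title>
<para>Words</para>
<para><emphasis>Python Code</emphasis></para>
<program language="py">import bears</program>
<para><emphasis>JavaScript Code</emphasis></para>
<program language="js">var bear = require('bears');</program>
<section id="bar">
<title>Code about Bear Cubs</title>
<para>Words</para>
<para><emphasis>Python Code</emphasis></para>
<program language="py">import cubs</program>
<para><emphasis>JavaScript Code</emphasis></para>
<program language="js">var cub = require('cubs');</program>
</section>
</section>
"""

# Assumption: lazy quantifiers (*?, ??) stand in for PCRE's U modifier.
CODE_RE = re.compile(
    r'<emphasis>(.*?) Code</emphasis>\s*?</para>\s*?'
    r'<program ??(language="(.*?)")??>\s*?(.*?)</program>',
    re.MULTILINE | re.DOTALL,
)

# Each match is a tuple: (label, whole language attribute, language, code).
matches = CODE_RE.findall(MARKUP)

# Keep only the Python snippets; nothing in a match says which section it came from.
python_code = [code for _label, _attr, lang, code in matches if lang == "py"]

print(python_code)  # ['import bears', 'import cubs']
```

The list holds the right code, but neither the section title nor the id appears anywhere in the match tuples.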
My second attempt was to extract the sections first, but it does not work well:
/<section id="(.*)">\s*<title>(.*)<\/title>(.*)<\/section>/msU
It matches <section id="foo"> with the second-to-last </section> rather than separating out the inner section.
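The same behaviour can be reproduced with a minimal sketch in Python's re module. Again, the port is an assumption of this example (lazy quantifiers stand in for PCRE's U modifier), and MARKUP, SECTION_RE and sections are hypothetical names.

```python
import re

# Sample markup from the question.
MARKUP = """<section id="foo">
<title>Code about Bears</title>
<para>Words</para>
<para><emphasis>Python Code</emphasis></para>
<program language="py">import bears</program>
<para><emphasis>JavaScript Code</emphasis></para>
<program language="js">var bear = require('bears');</program>
<section id="bar">
<title>Code about Bear Cubs</title>
<para>Words</para>
<para><emphasis>Python Code</emphasis></para>
<program language="py">import cubs</program>
<para><emphasis>JavaScript Code</emphasis></para>
<program language="js">var cub = require('cubs');</program>
</section>
</section>
"""

# Assumption: lazy quantifiers (*?) stand in for PCRE's U modifier.
SECTION_RE = re.compile(
    r'<section id="(.*?)">\s*?<title>(.*?)</title>(.*?)</section>',
    re.MULTILINE | re.DOTALL,
)

# Each match is a tuple: (section id, title, body).
sections = SECTION_RE.findall(MARKUP)

for section_id, title, body in sections:
    # The lazy body stops at the first </section>, which belongs to "bar",
    # so the inner section is swallowed into the body of "foo".
    print(section_id, "|", title, "|", '<section id="bar">' in body)
```

Only one match comes back: the body of "foo" runs up to the closing tag of "bar", and the scan then resumes after that tag, where no further opening tag exists. The pattern has no way to count how many sections are open when it reaches a </section>.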