I'm struggling to capture nested groups, or even to describe this problem. For example, I want a regular expression that parses markup similar to:

<section id="foo">
  <title>Code about Bears</title>

  <para>Words</para>

  <para><emphasis>Python Code</emphasis></para>
  <program language="py">import bears</program>

  <para><emphasis>JavaScript Code</emphasis></para>
  <program language="js">var bear = require('bears');</program>

  <section id="bar">
    <title>Code about Bear Cubs</title>

    <para>Words</para>

    <para><emphasis>Python Code</emphasis></para>
    <program language="py">import cubs</program>

    <para><emphasis>JavaScript Code</emphasis></para>
    <program language="js">var cub = require('cubs');</program>
  </section>
</section>

Ultimately I'd like to extract a particular language, so for Python:

Code about Bears: id=foo
  import bears

Code about Bear Cubs: id=bar
  import cubs

The difficulty is keeping <section id="bar"/> intact, as I always end up merging its contents into <section id="foo"/>. Imagine the real markup containing a lot more nested sections and other markup than this simple example.

I've made two separate attempts.

The first attempt was to extract only the code, and it works (FWIW, these patterns are used with PHP's preg_match_all):

/<emphasis>(.*) Code<\/emphasis>\s*<\/para>\s*<program ?(language="(.*)")?>\s*(.*)<\/program>/msUg

But that simply extracts all code and loses the section context, both in terms of section title and id.

Second attempt was to extract sections first, but it does not work well:

/<section id="(.*)">\s*<title>(.*)<\/title>(.*)<\/section>/msUg

It matches <section id="foo"> with the second-to-last </section> (the one that closes the inner section) rather than treating the inner section as a separate match.
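
To make the desired result concrete, here is a minimal sketch of the extraction done with an XML parser rather than a regex. It is written in Python with the standard library's xml.etree.ElementTree, purely for illustration. It assumes the markup is well-formed XML with a single root <section>, and the names MARKUP and extract_code are hypothetical, not part of any existing code.

```python
import xml.etree.ElementTree as ET

# The sample markup from above (assumed to be well-formed XML).
MARKUP = """<section id="foo">
  <title>Code about Bears</title>

  <para>Words</para>

  <para><emphasis>Python Code</emphasis></para>
  <program language="py">import bears</program>

  <para><emphasis>JavaScript Code</emphasis></para>
  <program language="js">var bear = require('bears');</program>

  <section id="bar">
    <title>Code about Bear Cubs</title>

    <para>Words</para>

    <para><emphasis>Python Code</emphasis></para>
    <program language="py">import cubs</program>

    <para><emphasis>JavaScript Code</emphasis></para>
    <program language="js">var cub = require('cubs');</program>
  </section>
</section>"""


def extract_code(markup, language):
    """Return one (title, section id, [code, ...]) tuple per section."""
    root = ET.fromstring(markup)
    results = []
    # iter() yields the root itself and every nested <section>, in document order.
    for section in root.iter("section"):
        title = section.findtext("title", default="").strip()
        # findall("program") only sees direct children, so code that belongs
        # to a nested section is not merged into its parent section.
        code = [
            (program.text or "").strip()
            for program in section.findall("program")
            if program.get("language") == language
        ]
        results.append((title, section.get("id"), code))
    return results


for title, section_id, code in extract_code(MARKUP, "py"):
    print(f"{title}: id={section_id}")
    for snippet in code:
        print(f"  {snippet}")
    print()
```

Because each section is visited on its own and only its direct <program> children are read, <section id="bar"> stays separate from <section id="foo"> however deeply the sections nest.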

LookingToLearn
  • [*Don't use regex for this*](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), use an XML Parser. – Spencer Wieczorek Oct 18 '17 at 18:24
  • Find `[\u0020\u0043\u0045\u0048\u004D\u004F\u0053\u0300-\u0301\u0305\u030E-\u030F\u0311-\u0312\u0314\u0316\u0319-\u031A\u031D-\u031F\u0321\u0325\u0327-\u032F\u0331-\u0332\u0334\u0336-\u0339\u033D-\u0340\u0344-\u0345\u034A\u034C-\u034F\u0356\u0358-\u0359\u035B-\u035D\u035F\u0365\u0367-\u036A\u036C-\u036F\u2013]` replace "" –  Oct 18 '17 at 18:42
  • TIL that this is a thing here on StackOverflow :) – LookingToLearn Oct 18 '17 at 19:09
