1

I need a regex that matches the content of a <cherry> tag that is not nested inside another tag. Unfortunately, I can't use the PHP DOM parser because the content of the tag sometimes includes very special characters.

This is an example of the incoming input:

<cherry>test</cherry>
<banana>
    <cherry>test</cherry>
    some text
</banana>

This is my current regex, but it also matches the <cherry> tag inside the <banana> tag:

 (<cherry>)(.*?)(<\/cherry>)

How can I exclude the occurrence in other tags?

I have already tried a lot...

Jason Aller
BeDoke
  • Do you want to get the topmost `cherry` tags? Use DOM, it will be way easier. – Wiktor Stribiżew Sep 06 '17 at 07:57
    Check [the famous regex answer](https://stackoverflow.com/a/1732454/851432) – Jomoos Sep 06 '17 at 08:02
  • Hey Wiktor, thanks for your reply. I need the content of the first <cherry> tag that is not part of another tag such as the <banana> tag. I can't use the DOM parser because I have a lot of special characters in the complete string. – BeDoke Sep 06 '17 at 08:02
  • Why are special characters a problem when you use the DOM, what are these characters? Could you show a sample of your real document? – Casimir et Hippolyte Sep 06 '17 at 08:11
  • Possible content could look like this: A = \bigl\{ (x, y) \in \R \times \R \ | \ x^2 + y^2 < 1 \bigr\} B = \bigl\{ (x, y) \in \R \times \R \ | \ (x-a)^2 + (y-b)^2 < 1 \bigr\} \qquad (a, b \in \R) For which values of a and b are the sets '''A''' and '''B''' [[Durchschnittsmenge|disjoint]]? '''Hint:''' Consider the problem geometrically. – BeDoke Sep 06 '17 at 08:31
  • Do you need any structure from the document or just the contents? – Jakumi Sep 06 '17 at 08:56

2 Answers

2

Why don't you use the DOMDocument class rather than a regex? Simply load your DOM and then call getElementsByTagName to get your tags. This way you can skip any tags you don't want and fetch only those you do.

Example

<?php
$xml = <<< XML
<?xml version="1.0" encoding="utf-8"?>
<books>
 <book>Patterns of Enterprise Application Architecture</book>
 <book>Design Patterns: Elements of Reusable Software Design</book>
 <book>Clean Code</book>
</books>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml);
$books = $dom->getElementsByTagName('book');
foreach ($books as $book) {
    echo $book->nodeValue, PHP_EOL;
}
?>
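To restrict the result to the topmost tags from the question, an XPath query can stand in for getElementsByTagName, which returns matches at every depth. This is only a minimal sketch: it assumes the input is well-formed once wrapped in a root element, the `<root>` wrapper name is a placeholder, and the nested content was changed to "nested" so the two tags can be told apart.

```php
<?php
// Input modeled on the question; it has no single root element, so wrap it in one.
$input = <<<XML
<cherry>test</cherry>
<banana>
    <cherry>nested</cherry>
    some text
</banana>
XML;

$dom = new DOMDocument;
$dom->loadXML('<root>' . $input . '</root>'); // <root> is an assumed wrapper name

// /root/cherry matches only <cherry> elements that are direct children of
// the wrapper, so the one nested inside <banana> is skipped.
$xpath = new DOMXPath($dom);
$topLevel = [];
foreach ($xpath->query('/root/cherry') as $cherry) {
    $topLevel[] = $cherry->nodeValue;
}

print_r($topLevel);
```

This does not help when the tag content itself is not valid XML, which is the situation described in the comments.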

Reading Material

DOMDocument

Script47
0

This assumes that you only need the contents of the top-level math tags and nothing else, and that you haven't managed it so far because the math tags contain invalid XML, which makes any XML parser give up (as mentioned in the question and comments).

The clean approach would probably be to use a fault-tolerant XML parser (or a parser's fault-tolerant mode), or to Tidy up the input beforehand. However, all of these approaches might "corrupt" the content.

The hacky and possibly dirty approach is the following. It may well have other issues, especially if the remaining XML is also invalid or your math tags are nested (nesting will make the XML parser fail in step 2):

  1. Replace every <math>.*</math> match (ungreedy) with a placeholder via preg_replace_callback or similar. The placeholder should be unique: uniqid might help, but a simple counter is probably enough.
  2. Parse the document with a common XML parser (wrapping it in a root tag as necessary).
  3. Fetch all child nodes of the root node (or all root nodes) and check which ones were generated in step 1.

For example, given this input:

<math>some invalid xml</math>
<sometag>
    <math>more invalid xml</math>
    some text
</sometag>

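run the replacement with:

$replacements = [];
$newcontent = preg_replace_callback(
       // Match <math>...</math>; /s lets . span newlines, /U makes .* ungreedy.
       '/'.preg_quote('<math>','/').'(.*)'.preg_quote('</math>','/').'/siU',
       // Capture $replacements by reference so the stored entries survive the callback.
       function($hit) use (&$replacements) {
           $id = uniqid();
           $replacements[$id] = $hit[1];
           return '<math id="'.$id.'" />';
       },
       $originalcontent);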

which will turn your content into:

<math id="1stuniqid" />
<sometag>
    <math id="2nduniqid" />
    some text
</sometag>

Now use the XML parser of your choice, select all root-level elements, and look for /math/@id (my XPath is possibly just wrong, adjust as needed). The result should contain all uniqids, which you can look up in your $replacements array.
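The three steps can be sketched end to end as follows. This is a rough illustration rather than a tested solution for real documents: the sample input, the `<root>` wrapper name, the `m1`, `m2` counter ids, and the `/root/math/@id` path are all assumptions, and the sketch still fails if the markup outside the math tags is invalid.

```php
<?php
// Sample input (assumed): math content that is not valid XML.
$originalcontent = <<<'TXT'
<math>x^2 + y^2 < 1 & more</math>
<sometag>
    <math>a < b</math>
    some text
</sometag>
TXT;

// Step 1: swap each <math>...</math> for an empty placeholder element.
$replacements = [];
$counter = 0;
$newcontent = preg_replace_callback(
    '/<math>(.*?)<\/math>/s',
    function ($hit) use (&$replacements, &$counter) {
        $id = 'm' . (++$counter);      // simple counter instead of uniqid()
        $replacements[$id] = $hit[1];  // remember the raw, unparsed content
        return '<math id="' . $id . '" />';
    },
    $originalcontent
);

// Step 2: parse the now well-formed markup inside a wrapper root.
$dom = new DOMDocument;
$dom->loadXML('<root>' . $newcontent . '</root>');

// Step 3: collect the placeholders that are direct children of the wrapper
// and map their ids back to the original content.
$xpath = new DOMXPath($dom);
$topLevelMath = [];
foreach ($xpath->query('/root/math/@id') as $attr) {
    $topLevelMath[] = $replacements[$attr->value];
}

print_r($topLevelMath);
```

Because the math content never passes through the XML parser, it comes back byte for byte as it appeared in the input.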

Edit: fixed some preg_quote problems and switched to more standard delimiters.

Jakumi