I have a large XML file. While troubleshooting, I would like to extract specific nodes from it. I don't want a SimpleXML object; I want to write a new file containing the raw string for just the parts I want (I'm posting this under bash/sed/php).

<?xml version="1.0" encoding="UTF-8"?>
<definition>
    <metadata></metadata>
    <nodeToRegex>
        <nodeImightwant>
            <subnode>
                <subsubnode1></subsubnode1>
                <subsubnodeToCheck>stringCheck</subsubnodeToCheck>
                <subsubnode2></subsubnode2>
            </subnode>
        </nodeImightwant>
        <nodeImightwant></nodeImightwant>
        <nodeImightwant></nodeImightwant>
    </nodeToRegex>
</definition>

So from this XML file, I want every line from every node except nodeToRegex. From nodeToRegex, I want a nodeImightwant only if its subsubnodeToCheck value (stringCheck above) equals "aValidString". Can this be done with a regex, or should I just copy and paste the parts out of the file? (My regex skills are subpar.)

  • 1
    For a number of reasons (look up *Cthulhu regex*, for example), using regex to parse XML is simply not a good idea. It's unmaintainable and gets out of hand quickly. You're better off using one of the numerous well-tested XML parsing solutions readily available out there. – Etheryte Feb 13 '14 at 18:54

1 Answer

Don't parse XML with regexes. There is no reason you can't repackage/rearrange the data using SimpleXML, but trying to do it with a regex is a recipe for lots of headaches and, ultimately, broken code.
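As a rough illustration of the parser-based route, here is a minimal sketch using PHP's DOM extension (DOMDocument and DOMXPath) rather than SimpleXML, since it hands back the serialized string directly. The function name filterNodes and the file names in the usage note are hypothetical, and the element names are taken from the sample in the question. It assumes the whole file fits in memory; for a very large file a streaming reader such as XMLReader may be the better fit.

```php
<?php
// Sketch only: drop every <nodeImightwant> under <nodeToRegex> whose
// <subsubnodeToCheck> is not the wanted value, and keep everything else.
// Assumption: $wanted contains no double quotes (it is pasted into the XPath).
function filterNodes(string $xml, string $wanted = 'aValidString'): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    // Select the candidates that do NOT contain a matching check value.
    $query = sprintf(
        '//nodeToRegex/nodeImightwant[not(.//subsubnodeToCheck[normalize-space(.) = "%s"])]',
        $wanted
    );

    // Copy the result to an array so removals cannot disturb the iteration.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize the remaining tree back to a raw XML string.
    return $doc->saveXML();
}
```

A call might look like file_put_contents('filtered.xml', filterNodes(file_get_contents('big.xml'))); where both file names are placeholders.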

See this classic example for why parsing XML/HTML/XHTML with regexes is the road to madness.

If you insist on using a regex, just replace the nodes you don't want, like this:

$myxml = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~s', '', $myxml);
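One thing to watch with a pattern of this shape: in PCRE, `.` does not match a newline unless the `s` (dot-all) modifier is set, so a block that spans several lines is only removed when that modifier is present. The short sketch below shows the difference on a made-up multi-line string modelled on the question's sample.

```php
<?php
// Hypothetical multi-line input modelled on the question's sample.
$myxml = "<definition>\n<metadata></metadata>\n<nodeToRegex>\n"
       . "<nodeImightwant></nodeImightwant>\n</nodeToRegex>\n</definition>";

// Without "s", "." stops at each newline, so the multi-line block survives.
$withoutS = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~', '', $myxml);

// With "s" (dot-all), "." also matches newlines and the block is removed.
$withS = preg_replace('~<nodeToRegex>.*?</nodeToRegex>~s', '', $myxml);

var_dump($withoutS === $myxml);                    // bool(true): nothing was removed
var_dump(strpos($withS, 'nodeToRegex') === false); // bool(true): block is gone
```

Note that this removes the whole nodeToRegex block, including any nodeImightwant entries that would otherwise qualify.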

elixenide
  • I ended up reading the file and looping through the nodes until I extracted just the ones I was interested in. I'm not actually parsing with RegEx, I just needed to extract these nodes to get a working importer (all other nodes currently import fine). And I did my section of the importer with SimpleXML, FWIW – user3258505 Feb 14 '14 at 00:06