0

I'm trying to parse a large XML file to put the contents in my database. My question is simple, although I find it difficult to find a nice and clean solution.

Imagine the following XML-string:

<tag1>
    OuterText <tag2>InnerText</tag2>
</tag1>

Edit. The question is: How do I catch the OuterText in a string?

I could just remove the tags of `<tag1>` and the tags and content of `<tag2>` using regex, but so far I've been using SimpleXML, so I'd prefer an answer that fits that approach.

arothuis
  • possible duplicate of [XML node with Mixed Content using PHP DOM](http://stackoverflow.com/questions/598829/xml-node-with-mixed-content-using-php-dom) – cmbuckley Jul 29 '13 at 23:06
  • Although it's not quite the same requirement, you may find the answer here helpful: http://stackoverflow.com/questions/17582470/simplexml-access-seperated-text-nodes – IMSoP Jul 30 '13 at 19:51
  • @cbuckley That's about creating a document, this is about reading one. Also, that one assumes the DOM API, this assumes the SimpleXML API, although that is a more minor point since the two can be mixed freely as necessity arises. – IMSoP Jul 30 '13 at 19:52

4 Answers

1

Okay, looks like I asked this question too fast. I messed around a bit using my own simplified example and this is what I found. It actually works, despite the malformed XML.

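$xml = "<tag1>
          OuterText <tag2>InnerText</tag2>
        </tag1>";

$sxe = new SimpleXMLElement($xml);

// Casting to string returns only the element's own text nodes, not its children's text.
// trim() removes the surrounding line breaks and indentation.
$out = trim((string)$sxe);
$in = (string)$sxe->tag2;

// output:
// OuterText
// InnerText
echo "$out<br>$in";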

Edit: This method will produce the following result with an XML-string with OuterText on both sides of the inline tag:

$xml = "<tag1>
          OuterText1 <tag2>InnerText</tag2> OuterText2
        </tag1>";
// output will then be (with whitespace collapsed by the browser):
// OuterText1 OuterText2 ($out)
// InnerText ($in)
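
If the two outer pieces need to stay separate, one option is to switch to the DOM API for that element and walk its direct child nodes. This is only a sketch: `$outerParts` is an illustrative name, and it assumes the DOM extension is available alongside SimpleXML.

```php
<?php
$xml = "<tag1>
          OuterText1 <tag2>InnerText</tag2> OuterText2
        </tag1>";

$sxe = new SimpleXMLElement($xml);

// Get the DOM view of the same element (no re-parsing involved).
$dom = dom_import_simplexml($sxe);

// Collect each direct text (or CDATA) child separately, skipping child elements.
$outerParts = [];
foreach ($dom->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE || $child->nodeType === XML_CDATA_SECTION_NODE) {
        $text = trim($child->nodeValue);
        if ($text !== '') {
            $outerParts[] = $text;
        }
    }
}

print_r($outerParts); // two entries: "OuterText1" and "OuterText2"
```

Because the text nodes are read one at a time, the position of each piece relative to `<tag2>` is preserved, which the plain string cast does not give you.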
arothuis
  • Another problem will arise when there is another piece of OuterText (i.e. "OuterText2") after `<tag2>`. It will then output "OuterText OuterText2" and "InnerText". I think I might just use some simple regexes to solve this.
    – arothuis Jul 29 '13 at 23:27
0

Something like this should work:

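$yourinput = new SimpleXMLElement($xmlstr);
// mysql_query() was removed in PHP 7, so this assumes an existing PDO connection in $pdo.
// my_table, field1 and field2 are placeholder names.
$stmt = $pdo->prepare("INSERT INTO my_table (field1, field2) VALUES (?, ?)");
foreach ($yourinput->tag1 as $curtag) {
    // Bound parameters quote and escape the values.
    $stmt->execute([(string)$curtag, (string)$curtag->tag2]);
}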
ScottMcGready
0

If I understand the question correctly, you want all the text content of a tag, in order, but without any inner XML tags.

It's not particularly elegant, but this would theoretically do the trick:

$inner_text = strip_tags($some_simplexml_node->asXML()); 

The trick here is that SimpleXML can serialize any fragment of XML (e.g. a single node that you've found while traversing the document) back into XML; removing all tags from that should then give you all the text content, in the right order.
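
A small sketch of this, with one caveat: `asXML()` returns serialized XML, so entities such as `&amp;` are still encoded after `strip_tags()` and need decoding. The sample document and variable names below are made up for illustration; the DOM `textContent` property is shown as an alternative that returns already-decoded text.

```php
<?php
// Hypothetical sample: <tag1> is nested under a root so asXML() returns only the fragment.
$xml = '<root><tag1>Fish &amp; <tag2>Chips</tag2> to go</tag1></root>';
$root = new SimpleXMLElement($xml);
$node = $root->tag1;

// Serialize the fragment and drop the tags; entities remain encoded.
$stripped = strip_tags($node->asXML());

// Decode the XML entities to get plain text.
$decoded = html_entity_decode($stripped, ENT_QUOTES | ENT_XML1, 'UTF-8');

// Alternative: the DOM textContent property gives all descendant text, decoded.
$viaDom = dom_import_simplexml($node)->textContent;

echo $stripped, "\n"; // Fish &amp; Chips to go
echo $decoded, "\n";  // Fish & Chips to go
echo $viaDom, "\n";   // Fish & Chips to go
```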

IMSoP
-1

You won't be able to use SimpleXML or anything similar for this, as it is not valid XML to have this text contained outside of any element. Is this intentional, or an error in the XML generation (I'm not sure where you are getting the XML from)?

  • The source XML, of which this is a simplified version, is generated by a third party, so it's an error in the XML. I just found out SimpleXML can solve it after all. – arothuis Jul 29 '13 at 23:06
  • 1
    The text is not "outside any element", it is inside the `<tag1>` element. It is perfectly valid in XML to have an element contain text (and CDATA) nodes alongside child elements. Text markup languages like XHTML and DocBook make heavy use of this (e.g. `<p>some <b>bold</b> text<br />and a line break</p>`)
    – IMSoP Jul 31 '13 at 11:15
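
A quick way to check this point is to parse the question's sample with libxml error collection switched on and look at what `<tag1>` actually contains. This is a rough sketch; `$parsedOk` and `$kinds` are illustrative names.

```php
<?php
// Collect libxml errors instead of emitting warnings.
libxml_use_internal_errors(true);

$xml = "<tag1>\n    OuterText <tag2>InnerText</tag2>\n</tag1>";
$sxe = simplexml_load_string($xml);

// The document is well-formed if parsing succeeded and no errors were recorded.
$errors = libxml_get_errors();
$parsedOk = ($sxe !== false) && count($errors) === 0;
var_dump($parsedOk); // bool(true)

// List the kinds of direct children of <tag1>: text, element, text.
$kinds = [];
foreach (dom_import_simplexml($sxe)->childNodes as $child) {
    $kinds[] = $child->nodeType === XML_ELEMENT_NODE ? 'element' : 'text';
}
print_r($kinds);
```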