
I have to parse many XML documents like this:

<doc id=lk-20130223040102_592>
<meta-info>
<tag name="date">2013-02-22</tag>
<tag name="source-encoding">ISO-8859-1</tag>
</meta-info>
<text><SE><E type="E:PERSON">Tom Taylor</E>, who runs <E type="E:ORGANIZATION:CORPORATION">MF&B Marine Warehouse</E> in <E type="E:LOCATION:OTHER">Hampton Roads</E>, is already watching contracts with the <E type="E:ORGANIZATION:GOVERNMENT">Navy</E> <E type="E:PER_DESC">dry</E> up at his small ship-repair <E type="E:ORG_DESC:CORPORATION">business</E>.</SE>
</text></doc>
<doc ...</doc>

I made a simple script to parse one of these:

<?php
$xml=simplexml_load_file('wp7-lk-20130223040102.xml');
foreach ($xml->doc as $doc){
    echo $doc['id'];
    echo "<br>";
}
?>

but it produces a series of warnings like this:

Warning: simplexml_load_file(): ^ in C:\wamp\www\parse_xml.php on line 6

I noticed some errors (unquoted attribute values such as id=... instead of id="...", and a missing parent element) and corrected what I could, but there are many others.

Is there a function that can correct XML errors automatically?

K0pp0

1 Answer

This is a non-PHP solution, but it could be part of the process (and even be automated via PHP). For many years I've relied on an app called "tidy" to quickly fix HTML and XML. It might not work, or it might make things worse; it's just a suggestion.

tidy -xml yourfile.xml > output.xml

I've had good luck with it. YMMV.
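Since tidy may not fix everything, it can help to check its output before handing it to the real parser. Below is a minimal sketch that collects libxml's parse errors instead of letting them surface as PHP warnings. The helper name xmlParseErrors is hypothetical; the libxml_* and simplexml_load_string calls are standard PHP.

```php
<?php
// Hypothetical helper: return libxml's parse errors for an XML string.
// An empty array means the string parsed as well-formed XML.
function xmlParseErrors(string $xml): array
{
    // Collect errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Usage: an unquoted attribute value, as in the question, is reported as an error.
foreach (xmlParseErrors('<doc id=abc><text>hello</text></doc>') as $message) {
    echo $message, PHP_EOL;
}
```

Running this on output.xml (for example via file_get_contents) after each tidy pass shows which problems remain and on which lines.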

Your question is similar to "Fix malformed XML in PHP before processing using DOMDocument functions", which suggests the Tidy PHP extension.
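With the Tidy PHP extension, the repair could be sketched roughly like this. This assumes ext-tidy is installed; the helper name repairWithTidy and the 'utf8' encoding are assumptions for illustration (the sample document declares ISO-8859-1 as its source encoding, so a different value may be needed). Whether tidy repairs every kind of error in these files is not guaranteed.

```php
<?php
// Hypothetical helper: try to repair malformed XML with the tidy extension.
// Requires ext-tidy; the encoding 'utf8' is an assumption about the input.
function repairWithTidy(string $raw): string
{
    $config = [
        'input-xml'  => true, // parse the input as XML rather than HTML
        'output-xml' => true, // write well-formed XML
        'wrap'       => 0,    // do not re-wrap long lines
    ];

    $fixed = tidy_repair_string($raw, $config, 'utf8');
    if ($fixed === false) {
        throw new RuntimeException('tidy could not repair the input');
    }

    return $fixed;
}

// Usage (only runs when the extension is available):
if (function_exists('tidy_repair_string')) {
    echo repairWithTidy('<doc id=abc><text>hello</text></doc>');
}
```

This is the in-process equivalent of the command line above, so the same caveat applies: inspect the result before trusting it.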

old tidy link: http://www.w3.org/People/Raggett/tidy/

Tom Pimienta
  • Not sure this is relevant to either XML or to the OP's problem, but a new, HTML 5-aware incarnation of tidy is available [here](http://www.htacg.org/tidy-html5/). – Sato Katsura Jun 28 '15 at 19:00