
Is there a 'simple' way to check whether an XML file has valid syntax (is well-formed)? I'm using PHP's XMLReader.

I'm in this situation: I have multiple XML files whose structure changes a lot, so I can't do an XMLReader::isValid() check against a DTD file. But that is not needed per se. I only want to know if the syntax is OK, because sometimes an XML file is corrupted, for example at the end. I would like to check this before iterating over the nodes.

The other thing is that some files are over 2GB in size, so I can't do a simple DOM check without using heavy memory.

What should I do?

Of course I tried the options suggested in the comments, and this works great, but only for small files:

        $dom = new DOMDocument;
        if(!@$dom->load('example.xml')){ die("syntax error"); }

Larger files eat up all the memory and crash.

When I open a large XML file in a simple XML reader program like "firstobject XML editor", it shows me the line with the syntax error within milliseconds (for a 30GB XML file it takes 1.7 seconds). Something like this should be possible with PHP's XMLReader, I guess?

Edit: For the moment I will use the option above, but do a filesize check first. If below a certain size (still testing what the max size is) the syntax is checked. For the bigger files I will build an option as @IMSoP suggested below with a third party tool and command line check. I will update this if I find a stable solution for this.

Edit 2: Progman's idea (answer below) is the best I've seen so far. The only drawback is that it iterates over the entire XML file. Processing already takes quite some time, and this will double it. I was hoping for a quick validation option, but maybe that is not even possible. I will wait a little to see if there are any other options; otherwise I think I should accept Progman's answer as the best option for large files.

Edit 3 (solution): Alright, I fine-tuned Progman's solution to work without set_error_handler, because I'm already using that for custom error handling. What fits best for me is to suppress the errors by setting the libxml_use_internal_errors(true) flag and check the collected errors afterwards. Short example:

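// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$xml = new XMLReader();
$xml->open("large.xml");
// read() returns false at the end of the file or at the first syntax error.
while($xml->read());
// An empty list means the whole file was read without syntax errors.
foreach (libxml_get_errors() as $error) {
    print_r($error);
}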
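The same idea can be wrapped in a small function. This is only a sketch: the name findXmlSyntaxErrors and the message format are made up for illustration, and I'm assuming it is acceptable to silence read() with @ so that no PHP-level warning reaches a custom error handler. The libxml details are still collected internally, and each LibXMLError carries the line and column of the problem.

```php
<?php
/**
 * Hypothetical helper: streams through an XML file and returns a list of
 * human-readable syntax errors. An empty array means no error was reported.
 */
function findXmlSyntaxErrors(string $path): array
{
    if (!is_readable($path)) {
        return ["cannot read file: $path"];
    }

    // Switch libxml to internal error collection and remember the old setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $messages = [];
    $reader = new XMLReader();
    if ($reader->open($path)) {
        // Walk every node; read() returns false at the end of the document
        // or at the first parse error, so the loop stops there.
        while (@$reader->read()) {
            // nothing to do, we only care about the collected errors
        }
        $reader->close();
    } else {
        $messages[] = "could not open file: $path";
    }

    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    // Leave the global libxml state as we found it.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}
```

Calling findXmlSyntaxErrors('large.xml') and checking whether the result is empty gives a yes/no answer, and the entries show where the file breaks. It still has to stream through the whole file when the file is fine.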
  • does this help you? https://stackoverflow.com/a/37422022/7740139 – mahbad Nov 24 '21 at 19:39
  • What is the problem you have with XMLReader? Please [edit] your question to include your attempts you have tried and the problems/error messages you get from your attempts. – Progman Nov 24 '21 at 19:40
  • Thanks @mahbad and Progman, I edited the question. This does not work with (very) large files unfortunately. –  Nov 24 '21 at 20:28
  • @RobbertRenolds Your source code in the question shows `DOMDocument`, but you are talking about `XMLReader`. Have you tried using `XMLReader`? What is the problem/result from your attempt in using `XMLReader`? – Progman Nov 24 '21 at 21:40
  • You could perhaps use `shell_exec` to run an external validator, such as [xmlwf](https://linux.die.net/man/1/xmlwf). I'm not posting it as an answer, because I'm not sure how well it deals with large files, or how easy it would be to parse the output (the man page points out that it doesn't give a useful exit status). – IMSoP Nov 24 '21 at 21:40
  • Hi @progman, because I don't know how to do it with XMLReader. As I described in my question, XMLReader::isValid() can't be used without a DTD as far as I know and because I have many different XML files without DTD, I don't know how I could validate with XMLReader to simply see if the syntax is correct. –  Nov 25 '21 at 08:47

1 Answer


You can use the XMLReader class to read through the XML content without loading the whole content into memory. Use the read() method to read every node in the XML document. This method emits a warning when it cannot read the current node because of an error. The warning might look like this:

PHP Warning: XMLReader::read(): file.xml:9183: parser error : xmlParseEntityRef: no name in file.php on line X

You can use set_error_handler() to react to any warning you get; see other questions like "PHP get warning and error messages?". Use a simple while() loop to read each node of the XML document until you reach the end. Check the following proof of concept:

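<?php
$xml = XMLReader::open('test.xml');
$warningCount = 0;
// Count every warning raised while reading instead of printing it.
set_error_handler(function ($errno, $errstr, $errfile, $errline) use (&$warningCount) {
    $warningCount++;
});
// read() returns false at the end of the document or at the first parse error.
while ($xml->read());
restore_error_handler();
// Only report success when no warning was counted.
echo $warningCount === 0 ? "All fine\n" : "Problems found\n";
var_dump($warningCount);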

The variable $warningCount will be 0 if there were no warnings (or any other error types) and greater than 0 if there was a warning, most likely from the read() call.
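If you already have your own error handler installed, the proof of concept can be scoped so that it does not interfere with it. The following is a rough sketch under a few assumptions: the function name countXmlReadWarnings is made up, only E_WARNING is counted, and libxml's internal error collection is switched off for the duration of the check so that the warnings are actually emitted.

```php
<?php
/**
 * Hypothetical helper: counts the warnings raised while streaming through
 * an XML file, then puts the previous error handler back in place.
 */
function countXmlReadWarnings(string $path): int
{
    $warningCount = 0;

    // Make sure libxml reports problems as PHP warnings during this check.
    $previous = libxml_use_internal_errors(false);

    // Temporary handler for warnings only; returning true marks them handled.
    set_error_handler(
        function (int $errno, string $errstr) use (&$warningCount): bool {
            $warningCount++;
            return true;
        },
        E_WARNING
    );

    try {
        $reader = new XMLReader();
        if ($reader->open($path)) {
            // read() returns false at the end or at the first parse error.
            while ($reader->read()) {
                // nothing to do per node
            }
            $reader->close();
        }
    } finally {
        // Reinstate whatever handler and libxml setting were active before.
        restore_error_handler();
        libxml_use_internal_errors($previous);
    }

    return $warningCount;
}
```

A result of 0 from countXmlReadWarnings('test.xml') means no warning was raised. A failed open() also raises a warning, so an unreadable file is counted as a problem too.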

Progman
  • Thanks a lot Progman for taking the time to explain this. It's not perfect (not your fault, just PHP's options) because set_error_handler is a bit cumbersome to use, but it works! I think I'll go with this for checking large files. In the meantime I will search a bit to see if there is a quicker option. Thanks again! –  Nov 26 '21 at 13:50
  • Ok, see my latest post edit. I'm using the internal errors now. Problem solved, thanks! –  Nov 26 '21 at 14:30