
I have an XML file that is not valid. There are many problems in the file itself, and I need to do daily reimports from it. The structure looks like this:

<products>
    <product no="AP1222-00" name="Colours kravata" price="456" currency="Kč">
        <description name="POPIS PRODUKTU">Kravata Premier Line v moderních barvách. Materiál polyester. Baleno v sáčku s černým poutkem.</description>
    </product>
    <product no="AP1222-22" name="Colours kravata" price="330" currency="Kč">
        <description name="POPIS PRODUKTU">Blabla.</description>
    </product>
</products>

Is there any easy way to get the array of products, so I can fix the problems in the file before importing it? SimpleXML etc. don't work, as the file is invalid.

Edit: Here's one complete product from the XML for reference. Notice the double quotes in the product name:

<products>
    <product no="AP1222-00" name="" Colours" kravata" price="456" currency="Kč">
        <folders>
            <folder category="<b>COOL 2017</b>" subcategory="TEXTILE & FASHION"/>
            <folder category="TEXTILE & FASHION" subcategory="Kravaty a šály"/>
        </folders>
        <description name="POPIS PRODUKTU">Kravata Premier Line v moderních barvách. Materiál polyester. Baleno v sáčku s
            černým poutkem.
        </description>
        <properties>
            <property name="KS / KARTON" value="100"/>
            <property name="HMOTNOST KARTONU" value="6"/>
            <property name="NETTO HMOTNOST / KARTON" value="5"/>
            <property name="DIM1" value="15"/>
            <property name="DIM2" value="80"/>
            <property name="DIM3" value="35"/>
            <property name="TECHNOLIGIE POTISKU" value="T1 (8C, 50×80 MM)"/>
            <property name="TARIF" value="6215200090"/>
            <property name="Min. mn. (ks)" value=""/>
            <property name="M3/CARTON" value="0.042"/>
            <property name="COOL 2017 KAPITOLA" value="TEXTILE AND FASHION"/>
            <property name="COOL 2017 STRANY" value="525"/>
            <property name="main category" value="fashion"/>
        </properties>
        <images>
            <image src="http://www.andapresent.com/kepek/cms/original/83653.jpg"/>
        </images>
        <stocks>
            <stock name="navi_central" value="2"/>
            <stock name="navi_arrive" value="" date=""/>
            <stock name="eu_central" value="" date=""/>
            <stock name="eu_arrive_1" value="" date=""/>
            <stock name="eu_arive_2" value="" date=""/>
        </stocks>
    </product>
</products>
Casimir et Hippolyte
user1049961
  • I don't see anything invalid about the above XML? – M. Eriksson Feb 21 '17 at 07:30
  • I just posted the structure as an example, there are many more lines with things like `name=""Something" else "` etc. – user1049961 Feb 21 '17 at 07:41
  • If the file contains errors, try to parse it as an HTML file (`DOMDocument::loadHTML`) – Casimir et Hippolyte Feb 21 '17 at 07:43
  • Please describe precisely how it's *not valid*. You can't fix XML that is empty, you can't fix XML that is pure image with extension renamed to `.png`... – Justinas Feb 21 '17 at 07:47
  • You should really post examples of the invalid XML. It's hard to answer the question if we don't even know what the errors are. But I think you will have a hard time writing a parser, especially using regex, if the file is inconsistent. That's usually where regex falls short. So the answer to your question is most likely: No, there is no easy way. – M. Eriksson Feb 21 '17 at 07:48
  • Garbage In, Garbage Out. You can't get valid output from invalid input. For example, if I need something that returns 5 by adding two numbers together, then there's no meaningful way to get that output from the input of 2 and 2. The same is just as true of XML, with the exception that it would be even harder to get something valid out of XML because it's inherently more complex than adding numbers to get 5. – GordonM Feb 21 '17 at 15:41

1 Answer


The DOMDocument::loadHTML method is more lenient than the XML parser and can automatically fix many errors. The problem is that you have no control over how libxml fixes these errors.
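As a rough illustration of that leniency, here is a minimal sketch. The one-line sample string is made up for the example (it is not the real feed), and it only shows that the HTML parser accepts a bare `&` that the XML parser would reject:

```php
<?php
// Hypothetical sample, not the real feed: a bare '&' inside an attribute value.
$broken = '<products><product name="A & B">x</product></products>';

libxml_use_internal_errors(true);   // keep parser complaints out of the output
$dom = new DOMDocument;
$dom->loadHTML($broken);            // HTML parser: recovers instead of aborting
libxml_clear_errors();

// The HTML parser wraps the content in html/body, so look the element up by name.
$product = $dom->getElementsByTagName('product')->item(0);
$name = $product->getAttribute('name');
echo $name, "\n";
```

The tree you get back follows HTML rules (implied html/body wrappers, HTML handling of unknown and self-closing tags), so repairs to things like the doubled quotes are whatever libxml decides, not what you chose.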

That's why I suggest another approach with DOMDocument::loadXML (which uses the XML parser), but this time correcting the errors with custom rules (these aren't universal fixes; they are adapted to this specific situation).

When you switch libxml_use_internal_errors() to true, all XML errors are stored in an array of LibXMLError instances. Each of them contains an error code, the error line and the error column. (Note that lines and columns are numbered from 1.)
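To see what those error objects look like before writing any rules, a small sketch like the following can help. The sample string is hypothetical (a single line with an unescaped `&`); the `code`, `line`, `column` and `message` properties are the public fields of PHP's LibXMLError:

```php
<?php
// Hypothetical one-line sample: an unescaped '&' inside an attribute value.
$sample = '<products><product name="TEXTILE & FASHION"/></products>';

libxml_use_internal_errors(true);   // collect errors instead of emitting warnings
$dom = new DOMDocument;
$dom->loadXML($sample);
$errors = libxml_get_errors();      // array of LibXMLError objects
libxml_clear_errors();

foreach ($errors as $error) {
    // Print the fields the repair rules will key on.
    printf("code %d at line %d, column %d: %s\n",
        $error->code, $error->line, $error->column, trim($error->message));
}
```

Running this against a few bad lines from the real file is a quick way to find out which error codes need a rule.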

$xml = file_get_contents('file.xml');

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadXML($xml);
$errors = libxml_get_errors();

if ($errors) {
    // LIBXML constant name, LIBXML error code // LIBXML error message
    define('XML_ERR_LT_IN_ATTRIBUTE', 38); // Unescaped '<' not allowed in attributes values
    define('XML_ERR_ATTRIBUTE_WITHOUT_VALUE', 41); // Specification mandate value for attribute
    define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name

    $rules = [
        XML_ERR_LT_IN_ATTRIBUTE => [
            'pattern' => '~(?:(?!\A)|.{%d}")[^<"]*\K<~A',
            'replacement' => [ 'string' => '&lt;', 'size' => 3 ]
        ],
        XML_ERR_ATTRIBUTE_WITHOUT_VALUE => [
            'pattern' => '~^.{%d}\h+\w+\h*=\h*"[^"]*\K"([^"]*)"~',
            'replacement' => [ 'string' => '&quot;$1&quot;', 'size' => 10 ]
        ],
        XML_ERR_NAME_REQUIRED => [
            'pattern' => '~^.{%d}[^&]*\K&~',
            'replacement' => [ 'string' => '&amp;', 'size' => 4 ]
        ]
    ];

    $previousLineNo = 0;
    $lines = explode("\n", $xml);

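    foreach ($errors as $error) {

        // Skip errors that have no custom repair rule.
        if (!isset($rules[$error->code])) continue;

        $currentLineNo = $error->line;

        // First error on a new line: reset the column offset.
        if ( $currentLineNo != $previousLineNo )
            $offset = -1;

        // Work on the line by reference so the fix lands in $lines.
        $currentLine = &$lines[$currentLineNo - 1];
        // Inject the (shifted) error column into the rule's pattern.
        $pattern = sprintf($rules[$error->code]['pattern'], $error->column + $offset);
        $currentLine = preg_replace($pattern,
                                    $rules[$error->code]['replacement']['string'],
                                    $currentLine, -1, $count);
        // Each replacement lengthens the line; shift later columns accordingly.
        $offset += $rules[$error->code]['replacement']['size'] * $count;
        $previousLineNo = $currentLineNo;
    }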

    $xml = implode("\n", $lines);

    libxml_clear_errors();
    $dom->loadXML($xml);
    $errors = libxml_get_errors();
}

var_dump($errors);

$s = simplexml_import_dom($dom);

echo $s->product[0]["name"];

The size in the rules array is the difference between the length of the replacement string and the length of the replaced string (for example, replacing & with &amp; adds 4 characters). This way, when there are several errors on the same line, $offset shifts the reported column of each following error by the amount the line has already grown.
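The offset bookkeeping can be illustrated in isolation. This is a simplified sketch, not the code above: the line and the 0-based positions are made up, and plain substr_replace stands in for the regex rules.

```php
<?php
// Hypothetical line with two bare ampersands.
$line = 'a & b & c';
// 0-based positions of each '&' in the ORIGINAL, unmodified line.
$positions = [2, 6];

// Growth per fix: '&amp;' (5 chars) replaces '&' (1 char), so 4.
$size = strlen('&amp;') - strlen('&');
$offset = 0;

foreach ($positions as $pos) {
    // Earlier fixes made the line longer, so shift the original position.
    $at = $pos + $offset;
    $line = substr_replace($line, '&amp;', $at, 1);
    $offset += $size;
}

echo $line, "\n"; // a &amp; b &amp; c
```

Without the running offset, the second replacement would land at position 6 of the already modified line, which is in the middle of the first `&amp;`.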

The libxml error constants are not available in PHP, which is why they are defined manually (only to make the code more readable). They come from libxml2's xmlParserErrors enumeration.

Casimir et Hippolyte
  • There's no reason anyone should have downvoted this excellent answer, which I cite in the [**canonical answer on bad XML**](https://stackoverflow.com/a/44765546/290085). – kjhughes Dec 24 '17 at 02:00