
This post is related to an earlier one: "Increase performance of PHP DOM-XML. Currently takes too long time". It may help to read that post before this one.

I have an array that contains 7,000+ values:

$arrayIds = [
    'A001',
    ...,
    'A7500'
];

This foreach loop gets the text value inside the <mrk> tags in a given XML file:

$dom = new DOMDocument;
$dom->load('myxml.xml');

$xp = new DOMXPath($dom);

$data = [];

foreach ($arrayIds as $arrayId) {
    $expression = "//unit[@person-name=\"$arrayId\"]/seg-source/mrk";
    $col = $xp->query($expression);

    if ($col && $col->length) {
        foreach ($col as $node) {
            $data[] = $node->nodeValue;
        }
    }
}

It takes approximately 45 seconds. I can't wait any longer than 5 seconds.

What is the fastest way to achieve this?

Segment of the XML file:

<unit person-name="A695" id="PTU-300" xml:space="preserve">
    <source xml:lang="en">This is Michael's speaking</source>
    <seg-source><mrk mid="0" mtype="seg">This is Michael's speaking</mrk></seg-source>
    <target xml:lang="id"><mrk mid="0" mtype="seg">This is Michael's speaking</mrk></target>
</unit>
<unit person-name="A001" id="PTU-4" xml:space="preserve">
    <source xml:lang="en">Related tutorials</source>
    <seg-source><mrk mid="0" mtype="seg">Related tutorials</mrk></seg-source>
    <target xml:lang="id"><mrk mid="0" mtype="seg">Related tutorials</mrk></target>
</unit>
...
<unit>
...
</unit>

For reference, I'm running this on an M1 Mac.

mending3
  • try do-while https://stackoverflow.com/questions/8081253/do-while-is-the-fastest-loop-in-php – Kinglish Jun 16 '21 at 17:08
  • would you tell me why do-while is related to this? – mending3 Jun 16 '21 at 17:34
  • a quick google search and SO search showed do-while as being significantly faster. The info is old tho, but 'do-while is actually faster than while by almost half.' – Kinglish Jun 16 '21 at 17:46
  • also this might be of interest: https://stackoverflow.com/questions/3048583/what-is-the-fastest-xml-parser-in-php#:~:text=The%20fastest%20parser%20will%20be,based%20DOM%20parser%20named%20SimpleXML. – Kinglish Jun 16 '21 at 17:48

1 Answer


There are a couple of things you can do to speed up the processing. First, you are currently running an XPath query against the entire document for each ID you are looking for, so the cost grows with both the size of the document and the number of IDs. It would be more efficient to loop through the document once and test the person-name attribute of each unit element to see whether it is in your list of IDs. That change alone should give you a decent speedup.
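A minimal sketch of that single-pass idea, still using DOM and XPath. The inline sample document, the `body` root element, the `Z999` unit, and the `$buckets` variable are assumptions made for illustration; the real file would be loaded with `$dom->load('myxml.xml')`. It also assumes the units are not in a default namespace, as in the question's own query.

```php
<?php
// Hypothetical sample data standing in for myxml.xml
$xml = <<<'XML'
<body>
<unit person-name="A695" id="PTU-300">
    <seg-source><mrk mid="0" mtype="seg">This is Michael's speaking</mrk></seg-source>
    <target xml:lang="id"><mrk mid="0" mtype="seg">This is Michael's speaking</mrk></target>
</unit>
<unit person-name="A001" id="PTU-4">
    <seg-source><mrk mid="0" mtype="seg">Related tutorials</mrk></seg-source>
    <target xml:lang="id"><mrk mid="0" mtype="seg">Related tutorials</mrk></target>
</unit>
<unit person-name="Z999" id="PTU-5">
    <seg-source><mrk mid="0" mtype="seg">Not wanted</mrk></seg-source>
</unit>
</body>
XML;

$arrayIds = ['A001', 'A695'];

$dom = new DOMDocument;
$dom->loadXML($xml);
$xp = new DOMXPath($dom);

// One bucket per wanted ID; array keys make the membership test a hash lookup
$buckets = array_fill_keys($arrayIds, []);

// Walk the units once instead of running one document-wide query per ID
foreach ($xp->query('//unit[@person-name]') as $unit) {
    $name = $unit->getAttribute('person-name');
    if (!isset($buckets[$name])) {
        continue; // not an ID we were asked for
    }
    // Relative query, evaluated only inside the current unit
    foreach ($xp->query('seg-source/mrk', $unit) as $mrk) {
        $buckets[$name][] = $mrk->nodeValue;
    }
}

// Flatten the buckets in the order of $arrayIds
$data = [];
foreach ($arrayIds as $id) {
    foreach ($buckets[$id] as $text) {
        $data[] = $text;
    }
}

print_r($data);
```

The whole document is still loaded into memory here; only the number of document-wide queries changes, from one per ID to one in total.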

However, at that point XPath is not really doing much for you, so you might as well use XMLReader to parse the document efficiently without loading the whole thing into memory. The code is more complex, and therefore more error-prone and harder to understand, but if you need to process large XML documents efficiently, you need a streaming parser.

The speed difference between looping mechanisms in PHP is insignificant compared to the difference you could see between your current XPath approach and a streaming parser.

<?php

// Instantiate XML parser and open our file
$xmlReader = new XMLReader();
$xmlReader->open('test.xml');

// Array of person-name values we want to extract data for
$arrayIds = ['A001', 'A695'];

/*
 * Buffer for seg-source/mrk values
 * We want a sub array for each ID so we can sort the output by ID
 */
$buffer = [];
foreach($arrayIds as $currId)
{
    $buffer[$currId] = [];
}

/*
 * Flag to indicate whether or not the parser is in a unit that has
 * a person-name that we are looking for
 */
$validUnit = false;

/*
 * Flag indicating whether or not the parser is in a seg-source element.
 * Since both seg-source and target elements contain mrk elements, we need to
 * know when we are in a seg-source
 */
$inSegSource = false;

/*
 * We need to keep track of which person we are currently working with
 * so that we can populate the buffer
 */
$curPersonName = null;

// Parse the document
while ($xmlReader->read())
{
    // If we are at an opening element...
    if ($xmlReader->nodeType == XMLReader::ELEMENT)
    {
        switch($xmlReader->localName)
        {
            case 'unit':
                // Pull the person-name
                $curPersonName = $xmlReader->getAttribute('person-name');

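                /*
                 * If the value is one of the IDs we are looking for, set the
                 * validUnit flag to true; otherwise set it to false. $buffer has
                 * one key per ID, so isset() is a constant-time lookup, whereas
                 * in_array() would scan the whole ID list for every unit.
                 */
                $validUnit = $curPersonName !== null && isset($buffer[$curPersonName]);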
                break;
            case 'seg-source':
                /*
                 * If we are opening a seg-source element, set the flag to true.
                 * An empty <seg-source/> has no END_ELEMENT, so leave it false.
                 */
                $inSegSource = !$xmlReader->isEmptyElement;
                break;
            case 'mrk':
                /*
                 * If we are in a valid unit AND inside a seg-source element,
                 * extract the element value and add it to the buffer
                 */
                if($validUnit && $inSegSource)
                {
                    $buffer[$curPersonName][] = $xmlReader->readString();
                }
                break;
        }
    }
    // If we are at a closing element...
    elseif($xmlReader->nodeType == XMLReader::END_ELEMENT)
    {
        switch($xmlReader->localName)
        {
            case 'seg-source':
                // If we are closing a seg-source, set the flag to false
                $inSegSource = false;
                break;
        }
    }
}

$output = [];
foreach($buffer as $currId=>$currData)
{
    $output = array_merge($output, $currData);
}

print_r($output);
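
To check the ordering and the handling of several mrk elements on a small input, the same logic can be wrapped in a function and fed an inline document. This is only a sketch: the function name `collectSegSourceMrk`, the `body` root element, and the sample text are hypothetical and not part of the original example.

```php
<?php
/**
 * Hypothetical helper: collect seg-source/mrk text for the given IDs,
 * returned in the order of $ids rather than document order.
 */
function collectSegSourceMrk(XMLReader $reader, array $ids): array
{
    $buffer = array_fill_keys($ids, []); // one sub-array per wanted ID
    $name = null;
    $validUnit = false;
    $inSegSource = false;

    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            if ($reader->localName === 'unit') {
                $name = $reader->getAttribute('person-name');
                $validUnit = $name !== null && isset($buffer[$name]);
            } elseif ($reader->localName === 'seg-source') {
                // A self-closing <seg-source/> produces no END_ELEMENT
                $inSegSource = !$reader->isEmptyElement;
            } elseif ($reader->localName === 'mrk' && $validUnit && $inSegSource) {
                $buffer[$name][] = $reader->readString();
            }
        } elseif ($reader->nodeType === XMLReader::END_ELEMENT
            && $reader->localName === 'seg-source') {
            $inSegSource = false;
        }
    }

    // Flatten in the order of $ids
    $out = [];
    foreach ($ids as $id) {
        foreach ($buffer[$id] as $text) {
            $out[] = $text;
        }
    }
    return $out;
}

// Made-up sample: A695 comes first in the document and has two mrk elements
$xml = <<<'XML'
<body>
<unit person-name="A695"><seg-source><mrk mid="0">First part.</mrk><mrk mid="1">Second part.</mrk></seg-source><target><mrk mid="0">Bagian pertama.</mrk></target></unit>
<unit person-name="A001"><seg-source><mrk mid="0">Related tutorials</mrk></seg-source><target><mrk mid="0">Tutorial terkait</mrk></target></unit>
</body>
XML;

$reader = new XMLReader();
$reader->XML($xml);
$result = collectSegSourceMrk($reader, ['A001', 'A695']);
$reader->close();

print_r($result);
```

With this input the A001 text should come first, followed by both mrk values from the A695 unit, and nothing from the target elements.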
Rob Ruchte
  • Thank you for answering. However, this doesn't follow the order of `$arrayIds` when PHP parses the XML. The expected result is if the value of `$arrayIds` is `['A001', 'A695']` then the `$data` will be `["Related tutorials", "This is Michael's speaking"]` – mending3 Jun 17 '21 at 04:36
  • You could add that ordering logic pretty easily. Your question is about parsing speed, this is how you get it. – Rob Ruchte Jun 17 '21 at 05:01
  • How do I do that in XMLReader? I'm not too familiar with XMLReader. Code in my question can do that but the problem is the speed is 'ugly' – mending3 Jun 17 '21 at 05:04
  • Just set the buffer up with keys for each ID, so you're adding values to a sub-array associated with the ID. Then, after populating the buffer, loop through your ID array and pull out the buffer contents by ID. I'm not in front of my computer right now, so I can't update the answer. But this has nothing to do with the reader, it's just basic programming logic. – Rob Ruchte Jun 17 '21 at 05:12
  • Okay. Thank you. If you don't mind, kindly write the logic there after you're in front of computer – mending3 Jun 17 '21 at 05:17
  • I've updated the example to output the data ordered by the index of the associated ID in the $arrayIds – Rob Ruchte Jun 17 '21 at 14:00
  • would you help me a little bit more? currently, if an element contains multiple `<mrk>` elements, it always gets the text of the last `<mrk>`. I want to have both. how do I do that? – mending3 Jun 18 '21 at 08:08
  • My code does that, I just tested with multiple mrk elements. – Rob Ruchte Jun 18 '21 at 14:13