Regex to get MTOM binary content

Question

I am trying to get the MTOM binary content using an extended class of SoapClient. The response looks something like this:

    --uuid:8c73f23e-47d9-49fb-a61c-c1df7b19a306+id=2
    Content-ID: <http://tempuri.org/0>
    Content-Transfer-Encoding: 8bit
    Content-Type: application/xop+xml;charset=utf-8;type="text/xml"    

    <big-xml-here>

           <xop:Include href="cid:http://tempuri.org/1/636644204289948690" xmlns:xop="http://www.w3.org/2004/08/xop/include"/>

        </big-xml-here>

    --uuid:8c73f23e-47d9-49fb-a61c-c1df7b19a306+id=2--

Right after the XML, the MTOM response continues with the binary parts that the "cid" URLs refer to:

Content-ID: <http://tempuri.org/1/636644204289948690>
Content-Transfer-Encoding: binary
Content-Type: application/octet-stream

%PDF-1.4
%���� (lots of binary content here)

--uuid:7329cfb8-46a4-40a8-b15b-39b7b0988b57+id=4--

To extract everything I've tried this code:

        $xop_elements = null;
        preg_match_all('/<xop[\s\S]*?\/>/', $response, $xop_elements);

        $xop_elements = reset($xop_elements);

        if (is_array($xop_elements) && count($xop_elements)) {

            foreach ($xop_elements as $xop_element) {

                $cid = null;
                preg_match('/cid:(.*?)"/', $xop_element, $cid);

                if(isset($cid[1])){
                    $cid = $cid[1];
                    $binary = null;
                    preg_match("/Content-ID:.*?$cid.*?(.*?)uuid/", $response, $binary);
                    var_dump($binary);
                    exit();
                }
            }
        }

Although the preg_match_all and the first preg_match are working, the last one:

/Content-ID:.*?$cid.*?(.*?)uuid/

is not working

In the original source: https://github.com/debuss/MTOMSoapClient/blob/master/MTOMSoapClient.php

the regex is

/Content-ID:[\s\S].+?'.$cid.'[\s\S].+?>([\s\S]*?)--uuid/

but I got an error on PHP 7:

preg_match(): Unknown modifier '/'

Is there a way to get the MTOM binary for each CID?

Thanks in advance!

Don't parse the soap response yourself with regex, use a soap client instead. Ex: https://stackoverflow.com/a/4195132/4110233 — TheChetan, Jun 23 '18 at 05:56

score 0 · Answer 1 · answered Jun 18 '18 at 20:25

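You first need to escape $cid with preg_quote(). It contains / characters, which end the pattern early when / is also the delimiter, and that is what causes your first error ("Unknown modifier '/'"):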

$cid = preg_quote($cid[1], '/');

Next, you need to add the s modifier so that . also matches newlines:

preg_match("/Content-ID:.*?$cid.*?(.*?)uuid/s", $response, $binary);

s (PCRE_DOTALL) If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
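To make the effect of both changes visible, here is a minimal sketch. The `$response` string is a shortened, made-up stand-in for the real MTOM response, and the CID is the one from the question; neither comes from a live service.

```php
<?php
// Made-up sample data modelled on the response in the question.
$cid = 'http://tempuri.org/1/636644204289948690';
$response = "Content-ID: <$cid>\n"
    . "Content-Type: application/octet-stream\n"
    . "\n"
    . "%PDF-1.4\n"
    . "--uuid:abc--";

// Escape regex metacharacters and the "/" delimiter inside the CID.
$quoted = preg_quote($cid, '/');

// Without the s flag, "." stops at line breaks, so the pattern cannot
// reach "uuid", which sits several lines below the Content-ID header.
$withoutS = preg_match("/Content-ID:.*?$quoted.*?(.*?)uuid/", $response, $m1);

// With the s flag, "." also matches newlines and the match can span lines.
$withS = preg_match("/Content-ID:.*?$quoted.*?(.*?)uuid/s", $response, $m2);

var_dump($withoutS, $withS);
```

Note that in this sketch the capture group still contains the remaining header lines of the part, not only the binary payload.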

wp78de · Answer 2 · 2018-06-25T07:40:06.663

As I understand it, you are trying to adapt the original code to your variant of the SOAP response.

Instead of a number, you want to capture the whole http://tempuri.org/1/636644204289948690 in the $cid variable (you may want to rename the variable). To do so, you could use the following regex, which matches everything except a double quote in capture group 1: cid:([^"]+)

preg_match('/cid:([^"]+)/', $xop_element, $cid);

So far, so good. Guessing from your description, you should use the following pattern to capture the binary part:

'%Content-ID: <'.$cid.'>([\s\S]*?)--uuid%'

We use [\s\S] in place of the dot to match across multiple lines (as the original implementation also does). Alternatively, keep the dot and add the s (single-line) flag or the (?s) inline modifier. I also use % as the regex delimiter to avoid having to escape the slashes in the URL. It is still sound to apply preg_quote($cid[1], '%'), as suggested by Tarun.


Now, you can retrieve the block in question from capture group 1:

trim($binary[1]);
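Putting the pieces together, the steps could be sketched roughly as follows. `extractMtomPart()` is a hypothetical helper of my own, not part of MTOMSoapClient, and the response string is a shortened, made-up sample. Unlike the pattern above, this variant also skips the remaining header lines of the part, on the assumption that the headers end at the first blank line, so the capture holds only the payload.

```php
<?php
/**
 * Hypothetical helper (not part of MTOMSoapClient): returns the body of the
 * MIME part whose Content-ID equals $cid, or null when no part matches.
 */
function extractMtomPart(string $response, string $cid): ?string
{
    // Quote the CID for use inside a %-delimited pattern.
    $quoted = preg_quote($cid, '%');

    // Assumption: the part's headers end at the first blank line. Skip them,
    // then capture lazily up to the line break before the next "--uuid".
    $pattern = '%Content-ID: <' . $quoted . '>.*?\R\R(.*?)\R--uuid%s';

    return preg_match($pattern, $response, $m) === 1 ? $m[1] : null;
}

// Shortened, made-up response modelled on the one in the question.
$cid = 'http://tempuri.org/1/636644204289948690';
$response = "--uuid:1\r\n"
    . "Content-ID: <http://tempuri.org/0>\r\n\r\n"
    . "<xop:Include href=\"cid:$cid\"/>\r\n"
    . "--uuid:1\r\n"
    . "Content-ID: <$cid>\r\n"
    . "Content-Transfer-Encoding: binary\r\n"
    . "Content-Type: application/octet-stream\r\n\r\n"
    . "%PDF-1.4 fake bytes\r\n"
    . "--uuid:1--";

// Step 1: pull the CID out of the xop:Include element.
preg_match('/cid:([^"]+)/', $response, $m);

// Step 2: fetch the part that carries this Content-ID.
$pdf = extractMtomPart($response, $m[1]);
var_dump($pdf);
```

Returning null for an unknown CID lets the caller decide how to handle a missing attachment. Because the capture stops right before the boundary, no trim() is applied here, which avoids stripping bytes that belong to the binary.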
