I have a PHP application which interfaces with a payment processor to handle credit cards. Sometimes, the POST response from the processor fails (e.g. a momentary glitch in the matrix), and we don't get the automated notification of the payment. In these cases, we fall back to entering the data from a confirmation email that's always sent. I want my code to parse the text of the email to get the data, and it seems like a perfect use case for preg_match_all. The problem is that the email is badly formatted: it comes in name : value pairs, but they're all on one line, and often the value is blank, which is messing me up.

I'm pretty good with regex basics (quantifiers, grouping, character classes, anchors, modifiers), but I have no real experience with lookahead and backreferences, and it's not at all obvious to me whether they can help.

Sample data might look something like this (again, this would all be on a single line, just wrapped for ease of reading):

bypass_first_page : x_company : x_cust_id : 12345 x_customer_ip : x_customer_tax_id : x_description : 98765 x_duty : x_email_customer : an_example@example.com x_fax : x_footer_email_receipt : x_fp_hash : 747ffeddfe4e106a9c67363ebff996ad x_fp_timestamp : 1525100766 x_invoice_num : R000098765 x_login : MY-LOGIN-ID x_logo_url : x_merchant_email : x_method : x_phone : (416) 555-1212 x_po_num : x_receipt_link_method : GET x_reference_3 : 1234 x_relay_response : TRUE x_relay_url :

I want output that looks like this:

[
    [bypass_first_page] =>
    [x_company] =>
    [x_cust_id] => 12345
    [x_customer_ip] =>
    [x_customer_tax_id] =>
    [x_description] => 98765
    [x_duty] =>
    [x_email_customer] => an_example@example.com
    [x_fax] =>
    [x_footer_email_receipt] =>
    [x_fp_hash] => 747ffeddfe4e106a9c67363ebff996ad
    [x_fp_timestamp] => 1525100766
    [x_invoice_num] => R000098765
    [x_login] => MY-LOGIN-ID
    [x_logo_url] =>
    [x_merchant_email] =>
    [x_method] =>
    [x_phone] => (416) 555-1212
    [x_po_num] =>
    [x_receipt_link_method] => GET
    [x_reference_3] => 1234
    [x_relay_response] => TRUE
    [x_relay_url] =>
]

Important things to note:

  • Field names mostly, but not exclusively, start with x_. If it's only possible to find a solution that requires this, it's probably workable.
  • Field names do not have spaces in them.
  • Some field names have numbers in them.
  • Values can have spaces (e.g. phone number) and underscores (e.g. email address) in them.
  • When there is no value, there is only a single space between the colon and the next field name.

The closest I've come is:

/([\w\d_]+) ?: ([^:]+)/

but this produces output like:

[
    [bypass_first_page] => x_company
    [x_cust_id] => 12345 x_customer_ip
    [x_customer_tax_id] => x_description
    ...
]

As you can see from this regex101 link, this is failing in that there are colons that aren't matched against anything, and field names end up in the values (by themselves or concatenated with the actual value). I feel like if there was a modifier that required that the entire string be matched, or anchors that somehow indicated that one match has to start where the previous one ended, that could solve this pretty easily, but I can't find any mention of such a thing anywhere. May just be that I don't know what that thing is called?

Greg Schmidt
  • Does the payment gateway not offer an API against which to reconcile payments? Is the absence of a `:` between `12345` and `x_customer_ip` a mistake in your sample data, or does it appear like that in the response email? You may have more luck using something like `explode(' x_', $string);` – fubar May 01 '18 at 00:59
  • @fubar, there's no `:` between `12345` and `x_customer_ip` because `12345` is the value for the previous field. There are no `:`s after values, just between names and values. Good idea about the API, I'll look into whether that exists. – Greg Schmidt May 01 '18 at 01:24
  • Okay, got it. Cheers. – fubar May 01 '18 at 01:30

2 Answers

The simplest solution I have found (so far) goes like this:

(\w+) : ?(.*?)(?= ?\w+ :|$)

Demo

Finally, adding an optional space ( ?) at the end, as suggested by Allan, makes the output even nicer.

(\w+) : ?(.*?)(?= ?\w+ :|$) ?

Output:

[0] => Array
    (
        [0] => bypass_first_page : 
        [1] => x_company : 
        [2] => x_cust_id : 12345
        [3] => x_customer_ip : 
        [4] => x_customer_tax_id : 
        [5] => x_description : 98765
        [6] => x_duty : 
        [7] => x_email_customer : an_example@example.com
        [8] => x_fax : 
        [9] => x_footer_email_receipt : 
        [10] => x_fp_hash : 747ffeddfe4e106a9c67363ebff996ad
        [11] => x_fp_timestamp : 1525100766
        [12] => x_invoice_num : R000098765
        [13] => x_login : MY-LOGIN-ID
        [14] => x_logo_url : 
        [15] => x_merchant_email : 
        [16] => x_method : 
        [17] => x_phone : (416) 555-1212
        [18] => x_po_num : 
        [19] => x_receipt_link_method : GET
        [20] => x_reference_3 : 1234
        [21] => x_relay_response : TRUE
        [22] => x_relay_url :
    )

[1] => Array
    (
        [0] => bypass_first_page
        [1] => x_company
        [2] => x_cust_id
        [3] => x_customer_ip
        [4] => x_customer_tax_id
        [5] => x_description
        [6] => x_duty
        [7] => x_email_customer
        [8] => x_fax
        [9] => x_footer_email_receipt
        [10] => x_fp_hash
        [11] => x_fp_timestamp
        [12] => x_invoice_num
        [13] => x_login
        [14] => x_logo_url
        [15] => x_merchant_email
        [16] => x_method
        [17] => x_phone
        [18] => x_po_num
        [19] => x_receipt_link_method
        [20] => x_reference_3
        [21] => x_relay_response
        [22] => x_relay_url
    )

[2] => Array
    (
        [0] => 
        [1] => 
        [2] => 12345
        [3] => 
        [4] => 
        [5] => 98765
        [6] => 
        [7] => an_example@example.com
        [8] => 
        [9] => 
        [10] => 747ffeddfe4e106a9c67363ebff996ad
        [11] => 1525100766
        [12] => R000098765
        [13] => MY-LOGIN-ID
        [14] => 
        [15] => 
        [16] => 
        [17] => (416) 555-1212
        [18] => 
        [19] => GET
        [20] => 1234
        [21] => TRUE
        [22] => 
    )

I did some more tests and think this should fit the bill.
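
To get from these three arrays to the name => value map asked for in the question, the names and values can be zipped together. The following is a minimal sketch, assuming the email body is already available as a single-line string; the variable names and the shortened sample line are illustrative only. The comments also walk through what each part of the pattern does.

```php
<?php
// Shortened sample from the question; the real email has every pair on one line.
$line = 'bypass_first_page : x_company : x_cust_id : 12345 x_customer_ip : '
      . 'x_email_customer : an_example@example.com x_fax : '
      . 'x_phone : (416) 555-1212 x_relay_url :';

// (\w+)            field name (letters, digits, underscore)
//  : ?             the " :" separator plus the optional space after it
// (.*?)            value, matched lazily: as short as possible, may be empty
// (?= ?\w+ :|$)    lookahead: stop right before the next "name :" or at end of input
//  ?               consume the space that separates this pair from the next
$pattern = '/(\w+) : ?(.*?)(?= ?\w+ :|$) ?/';

preg_match_all($pattern, $line, $m);

// $m[1] holds the names and $m[2] the values, in the same order.
$fields = array_combine($m[1], $m[2]);

print_r($fields);
```

Two caveats with this sketch: if a field name appears twice, array_combine keeps only the last value, and a value that itself contains something shaped like "word :" would be split at that point.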

PS: The first solution I came up with was this:

(?:^| )(\w+) : ?(?!\w+ : )(?:(.*?)(?= \w+ :|$))?

It's a bit more verbose but might also be helpful to you.
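
As a rough check, the two patterns can be run side by side. This is only a sketch: parse_pairs is a hypothetical helper, and the inputs are shortened stand-ins for the real email. On a line shaped like the sample, both should give the same result. As far as I can trace it, they differ when the last two fields are both empty and there is no trailing space after the final colon, in which case the verbose pattern takes the last field name as a value.

```php
<?php
// parse_pairs is an illustrative helper, not part of any library.
function parse_pairs(string $pattern, string $line): array
{
    preg_match_all($pattern, $line, $m);
    // Group 1 = field names, group 2 = values.
    return array_combine($m[1], $m[2]);
}

// Both patterns are copied unchanged from the answer.
$simple  = '/(\w+) : ?(.*?)(?= ?\w+ :|$) ?/';
$verbose = '/(?:^| )(\w+) : ?(?!\w+ : )(?:(.*?)(?= \w+ :|$))?/';

// Shortened sample: empty values, a plain value, and a value with spaces.
$line = 'bypass_first_page : x_cust_id : 12345 x_fax : x_phone : (416) 555-1212 x_relay_url :';

$same = parse_pairs($simple, $line) === parse_pairs($verbose, $line);
var_dump($same);

// Edge case: two empty fields at the very end, nothing after the last colon.
$tail = 'x_relay_response : x_relay_url :';
$simpleTail  = parse_pairs($simple, $tail);
$verboseTail = parse_pairs($verbose, $tail);
print_r($simpleTail);
print_r($verboseTail);
```

If the real emails can end that way, the shorter pattern looks like the safer choice.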

wp78de
  • Great answer and beautiful regex +1!!! Impressive! By the way, shouldn't you add ` ?` at the end of it? https://regex101.com/r/qLrEXM/1/ – Allan May 01 '18 at 05:55
  • This looks perfect! If you see fit to explain how it works a little bit, that would be a bonus. – Greg Schmidt May 01 '18 at 16:06

Solution 1:

I have adapted your regex in the following way:

(\w+|x_[^: ]*) ?:( ((?!x_|\()[^:() ]*|(?:(\d*[)( -])*\d+))?)? ?

It is not perfect, but it works fine on your example, as you can see at: https://regex101.com/r/tTr4lG/2

Note that it also relies on field names starting with x_.

Solution 2 (see https://regex101.com/r/tTr4lG/3):

The limitation that field names must start with x_ has been removed!

(?<= |^)(([\w\d_]+) : ([A-Za-z0-9-]+(?= )|(\d*[)( -])*\d+|[A-Za-z0-9-_.]+@[A-Za-z0-9-_.]+\.[A-Za-z]+(?= ))?) ?

Limitations: a space is only accepted in phone numbers, and an underscore is only accepted in email addresses.

Allan
  • Do you see any way to make it work without relying on field names starting with x_? As per the discussion on @fubar's answer. – Greg Schmidt May 01 '18 at 02:33
  • @GregSchmidt: have a look at my 2nd answer – Allan May 01 '18 at 03:05
  • I think there are two problems: a) there is an additional space after the last colon, and b) this solution assumes specific field/value formats, which is not absolutely necessary. Check my solution. – wp78de May 01 '18 at 05:48