I have a PHP application that interfaces with a payment processor to handle credit cards. Sometimes the POST response from the processor fails (e.g. a momentary glitch in the matrix), and we don't get the automated notification of the payment. In these cases, we fall back to entering the data from a confirmation email that's always sent. I want my code to parse the text of the email to get the data, and it seems like a perfect use case for preg_match_all. The problem is that the email is badly formatted: it comes in name : value pairs, but they're all on one line, and often the value is blank, which is messing me up.
I'm pretty good with regex basics (quantifiers, grouping, character classes, anchors, modifiers), but I have no real experience with lookaheads and backreferences, and it's not at all obvious to me whether they can help here.
Sample data might look something like this (again, this would all be on a single line, just wrapped for ease of reading):
bypass_first_page : x_company : x_cust_id : 12345 x_customer_ip : x_customer_tax_id : x_description : 98765 x_duty : x_email_customer : an_example@example.com x_fax : x_footer_email_receipt : x_fp_hash : 747ffeddfe4e106a9c67363ebff996ad x_fp_timestamp : 1525100766 x_invoice_num : R000098765 x_login : MY-LOGIN-ID x_logo_url : x_merchant_email : x_method : x_phone : (416) 555-1212 x_po_num : x_receipt_link_method : GET x_reference_3 : 1234 x_relay_response : TRUE x_relay_url :
I want output that looks like this:
[
[bypass_first_page] =>
[x_company] =>
[x_cust_id] => 12345
[x_customer_ip] =>
[x_customer_tax_id] =>
[x_description] => 98765
[x_duty] =>
[x_email_customer] => an_example@example.com
[x_fax] =>
[x_footer_email_receipt] =>
[x_fp_hash] => 747ffeddfe4e106a9c67363ebff996ad
[x_fp_timestamp] => 1525100766
[x_invoice_num] => R000098765
[x_login] => MY-LOGIN-ID
[x_logo_url] =>
[x_merchant_email] =>
[x_method] =>
[x_phone] => (416) 555-1212
[x_po_num] =>
[x_receipt_link_method] => GET
[x_reference_3] => 1234
[x_relay_response] => TRUE
[x_relay_url] =>
]
Important things to note:
- Field names mostly, but not exclusively, start with x_. If it's only possible to find a solution that requires this, it's probably workable.
- Field names do not have spaces in them.
- Some field names have numbers in them.
- Values can have spaces (e.g. phone number) and underscores (e.g. email address) in them.
- When there is no value, there is only a single space between the colon and the next field name.
The closest I've come is:
/([\w\d_]+) ?: ([^:]+)/
but this produces output like:
[
[bypass_first_page] => x_company
[x_cust_id] => 12345 x_customer_ip
[x_customer_tax_id] => x_description
...
]
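For reference, here is a minimal script that reproduces the behaviour. The `$email` string is a shortened stand-in for the real email line (assumed to have the same spacing), and `$result` is just an illustrative variable name:

```php
<?php
// Shortened stand-in for the single-line email body (assumed spacing).
$email = 'bypass_first_page : x_company : x_cust_id : 12345 x_customer_ip : '
       . 'x_customer_tax_id : x_description : 98765 x_duty :';

// The attempted pattern: a name, an optional space, a colon, a space,
// then everything up to (but not including) the next colon.
preg_match_all('/([\w\d_]+) ?: ([^:]+)/', $email, $matches);

// Pair each captured name with its (trimmed) captured value.
$result = array_combine($matches[1], array_map('trim', $matches[2]));
print_r($result);
```

Because `[^:]+` runs up to the next colon, each value swallows the following field name, and every second field is never matched as a name at all.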
As you can see from this regex101 link, this fails in two ways: some colons aren't matched against anything, and field names end up in the values (by themselves or concatenated with the actual value). I feel like if there were a modifier that required the entire string to be matched, or an anchor that somehow indicated that one match has to start where the previous one ended, that could solve this pretty easily, but I can't find any mention of such a thing anywhere. It may just be that I don't know what that thing is called?
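One possible direction, offered as a sketch rather than a confirmed solution: a negative lookahead can stop a value from consuming any space-separated token that is itself followed by " :", i.e. the next field name. This assumes field names consist only of word characters (`\w`) and that no value contains a word immediately followed by " :". The sample string and the `$fields` variable are illustrative only.

```php
<?php
// Shortened stand-in for the single-line email body (assumed spacing).
$email = 'bypass_first_page : x_company : x_cust_id : 12345 x_customer_ip : '
       . 'x_phone : (416) 555-1212 x_email_customer : an_example@example.com x_relay_url :';

// (\w+) :              a field name followed by " :"
// (?: (?!\w+ :)\S+)*   zero or more " token" pieces, where a token is
//                      rejected if it looks like the next "name :"
$pattern = '/(\w+) :((?: (?!\w+ :)\S+)*)/';
preg_match_all($pattern, $email, $matches);

// Values are captured with a leading space, so trim them.
$fields = array_combine($matches[1], array_map('trim', $matches[2]));
print_r($fields);
```

As for an anchor meaning "start where the previous match ended": PCRE has `\G` for that. Prepending something like `\G ?` to the pattern above might make matching stop at the first malformed spot instead of skipping over it, though I have not tested that against real emails.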