I need to find a key-value pair in a JSON document using a regular expression. The problem is that I can't work out how to limit the match to just that pair.

Using this regexp,

"email"\s*:\s*".*"

it selects everything up to the last " on the line.

But I only want to match up to the first closing ", so the selection would look like this:

"email":"foobar@foo.bar"

In order to achieve this I have tried using anchors like this:

"email"\s*:\s*^".*"$

but it is not working as expected. What would be a better way to achieve this?

Please note that if the email contains a double quote, the JSON string will look like this:

{"email":"foo@bar.c\"om"}

In that case, the pattern would also need to skip every escaped quote (\"), wouldn't it?

I also need to extract this data from a large file containing more than 1.6 million inline JSON documents.

Playground: https://regexr.com/552pt

rakibtg
  • 1
    Don't use RegEx to parse JSON, they are not suited for that. Instead, use [`json_decode()`](https://www.php.net/manual/en/function.json-decode.php) – Cid May 21 '20 at 11:11
  • Using `json_decode()` makes it so slow actually – rakibtg May 21 '20 at 11:12

1 Answer

Just add a question mark after the .* to make the quantifier lazy, so that it matches as few characters as possible instead of as many as possible:

"email"\s*:\s*".*?"
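
To make the difference concrete, here is a minimal sketch in Python (chosen only for illustration; the patterns themselves are unchanged). The second key in the sample line is hypothetical and is there only to expose the over-matching. The sketch also runs the anchored attempt from the question, which finds nothing because ^ can only match at the start of the input, never right after "email":.

```python
import re

# Sample line; the "name" key is a hypothetical addition for the demo.
doc = '{"email":"foobar@foo.bar","name":"Foo"}'

greedy = re.compile(r'"email"\s*:\s*".*"')
lazy = re.compile(r'"email"\s*:\s*".*?"')
anchored = re.compile(r'"email"\s*:\s*^".*"$')

# Greedy .* runs to the end of the line, then backtracks to the last quote.
greedy_match = greedy.search(doc).group(0)

# Lazy .*? stops at the first quote it can end on.
lazy_match = lazy.search(doc).group(0)

# ^ in the middle of the pattern can never be satisfied here.
anchored_match = anchored.search(doc)

# The lazy form still stops too early on an escaped quote.
escaped_doc = r'{"email":"foo@bar.c\"om"}'
truncated = lazy.search(escaped_doc).group(0)

print(greedy_match)
print(lazy_match)
print(anchored_match)
print(truncated)
```

The last line of output ends at the backslash-escaped quote, which is the case raised in the comments below.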
Rob Kwasowski
  • 2
    What if the email contains already a double quote ? It's a valid format – Cid May 21 '20 at 11:14
  • @Cid Good point actually, and yes it breaks if the email contains already a double quote. – rakibtg May 21 '20 at 11:15
  • 1
    We might need to skip `\"` from a string like this `{"email":"foo@bar.c\"om"}`? – rakibtg May 21 '20 at 11:18
  • The name part of an email address can technically be enclosed in double quotes, but it can't contain just one double quote, as far as I know. But that format is also disallowed by most webmail providers such as Gmail, and so could also be disallowed by whatever is producing that JSON. – Rob Kwasowski May 21 '20 at 11:55
  • 1
    If you really want to skip `\"`, this should do: `"email"\s*:\s*".*?((?<!\\)")`. I know this is what you want, but you should really be using a json parser. You're going to expand upon this regex with edge cases until it becomes incomprehensible. – Sander Saelmans May 21 '20 at 11:58
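
For the escaped-quote case discussed in these comments, a commonly used alternative to the lookbehind is to describe the string body directly: any character that is not a quote or a backslash, or a backslash followed by one character. Unlike the lookbehind, this also copes with a value that ends in an escaped backslash (\\"). The sketch below is in Python for illustration only; the names EMAIL_RE, extract_email and emails_from_file are hypothetical, and it assumes one JSON document per line with "email" appearing as a key rather than inside another string value.

```python
import json
import re

# String body: a run of characters that are neither a quote nor a backslash,
# or a backslash followed by any single character (an escape pair).
EMAIL_RE = re.compile(r'"email"\s*:\s*"((?:[^"\\]|\\.)*)"')


def extract_email(line):
    """Return the decoded email value from one JSON line, or None."""
    m = EMAIL_RE.search(line)
    if m is None:
        return None
    # Re-wrap the raw body in quotes and let the JSON parser decode escapes.
    return json.loads('"' + m.group(1) + '"')


def emails_from_file(path):
    """Yield emails line by line so the whole file is never held in memory."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            email = extract_email(line)
            if email is not None:
                yield email


print(extract_email(r'{"email":"foo@bar.c\"om"}'))
```

Only the short matched value is passed to the JSON decoder, so the cost of fully parsing each document is avoided, at the price of the assumptions stated above. If the documents can contain "email" inside nested objects or inside other string values, a real JSON parser remains the safer choice.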