
I am looking for a way to find JSON data embedded in a string, similar to WordPress shortcodes. I figure the best way to do it would be a regular expression. I do not want to parse the JSON, just find all occurrences.

Is there a way in regex to match balanced braces? Currently I run into that problem when the JSON contains nested objects.

Quick example for demonstration:

This is a funny text about stuff,
look at this product {"action":"product","options":{...}}.
More Text is to come and another JSON string
{"action":"review","options":{...}}

As a result, I would like to get the two JSON strings. Thanks!

rootman
  • See this question [Regex to validate JSON](http://stackoverflow.com/questions/2583472/regex-to-validate-json). – Amal Murali Feb 24 '14 at 17:30
  • I think the bigger problem here is why do you have JSON strings embedded in a plain text block? I think improving the design may be a better way to go here than trying to build a regex to find JSON substrings in the wild. – TypeIA Feb 24 '14 at 17:30
  • _Is there a way in regex to have matching numbers of parentheses?_ -> No. Regex is not made for that. Why don't you use `json_decode` and parse the result array for data you need? – ntaso Feb 24 '14 at 17:30
  • You realize that `42` would be valid JSON? As would `"Hi There!"`? Unless you restrict your json to be an encoded object only, it's pretty much impossible to detect ALL valid json forms. – Marc B Feb 24 '14 at 17:32
  • I want to use these JSON objects as shortcodes, like in WordPress. The WordPress implementation is messy, at least in my opinion. To pass data to the functions I want to run, I figured JSON would be the best way. As a workaround I could wrap the JSON like this, [[{json}]], and just match [[...]]. However, I want to make it as simple as possible. – rootman Feb 24 '14 at 17:33
  • @rootman Adding some kind of template placeholder such as `[[...]]` IS a simpler approach than trying to pick up JSON out of plaintext. Just make sure your template identifier is something that you would not expect to ever occur naturally in your text and likely should not involve either `{}` or `[]` which are part of JSON syntax and could easily mess up your parsing. – Mike Brant Feb 24 '14 at 17:46

4 Answers


Extracting the JSON string from given text

Since you're looking for a simple solution, you can use the following regular expression, which uses recursion to match balanced braces. It matches everything between { and }, including nested pairs.

Note, however, that this isn't guaranteed to work in all cases; for example, a brace inside a string value upsets the balance. It only serves as a quick JSON-string extraction method.

$pattern = '
/
\{              # { character
    (?:         # non-capturing group
        [^{}]   # anything that is not a { or }
        |       # OR
        (?R)    # recurses the entire pattern
    )*          # previous group zero or more times
\}              # } character
/x
';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);

Output:

Array
(
    [0] => {"action":"product","options":{...}}
    [1] => {"action":"review","options":{...}}
)
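
One case the pattern above does not handle is a brace inside a string value, such as {"text":"abc { def"}. The following is a minimal sketch of a variant that consumes double-quoted strings as a unit, so braces inside them are ignored. The name $stringAwarePattern and the sample text are assumptions made for illustration, and the pattern still does not validate the JSON.

```php
<?php
// Hypothetical variant of the recursive pattern: double-quoted strings are
// consumed as a unit, so braces inside them do not affect the balance.
$stringAwarePattern = '/
\{                            # opening brace
    (?:
        [^{}"]                # any character except a brace or a quote
        |
        "(?:[^"\\\\]|\\\\.)*" # a quoted string, allowing backslash escapes
        |
        (?R)                  # a nested object
    )*
\}                            # closing brace
/x';

// Illustrative input: both objects contain a brace inside a string value.
$text = 'See {"text":"abc { def"} and {"q":"say \"}\" now"}.';

preg_match_all($stringAwarePattern, $text, $matches);
print_r($matches[0]);
```

The quoted-string alternative is tried whenever a quote is reached, because the first alternative excludes quotes. An unterminated quote therefore makes the candidate fail rather than match a wrong span.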



Validating the JSON strings

In PHP, the usual way to check whether a JSON string is valid is to run it through json_decode() (PHP 8.3 and later also provide json_validate()). If the string conforms to the JSON standard, json_decode() returns an object or array representation of it; otherwise json_last_error() reports the failure.

If you'd like to filter out those that aren't valid JSON, then you can use array_filter() with a callback function:

function isValidJSON($string) {
    json_decode($string);
    return (json_last_error() == JSON_ERROR_NONE);
}

$valid_jsons_arr = array_filter($matches[0], 'isValidJSON');
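
Putting extraction and validation together, a self-contained sketch might look like the following. The sample text and the variable name $validJson are assumptions made for illustration. The pattern is the same recursive one as above, written on a single line.

```php
<?php
// Same recursive pattern as above, written on one line.
$pattern = '/\{(?:[^{}]|(?R))*\}/';

// Hypothetical sample text: one valid object and one brace pair that is not JSON.
$text = 'Valid: {"action":"product","options":{"id":7}} but {not json} is not.';

preg_match_all($pattern, $text, $matches);

// Keep only the candidates that json_decode() accepts.
$validJson = array_filter($matches[0], function ($candidate) {
    json_decode($candidate);
    return json_last_error() === JSON_ERROR_NONE;
});

// array_filter() preserves the original keys, so reindex before further use.
$validJson = array_values($validJson);

print_r($validJson);
```

Because array_filter() keeps the original keys, the array_values() call avoids gaps in the index when some candidates are rejected.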


Amal Murali
  • Since Java's regex engine has no recursion, I just nested the pattern three times: `\{(?:[^{}]|(\{(?:[^{}]|(\{[^{}]*\}))*\}))*\}` was sufficient for me, but it is hacky. – rufreakde Nov 27 '18 at 16:02
  • One case where this will not work is when you have curly braces inside a string, e.g. `{"text":"abc { def"}`. – axxis Jan 22 '20 at 14:15
  • It has a problem if the string is too long, but I found that setting `ThreadStackSize` in Apache's `httpd.conf` solves the issue. My question is, what is ThreadStackSize for? – Codeblooded Saiyan Sep 12 '22 at 23:12

For JavaScript folks looking for a similar regex: the recursive pattern (?R) is not supported by JavaScript, by Python's built-in re module, or by several other regex engines.

Note: the workaround below is not a one-to-one replacement, because it only handles a fixed nesting depth.

 \{(?:[^{}]|(?R))*\} # PCRE Supported Regex

Steps:

  1. Copy the whole regex and replace (?R) with the copied string wrapped in a group. For example:
  • level 1 JSON => \{(?:[^{}]|(?R))*\} => \{(?:[^{}]|())*\}
  • level 2 JSON => \{(?:[^{}]|(\{(?:[^{}]|(?R))*\}))*\} => \{(?:[^{}]|(\{(?:[^{}]|())*\}))*\}
  • level n JSON => \{(?:[^{}]|(<pattern nested n - 1 times>))*\}
  2. When you decide to stop at some level, replace ?R with an empty string, which leaves an empty group ().

Done.
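
The substitution steps above can be automated. Here is a minimal sketch in PHP, kept in the same language as the other examples in this thread. The function name buildNestedBracePattern is hypothetical. The resulting pattern string contains no recursion, so it should also be usable in engines such as JavaScript's, though that is not tested here.

```php
<?php
/**
 * Hypothetical helper: expands the recursive pattern to a fixed nesting
 * depth. (?R) is replaced by a grouped copy of the pattern $depth - 1
 * times, and the last remaining (?R) becomes an empty group.
 */
function buildNestedBracePattern(int $depth): string
{
    $template = '\{(?:[^{}]|(?R))*\}';
    $pattern = $template;
    for ($level = 1; $level < $depth; $level++) {
        $pattern = str_replace('(?R)', '(' . $template . ')', $pattern);
    }
    // Stop recursing: replace the remaining (?R) with an empty group.
    return str_replace('(?R)', '()', $pattern);
}

// Usage: a depth-2 pattern matches an object nested one level deep.
preg_match('/' . buildNestedBracePattern(2) . '/', '{"a":{"b":1}}', $m);
echo $m[0], PHP_EOL;
```

Input nested more deeply than the chosen depth will not be matched as a whole, so the depth should be picked from the deepest structure expected in the text.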

Krishna

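I would add a * after the character class, so that each run of non-brace characters is consumed in one step before trying (?R) for nested objects: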

{(?:[^{}]*|(?R))*}



Adding to the answers that suggest (?R) for recursion: if your regex needs to match other things besides the JSON object (e.g. a JSON object preceded by a label, as in key: {jsonobject}), then you want to recurse only into the JSON rule:

(?<j>\{(?:[^{}]|(?&j))*\})

I am using named subpatterns in this example. Notice the (?<j>...) and the (?&j), which define the subpattern and reference it, respectively. With this you can match the following, as an example:

  • Only match the JSON objects that are preceded by "ERROR: ":
ERROR: (?<j>\{(?:[^{}]|(?&j))*\})
ERROR: {"some": "info"}     # will match
INFO: {"some": "info"}      # won't match
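
As a sketch of how this might be used from PHP, the named group can be read back from preg_match_all() by its name. The variable names and the sample log lines are assumptions made for illustration.

```php
<?php
// Recurse only into the named group "j", so the ERROR prefix is matched once
// and is not repeated for nested objects.
$pattern = '/ERROR: (?<j>\{(?:[^{}]|(?&j))*\})/';

// Hypothetical log text: only the first line carries the ERROR prefix.
$log = implode("\n", [
    'ERROR: {"some": {"nested": "info"}}',
    'INFO: {"some": "info"}',
]);

preg_match_all($pattern, $log, $matches);

// Named groups are available under their name as well as their number.
print_r($matches['j']);
```

With (?R) in place of (?&j), the recursion would re-enter the whole pattern, including the "ERROR: " prefix, so nested objects would no longer match.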


MuhsinFatih