10

I'm implementing a parser of sorts, and I need to locate and deserialize a JSON object embedded in other semi-structured data. I used this regexp:

\\{\\s*title.*?\\}

to locate the object

{title:'Title'}

but it doesn't work with nested objects, because the expression stops at the first closing curly bracket it finds. For

{title:'Title',{data:'Data'}}

it matches

{title:'Title',{data:'Data'}

so the string is no longer valid for deserialization. I understand that greediness comes into play here, but I'm not familiar with regexps. Could you please help me extend the expression so that it consumes all the available closing curly brackets?

Update:

To be clear, this is an attempt to extract JSON from semi-structured data such as HTML+JS with embedded JSON. I'm using the Gson Java library to actually parse the extracted JSON.

Viktor Stolbin
  • Watch out for "OMG, don't use Regex it's eevil!!" – MDEV Jul 22 '13 at 11:30
  • .. but in all seriousness - why? What's the data to hand, and what do you need to achieve with it – MDEV Jul 22 '13 at 11:30
  • 3
    @ViktorStolbin There are premade JSON parsing libraries. Also, since JSON isn't a regular language, it cannot be correctly parsed with regular expressions (just like HTML). – Eric Finn Jul 22 '13 at 11:32
  • You really can't do this easily with regex. JSON parser examples are aplenty out there; if possible just pick up one and you would be much better off. – Sanjay T. Sharma Jul 22 '13 at 11:33
  • what language do you use? – Casimir et Hippolyte Jul 22 '13 at 11:34
  • Guys, I'm not re-inventing parser at all, this is just a clean up business to extract JSON data out from rubbish that have nothing similar to clean well formed JSON. This is some kind of HTML+JS+JSON embedded. – Viktor Stolbin Jul 22 '13 at 11:36
  • 1
    @ViktorStolbin: I know you aren't re-inventing the JSON lib. What I'm trying to say here is that this is a two part activity: 1. Extract JSON string out of the semi-structured data 2. Pass that valid piece of JSON string to GSON to parse it into Java constructs. For the first, regex is not sufficient since it can't perform "brace matching" hence the suggestions. – Sanjay T. Sharma Jul 22 '13 at 11:55
  • @SmokeyPHP What would you suggest having question updated? – Viktor Stolbin Jul 22 '13 at 11:59

4 Answers

13

This recursive Perl/PCRE regular expression should be able to match any valid JSON or JSON5 object, including nested objects and edge cases such as braces inside JSON strings or JSON5 comments:

/(\{(?:(?>[^{}"'\/]+)|(?>"(?:(?>[^\\"]+)|\\.)*")|(?>'(?:(?>[^\\']+)|\\.)*')|(?>\/\/.*\n)|(?>\/\*.*?\*\/)|(?-1))*\})/

Of course, that's a bit hard to read, so you might prefer the commented version:

m{
  (                               # Begin capture group (matching a JSON object).
    \{                              # Match opening brace for JSON object.
    (?:                             # Begin non-capturing group to contain alternations.
      (?>[^{}"'\/]+)                  # Match a non-empty string which contains no braces, quotes or slashes, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>"(?:(?>[^\\"]+)|\\.)*")      # Match a double-quoted JSON string, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>'(?:(?>[^\\']+)|\\.)*')      # Match a single-quoted JSON5 string, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>\/\/.*\n)                    # Match a single-line JSON5 comment, without backtracking.
    |                               # Alternation; next alternative follows.
      (?>\/\*.*?\*\/)                 # Match a multi-line JSON5 comment, without backtracking.
    |                               # Alternation; next alternative follows.
      (?-1)                           # Recurse to most recent capture group, to match a nested JSON object.
    )*                              # End of non-capturing group; match zero or more repetitions of this group.
    \}                              # Match closing brace for JSON object.
  )                               # End of capture group (matching a JSON object).
}x
  • 1
    Is this a serious answer to the question? – Arefe Jul 11 '20 at 06:02
  • 3
    Of course it's a serious answer! This regular expression does *exactly* what the question asked for: locate and extract a JSON object embedded inside non-JSON data, handling nested objects and edge cases correctly. The match can then be used with a JSON parser to deserialize the JSON object. – Deven T. Corzine Feb 24 '21 at 22:10
  • I'm not very good with the regex, but this looked so weird. Anyway, thanks for the confirmation. – Arefe Feb 25 '21 at 03:07
  • 1
    I just had a need to use this regular expression myself, but I needed to match many JSON objects buried in a log file that was otherwise plain text. It worked perfectly, I just had to add the "/g" flag to get all the matches back instead of a single match. – Deven T. Corzine Mar 12 '22 at 03:52
  • This is an excellent answer and it works really good in js. Unfortunately it does not work in .Net and thus Powershell because of the last recursive capture group. So if possible I would really appreciate a hint on how to simplify this regex. – Nils Aug 16 '23 at 20:55
  • The recursive capture group is the mechanism that enables this regular expression to fully parse JSON. If you want to try to adapt this regex to work with .NET, it looks like it might be possible, based on https://stackoverflow.com/questions/67380079/how-to-make-a-recursive-regex but it would probably be difficult. – Deven T. Corzine Aug 17 '23 at 01:13
7

As others have suggested, a full-blown JSON parser is probably the way to go. If you want to match the key-value pairs in the simple examples that you have above, you could use:

(?<=\{)\s*[^{]*?(?=[\},])

For the input string

{title:'Title',  {data:'Data', {foo: 'Bar'}}}

This matches:

 1. title:'Title'
 2. data:'Data'
 3. foo: 'Bar'
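
For anyone who wants to try this from Java, here is a minimal sketch that runs the expression above against the same input. The class and method names (`PairFinder`, `findPairs`) are made up for illustration; only the pattern itself comes from this answer, with the backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of any library.
class PairFinder {
    // The expression from the answer, escaped for a Java string literal.
    private static final Pattern PAIR =
            Pattern.compile("(?<=\\{)\\s*[^{]*?(?=[\\},])");

    // Collects every substring the pattern matches, in order of appearance.
    static List<String> findPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            pairs.add(m.group());
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints the three pairs listed above.
        System.out.println(findPairs("{title:'Title',  {data:'Data', {foo: 'Bar'}}}"));
    }
}
```

Note that the lookbehind requires an opening brace right before the match, so this only picks up the first pair after each `{`; a second pair inside the same object would be skipped.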
davidfmatheson
3

Thanks to @Sanjay T. Sharma, who pointed me to "brace matching", which eventually helped me understand greedy expressions, and thanks to the others for telling me up front what I shouldn't do. Fortunately, it turned out to be OK to use the greedy variant of the expression

\\{\\s*title.*\\}

because there is no non-JSON data between the closing brackets.
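
To illustrate the difference, here is a rough Java sketch comparing the lazy expression from the question with the greedy one. The class name `GreedyExtractor`, the method `firstMatch`, and the sample input are invented for the example. The `Pattern.DOTALL` flag is also my own assumption, added so that `.` can cross line breaks if the embedded JSON spans several lines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class GreedyExtractor {
    // Lazy variant from the question: stops at the first closing brace.
    static final Pattern LAZY =
            Pattern.compile("\\{\\s*title.*?\\}", Pattern.DOTALL);
    // Greedy variant: runs to the last closing brace in the input.
    // DOTALL is an assumption, not part of the original expression.
    static final Pattern GREEDY =
            Pattern.compile("\\{\\s*title.*\\}", Pattern.DOTALL);

    // Returns the first match of the pattern, or null if there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "var x = {title:'Title',{data:'Data'}}; foo();";
        System.out.println(firstMatch(LAZY, input));   // {title:'Title',{data:'Data'}
        System.out.println(firstMatch(GREEDY, input)); // {title:'Title',{data:'Data'}}
    }
}
```

The greedy form only works under the condition stated above: if any other `}` appears later in the input, the match extends all the way to it.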

Viktor Stolbin
1

This is absolutely horrible and I can't believe I'm actually putting my name to this solution, but could you not locate the first { character that is in a JavaScript block and attempt to parse the remaining characters with a proper JSON parsing library? If it works, you've got a match. If it doesn't, keep reading until the next { character and start over.

There are a few issues there, but they can probably be worked around:

  • you need to be able to identify JavaScript blocks. Most languages have HTML-to-DOM libraries (I'm a big fan of Cyberneko for Java) that make it easy to focus on the <script>...</script> blocks.
  • your JSON parsing library needs to stop consuming characters from the stream as soon as it spots an error, and it needs to not close the stream when it does.

An improvement would be, once you've found the first {, to look for the matching } (a simple counter that is incremented whenever you find a { and decremented whenever you find a } should do the trick). Attempt to parse the resulting string as JSON. Iterate until it works or you've run out of likely blocks.
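
The counter idea could look roughly like the sketch below. Everything here (`BraceMatcher`, `balancedFrom`, `candidates`) is a hypothetical name made up for illustration. It only produces candidate strings; each one would still have to be handed to a real JSON parser such as Gson to find out whether it is valid. It also assumes no braces occur inside string literals, which would throw the counter off.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class sketching the brace-counting idea.
class BraceMatcher {
    // Returns the text from the '{' at index start up to and including
    // its matching '}', or null if start is not a '{' or no match exists.
    static String balancedFrom(String text, int start) {
        if (start < 0 || start >= text.length() || text.charAt(start) != '{') {
            return null;
        }
        int depth = 0;
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                depth++;                 // one level deeper
            } else if (c == '}') {
                depth--;                 // one level back up
                if (depth == 0) {
                    return text.substring(start, i + 1);
                }
            }
        }
        return null;                     // ran out of input before balancing
    }

    // Collects one balanced candidate per '{' in the text, left to right.
    static List<String> candidates(String text) {
        List<String> result = new ArrayList<>();
        for (int i = text.indexOf('{'); i >= 0; i = text.indexOf('{', i + 1)) {
            String block = balancedFrom(text, i);
            if (block != null) {
                result.add(block);
            }
        }
        return result;
    }
}
```

Trying the candidates in order and stopping at the first one the JSON parser accepts corresponds to the "iterate until it works" step.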

This is ugly, hackish and should never make it to production code. I get the impression that you only need it for a batch job, though, which is why I'm even suggesting it.

Nicolas Rinaudo