splitting JSON string using regex

Question

I want to split a JSON document that has a pattern like [[[1,2],[3,4][5,6]]] using regex. The pairs represent x and y. What I want to do is take this string and produce a list like {"1,2", "3,4", "5,6"}. Eventually I want to split the pairs. I was thinking I could make a list {"1,2", "3,4", "5,6"} and use a for loop to split the pairs. Is this the correct approach to get x and y separately?

No, it is not. Use a JSON parser. See it here: http://stackoverflow.com/a/4935684/460557 — Jorge Campos, Mar 07 '16 at 19:28
You do not need a regex for that kind of string. You need to strip the initial `[`s and trailing `]`s, then split with `],[`. What is the programming language? — Wiktor Stribiżew, Mar 07 '16 at 20:48
Thank you so much for the help. I am using Java. But don't I need a regex to do the splitting? Below is part of the code I used, but the problem is that there are two opening and closing brackets which are not part of the pattern. So I was wondering if there is a way to remove those brackets first so I can use this method of splitting. `String str = branch; String delimiters = ("((\\]|\\[)\\,?)"); String[] tokensVal = str.split(delimiters); for (String token : tokensVal){ System.out.print(token);` — Mimi, Mar 09 '16 at 00:30

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

JSON is not a regular language but a context-free language, and as such it cannot be matched by a regular expression. You need a full JSON parser like the ones referenced in the comments to your question.

... but if you are going to have a fixed structure, like exactly three levels of square brackets with the layout you posted in your question, then there is a regexp that can parse it (it accepts a subset of the JSON grammar and is not general enough to parse other JSON content):

You'll have numbers: `([+-]?[0-9]+)`

Then you'll have brackets and separators: `\[\[\[`, `,`, `\],\[` and `\]\]\]`

and finally, put all this together:

\[\[\[([+-]?[0-9]+),([+-]?[0-9]+)\],\[([+-]?[0-9]+),([+-]?[0-9]+)\],\[([+-]?[0-9]+),([+-]?[0-9]+)\]\]\]

and if you want to permit spaces between symbols, then you need:

\s*\[\s*\[\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*,\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*,\s*\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]\s*\]\s*\]\s*

This regexp has six capturing groups, one for each integer in the matched string, as the demo shows.
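
Since you said you are using Java, here is a minimal sketch of how the whitespace-tolerant regexp above could be applied with `java.util.regex`. The class name `FixedTripleRegex` and the method `parse` are made up for illustration, and the sketch assumes the input is well formed, with a comma between every pair.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: applies the fixed three-pair regexp to one input string.
class FixedTripleRegex {
    // One captured integer: optional sign followed by digits.
    private static final String NUM = "([+-]?\\d+)";
    // One pair "[x,y]" with optional whitespace around every symbol.
    private static final String PAIR =
        "\\[\\s*" + NUM + "\\s*,\\s*" + NUM + "\\s*\\]";
    // Exactly three pairs wrapped in two more levels of brackets.
    private static final Pattern TRIPLE = Pattern.compile(
        "\\s*\\[\\s*\\[\\s*" + PAIR + "\\s*,\\s*" + PAIR + "\\s*,\\s*" + PAIR
        + "\\s*\\]\\s*\\]\\s*");

    static List<int[]> parse(String input) {
        Matcher m = TRIPLE.matcher(input);
        // matches() requires the whole input to fit the pattern.
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected shape: " + input);
        }
        List<int[]> pairs = new ArrayList<>();
        // Groups 1..6 hold x1, y1, x2, y2, x3, y3 in that order.
        for (int g = 1; g <= m.groupCount(); g += 2) {
            pairs.add(new int[] {
                Integer.parseInt(m.group(g)),
                Integer.parseInt(m.group(g + 1))
            });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (int[] p : parse("[[[1,2],[3,4],[5,6]]]")) {
            System.out.println("x=" + p[0] + " y=" + p[1]);
        }
    }
}
```

Note that this only accepts exactly three pairs. A string with a different number of pairs is rejected, which is the price of forcing the language to be regular in this way.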

Clarification

Regular languages, regular grammars and regular expressions describe a class of languages with many practical properties, for example:

You can parse them efficiently in one pass with what is called a finite automaton.
You can define the automaton that accepts the sentences of the language simply by writing a regular expression.
You can combine regexps (or automata) to build acceptors for more complex languages: the union, intersection, symmetric difference, concatenation, etc. of other regular languages.
You can decide whether the language defined by one regular expression is a subset of, a superset of, or neither with respect to the language defined by another.

In exchange, this limits the power of the languages that can be defined with it:

You cannot define languages that allow unbounded nesting of subexpressions (like the bracketing allowed in JSON or the tag nesting allowed in XML documents).
You cannot define languages that collect context and use it in another place of the sentence (for example, sentences that contain a number and have to match that same number somewhere else).

But the point of my answer is that if you bound the nesting depth (say, to three levels of brackets, as in the example you posted), the language becomes regular and you can parse it with a regular expression. It is not easy, because this often leads to complex expressions (as you have seen in my answer), but it is not impossible, and you gain the ability to identify parts of the sentence as submatches of the subexpressions embedded in the global one.

If you want to allow unbounded nesting, you need to switch to context-free languages, which are defined with context-free grammars and accepted by a more complex, stack-based automaton. Then you lose part of the set of operations you had:

In general, you can no longer decide whether one language is included in another.
You can no longer build a language from the intersection or difference of other context-free languages and be sure that the result is still context free (union still works).

But you will be able to match sentences with unbounded nesting. Normally, programming languages are defined with a context-free grammar plus a little more work for context checking (for example, to check that an identifier being used is actually defined in the declaration section, or to match the starting and ending tag identifiers at matching levels in an XML document).
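
To make the "stack-based" part concrete, here is a small sketch of a recursive-descent parser in Java for bracketed lists of integers nested to any depth. It is not a JSON parser (no strings, objects or decimals), and the class name `NestedArrayParser` is made up for illustration. The recursion in `value()` and `array()` plays the role of the stack that a finite automaton lacks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: parses integers and arbitrarily nested [ ... ] lists.
class NestedArrayParser {
    private final String text;
    private int pos = 0;

    private NestedArrayParser(String text) { this.text = text; }

    /** Returns a Long for a number, or a List<Object> for a bracketed list. */
    static Object parse(String text) {
        NestedArrayParser p = new NestedArrayParser(text);
        Object value = p.value();
        p.skipSpaces();
        if (p.pos != text.length()) throw p.error("trailing characters");
        return value;
    }

    private Object value() {
        skipSpaces();
        return peek() == '[' ? array() : number();
    }

    private List<Object> array() {
        List<Object> items = new ArrayList<>();
        pos++;                                  // consume '['
        skipSpaces();
        if (peek() == ']') { pos++; return items; }
        while (true) {
            items.add(value());                 // recursion tracks the nesting depth
            skipSpaces();
            char c = peek();
            pos++;
            if (c == ']') return items;
            if (c != ',') throw error("expected ',' or ']'");
        }
    }

    private Long number() {
        int start = pos;
        if (peek() == '+' || peek() == '-') pos++;
        int digitsStart = pos;
        while (peek() >= '0' && peek() <= '9') pos++;
        if (pos == digitsStart) throw error("expected a number");
        return Long.parseLong(text.substring(start, pos));
    }

    // Returns the current character, or '\0' when the input is exhausted.
    private char peek() {
        return pos < text.length() ? text.charAt(pos) : '\0';
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at position " + pos);
    }
}
```

The same code handles three levels of brackets or thirty without any change, which is exactly what the fixed regexp cannot do.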

For context free languages, see this.
For regular languages, see this.

Second clarification

Since your question didn't say that you wanted to match real, decimal numbers, I have modified the demo to allow fixed-point numbers (not general floating point with exponential notation; you'll need to work that out yourself, as an exercise). Just run some tests and modify the regexp to adapt it to your needs.

(well, if you want to see the solution, look at it)
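
For reference, a fixed-point number can be written as `[+-]?\d+(?:\.\d+)?`. The sketch below is a different technique from the single fixed regexp above: it uses `Matcher.find()` to walk over every `[x,y]` pair in the string, so it works for any number of pairs. The trade-off is that it does not validate the outer brackets at all. The class name `FixedPointPairs` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts every [x,y] pair of fixed-point numbers.
class FixedPointPairs {
    // Optional sign, digits, optional fractional part. No exponent notation.
    private static final String NUM = "([+-]?\\d+(?:\\.\\d+)?)";
    private static final Pattern PAIR =
        Pattern.compile("\\[\\s*" + NUM + "\\s*,\\s*" + NUM + "\\s*\\]");

    static List<double[]> pairs(String input) {
        List<double[]> out = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        // find() scans forward for the next pair; text between pairs is ignored.
        while (m.find()) {
            out.add(new double[] {
                Double.parseDouble(m.group(1)),
                Double.parseDouble(m.group(2))
            });
        }
        return out;
    }

    public static void main(String[] args) {
        for (double[] p : pairs("[[[1.5,-2.25],[3,4.0],[5,6]]]")) {
            System.out.println("x=" + p[0] + " y=" + p[1]);
        }
    }
}
```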

Thank you. I tried this approach but I don't quite get it. I added more clarification as an answer, since the comment box does not allow me to add an image. Thank you once again — Mimi, Mar 10 '16 at 07:23

Mimi · Answer 2 · 2016-03-10T07:31:04.690

Yeah, I tried using the regex in my code but it is not working, so I am trying a different approach now. I have an idea of how to approach it, but it is not really working. First of all, let me be clearer about the question. What I am trying to do is parse a JSON document like the one in the image below. The file has strings with the pattern [[[1,2],[3,4][5,6]]]. What I am trying to get out of this is each pair as a list, so that the list holds x-y pairs.

[image: the string structure]

My approach: first remove the "[[" and "]]" at the beginning and at the end, so I have a string with the same pattern throughout, which gives me the string "[1,2],[3,4][5,6]". This is my code, but it is not working. How do I fix it? The other thing I thought could be an issue is that the strings are not all the same length. So how do I replace just the beginning and the end?

[image: my code]

Then I can use a regex split method to get a list of the form {"1,2", "3,4", "5,6"}. I am not really sure how to do this, though.

Then I take the x and the y from each pair and add them to the list, so I end up with a list of x-y pairs. I would appreciate it if you could show me how to do this.

This is the approach I am working on, but if there is a better way of doing it I will be glad to see it.
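
A minimal sketch of the strip-then-split approach described above, in Java. It assumes integer values only; the class name `SplitPairs` and the method `parse` are made up for illustration. The comma between pairs is treated as optional so that the string exactly as posted, `[[[1,2],[3,4][5,6]]]`, is also accepted. Anchoring the replacement with `^` and `$` means the length of the string does not matter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: strips the outer brackets, then splits into x-y pairs.
class SplitPairs {
    static List<int[]> parse(String input) {
        // Drop every leading '[' and every trailing ']', however many there are.
        String body = input.trim().replaceAll("^\\[+|\\]+$", "");
        List<int[]> pairs = new ArrayList<>();
        if (body.isEmpty()) {
            return pairs;
        }
        // Split between pairs: ']' then an optional ',' then '['.
        for (String pair : body.split("\\]\\s*,?\\s*\\[")) {
            String[] xy = pair.split(",");
            if (xy.length != 2) {
                throw new IllegalArgumentException("Not an x,y pair: " + pair);
            }
            pairs.add(new int[] {
                Integer.parseInt(xy[0].trim()),
                Integer.parseInt(xy[1].trim())
            });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (int[] p : parse("[[[1,2],[3,4][5,6]]]")) {
            System.out.println("x=" + p[0] + " y=" + p[1]);
        }
    }
}
```

This does no real validation of the structure, so for anything beyond this fixed shape a JSON parser remains the safer choice, as the comments and the other answer point out.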

Well, you asked with an example that only included integer numbers, and now you have an example with full real numbers. The regexps are different, of course. See my second edit to the answer for a working demo. — Luis Colorado, Mar 19 '16 at 08:15
By the way, the first thing I say about your question is that **JSON is not a regular language** but a context-free one. There are plenty of JSON parsers on the market (and open source) to solve the problem. Just dig a little. Yes, you can solve your problem (or get into more trouble) using regexps; that depends on you. — Luis Colorado, Mar 19 '16 at 08:32
