Because you specifically said that you don't want to use external libraries, I'll assume that you realize that parsing JSON with regular expressions is taking the long way home and that you should really use a JSON-parsing library.
With that out of the way, consider how the JSON-parsing library itself works. Its tokenizer does essentially what regular expressions do, even when it's hand-coded rather than built on a regex engine. But instead of one regular expression, it has many small patterns, each one designed to recognize a certain element of the JSON syntax: a quoted string, a number, an open or closing brace, etc. The parser defines the JSON syntax in terms of these patterns and things built from them: a "field" is a quoted string followed by a colon followed by a value, a "value" is a quoted string or a number or an array or an object, an "object" is an open brace followed by zero or more comma-separated fields followed by a closing brace, etc.
A good way to write a very simple parser using regular expressions is to define a regular expression that returns "tokens." You define each token as a capturing group and separate them with '|' so the regular expression can match any of them. You determine which token it found by checking which capture group matched. (At most one of them will match.) Then repeatedly use the regex to return tokens until you get the one you want.
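Here's a simple regex that would probably work for you (the braces are escaped because Java's regex engine rejects an unescaped '{' as an illegal repetition):
Pattern pattern = Pattern.compile("\"([^\"]*)\"|(\\{)|(\\})|(\\[)|(\\])|(:)|,");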
The match groups, numbered 1 through 6 in this order, are:
- A quoted string
- An open brace
- A close brace
- An open square bracket
- A close square bracket
- A colon
Note that there's no capturing group for the comma: it's mostly noise here and we don't need to act on it. Also, the capturing group for the quoted string sits inside the quotes, so the captured content excludes them and you don't have to strip the quotes during parsing.
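To make the "check which capture group matched" step concrete, here is a minimal sketch. The class name TokenDemo, the method tokenize, and the "group=content" output format are made up for illustration; only Pattern and Matcher come from the standard library. The braces are escaped because Java does not accept a bare '{' in a pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists every token as "groupNumber=content".
class TokenDemo {
    // Groups: 1 = quoted string, 2 = '{', 3 = '}', 4 = '[', 5 = ']', 6 = ':'.
    // The trailing ",": matches a comma but captures nothing.
    static final Pattern TOKEN = Pattern.compile(
            "\"([^\"]*)\"|(\\{)|(\\})|(\\[)|(\\])|(:)|,");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        // find() resumes where the previous match ended and silently
        // skips anything the pattern does not recognize (e.g. whitespace).
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {   // this is the group that matched
                    tokens.add(g + "=" + m.group(g));
                    break;
                }
            }
            // If no group was set, the match was a comma; ignore it.
        }
        return tokens;
    }
}
```

One thing to keep in mind with this sketch: because find() skips unrecognized text, malformed input is not reported at this level, so any checking has to happen in the code that consumes the tokens.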
You need to build a function that applies this regular expression to the rest of the string (that is, whatever you haven't yet parsed), then loops over the match groups and returns the number of the group that matched and also its content. For some matches the content won't be that interesting (if it matched the colon, the content will always be the colon), but for the quoted string, the content will be the string itself.
Let's say there's a class called JsonScanner that has two methods:
- getNextToken applies the regex and returns the number of the token (that is, the number of the match group).
- getContent returns the content from the last match.
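One possible shape for such a class is sketched below. JsonScanner, getNextToken, and getContent are the names used in this answer; the token constants and the END value for "no more input" are assumptions added for the example. The Matcher remembers where the previous match ended, which takes care of "the rest of the string."

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal sketch of the scanner described above, not a full JSON lexer.
class JsonScanner {
    // Token numbers equal the capture group numbers; END is an assumed extra.
    static final int END = 0, STRING = 1, OPEN_BRACE = 2, CLOSE_BRACE = 3,
            OPEN_BRACKET = 4, CLOSE_BRACKET = 5, COLON = 6;

    private static final Pattern TOKEN = Pattern.compile(
            "\"([^\"]*)\"|(\\{)|(\\})|(\\[)|(\\])|(:)|,");

    private final Matcher matcher;
    private String content;

    JsonScanner(String text) {
        this.matcher = TOKEN.matcher(text);
    }

    // Returns the number of the next token, or END when the input is exhausted.
    int getNextToken() {
        while (matcher.find()) {
            for (int g = 1; g <= matcher.groupCount(); g++) {
                if (matcher.group(g) != null) {
                    content = matcher.group(g);
                    return g;
                }
            }
            // Only a comma matches without setting a group; keep scanning.
        }
        content = null;
        return END;
    }

    // Returns the text captured by the most recent token (null after END).
    String getContent() {
        return content;
    }
}
```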
Now you can repeatedly call these:
JsonScanner scanner = new JsonScanner(text);
scanner.getNextToken(); // should return '[' token
scanner.getNextToken(); // should return '{' token
scanner.getNextToken(); // should return quoted string
scanner.getContent(); // will return "FIRST"
etc.
You should actually check that the return values are what you expect and throw an error if they aren't.
Once you've read the first object (that is, scanner.getNextToken() returns the '}' token), you can stop scanning.
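Here is a rough sketch of the "check each token, then stop at the first '}'" idea. Since the original JSON isn't shown here, it assumes the input is an array of flat objects whose values are all quoted strings; FirstObjectReader, next, and expect are hypothetical names, and the tokenizer is inlined so the example stands on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: assumes input shaped like [{"key": "value", ...}, ...].
class FirstObjectReader {
    static final int STRING = 1, OPEN_BRACE = 2, CLOSE_BRACE = 3,
            OPEN_BRACKET = 4, COLON = 6;
    static final Pattern TOKEN = Pattern.compile(
            "\"([^\"]*)\"|(\\{)|(\\})|(\\[)|(\\])|(:)|,");

    // Advances to the next non-comma token; returns its group number, or 0 at end.
    static int next(Matcher m) {
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) return g;
            }
        }
        return 0;
    }

    // Fails loudly when the token is not the one the grammar requires here.
    static void expect(int actual, int expected) {
        if (actual != expected) {
            throw new IllegalStateException(
                    "Expected token " + expected + " but got " + actual);
        }
    }

    static Map<String, String> readFirstObject(String text) {
        Matcher m = TOKEN.matcher(text);
        expect(next(m), OPEN_BRACKET);
        expect(next(m), OPEN_BRACE);
        Map<String, String> fields = new LinkedHashMap<>();
        int token = next(m);
        while (token != CLOSE_BRACE) {      // the first '}' ends the scan
            expect(token, STRING);
            String name = m.group(STRING);
            expect(next(m), COLON);
            expect(next(m), STRING);
            fields.put(name, m.group(STRING));
            token = next(m);
        }
        return fields;
    }
}
```

Everything after the first closing brace is never examined, which is the point: you only pay for the part of the document you need.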
Note that this is a very simple implementation. The quoted string regex doesn't handle escaped characters inside the string, for example, and there are no tokens for numbers, true, false, or null. And if you wanted to actually validate the JSON, you'd have to return the comma token too and make sure commas were used properly. But this is the general idea. I've written simple parsers using this exact strategy myself.
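If escaped characters matter for your data, the quoted-string part can be widened so that a backslash always consumes the character after it. The sketch below shows one common way to write that; EscapedStringDemo and quotedStrings are made-up names. Note that it only finds the string boundaries correctly: the captured text still contains the raw escape sequences, so decoding \" or \n is a separate step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: a quoted-string pattern that tolerates backslash escapes.
class EscapedStringDemo {
    // Inside the quotes: runs of ordinary characters (anything except a
    // quote or a backslash), interleaved with backslash + any one character.
    static final Pattern QUOTED = Pattern.compile(
            "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"");

    // Returns the raw (still escaped) content of every quoted string.
    static List<String> quotedStrings(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }
}
```

This alternative could be dropped in as the first branch of the token regex in place of the simpler "[^"]*" form, keeping the group numbering unchanged because it still uses exactly one capturing group.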
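Java EE (formerly J2EE) actually contains a class called JsonParser (javax.json.stream.JsonParser, part of the JSON Processing API) that basically implements this strategy for you. It has a next method that returns an Event identifying the token; you then read the content with accessor methods such as getString.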
Here's someone who's using a similar approach, except that instead of using a single regex with a bunch of capture groups, the person is using a bunch of separate regexes and applying each one in turn until one matches.
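For comparison, the separate-regexes variant might look roughly like the sketch below. This is not that person's code, just an illustration of the idea under my own assumptions: MultiRegexLexer and the token names are invented, and each rule is tried at the current position with Matcher.lookingAt() until one matches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: one small regex per token, tried in order at the current position.
class MultiRegexLexer {
    // Hypothetical token names; LinkedHashMap keeps the rules in insertion order.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("STRING", Pattern.compile("\"[^\"]*\""));
        RULES.put("OPEN_BRACE", Pattern.compile("\\{"));
        RULES.put("CLOSE_BRACE", Pattern.compile("\\}"));
        RULES.put("OPEN_BRACKET", Pattern.compile("\\["));
        RULES.put("CLOSE_BRACKET", Pattern.compile("\\]"));
        RULES.put("COLON", Pattern.compile(":"));
        RULES.put("COMMA", Pattern.compile(","));
        RULES.put("WHITESPACE", Pattern.compile("\\s+"));
    }

    // Returns the token names in order, leaving out whitespace and commas.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            String found = null;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(text).region(pos, text.length());
                if (m.lookingAt()) {        // must match exactly at pos
                    found = rule.getKey();
                    pos = m.end();
                    break;
                }
            }
            if (found == null) {
                throw new IllegalArgumentException(
                        "Unexpected character at index " + pos);
            }
            if (!found.equals("WHITESPACE") && !found.equals("COMMA")) {
                tokens.add(found);
            }
        }
        return tokens;
    }
}
```

A side effect of anchoring every rule at the current position is that unrecognized input raises an error instead of being skipped, which the single-regex find() loop does not do.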