Java match string until the first occurrence of a different string

Question

I am facing other issues with Java's matcher. I am trying to match the content of my JSON using regex, without using external libraries. My JSON looks like this:

[
{
"FIRST":"Tom",
"LAST":"Hanks",
"SUFFIX":""
},
{
"FIRST":"Sammy",
"LAST":"Davis",
"SUFFIX":"Jr."
}
]

However, I only want to match the words in the first object, i.e. everything before the first }, in the text. I tried to create a second pattern and matcher, but I don't know how to proceed, or how to break out of the while loop at the first occurrence of },

Pattern pattern = Pattern.compile("\"([^\"]*)\"");
Matcher matcher = pattern.matcher(loadJSONtoString(fileName).toString());
Pattern pattern2 = Pattern.compile("},");
Matcher matcher2 = pattern2.matcher(loadJSONtoString(fileName).toString());
while(matcher.find())
   {
         myList.add(matcher.group());
         //if matcher2 encounters }. for the first time: break;  
   }

Do not use regex. Use a JSON parser. Not using libraries doesn’t make sense. — Jens, Mar 13 '21 at 21:11
@Jens is there a core library that is used for JSON parsing? — Sam333, Mar 13 '21 at 21:12
I explicitly said I cannot use any external libraries, core libraries are libraries which come together with JDK, you can take a look: https://openjdk.java.net/groups/core-libs/ — Sam333, Mar 13 '21 at 21:17

Willis Blackburn · Answer 1 · 2021-03-13T21:23:46.640

Because you specifically said that you don't want to use external libraries, I'll assume that you realize that parsing JSON with regular expressions is taking the long way home and that you should really use a JSON-parsing library.

With that out of the way, consider how the JSON-parsing library itself works. Its tokenizer effectively does the work of regular expressions (many libraries hand-code this step rather than calling a regex engine, but the idea is the same). Instead of one pattern, it has many of them, each one designed to recognize a certain element of the JSON syntax: a quoted string, a number, an open or closing brace, etc. The parser defines the JSON syntax in terms of these patterns and things built from them: a "field" is a quoted string followed by a colon followed by a value, a "value" is a quoted string or a number or an array or an object, an "object" is an open brace followed by zero or more fields followed by a closing brace, etc.

A good way to write a very simple parser using regular expressions is to define a regular expression that returns "tokens." You define each token as a capturing group and separate them with '|' so the regular expression can match any of them. You determine which token it found by checking which capturing group matched. (At most one of them will match on any given call.) Then repeatedly use the regex to return tokens until you get the one you want.

Here's a simple regex that would probably work for you:

Pattern pattern = Pattern.compile("\"([^\"]*)\"|(\\{)|(\\})|(\\[)|(\\])|(:)|,");

The match groups are:

1. A quoted string
2. An open brace
3. A close brace
4. An open square bracket
5. A close square bracket
6. A colon

Note that there's no capturing group for the comma, because it's noise for our purposes and we don't really care about it. Also, the capturing group for the quoted string excludes the quotes themselves, so you don't have to remove the quotes during parsing.
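To make the group numbering concrete, here is a minimal sketch that runs the token regex over a string and records which group matched each time. The class name TokenDemo and the helper describeTokens are made up for illustration. The braces are escaped in the pattern because java.util.regex rejects a bare { with an "Illegal repetition" error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenDemo {
    // One alternative per token. Groups 1-6 follow the list above;
    // the trailing comma alternative deliberately has no group.
    static final Pattern TOKEN = Pattern.compile(
            "\"([^\"]*)\"|(\\{)|(\\})|(\\[)|(\\])|(:)|,");

    // Hypothetical helper: returns "groupNumber:content" for every token found.
    static List<String> describeTokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Find the one group that took part in this match (none for a comma).
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    out.add(g + ":" + m.group(g));
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describeTokens("[{\"FIRST\":\"Tom\",\"LAST\":\"Hanks\"}]"));
    }
}
```

A group that did not take part in the match returns null from group(n), which is how the loop tells the tokens apart. An empty quoted string still counts as a match for group 1, with an empty string as its content.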

You need to build a function that applies this regular expression to the rest of the string (that is, whatever you haven't yet parsed), then loops over the match groups and returns the number of the group that matched along with its content. For some matches the content won't be that interesting (if it matched the colon, the content will always be the colon), but for the quoted string, the content will be the string itself.

Let's say there's a class called JsonScanner that has two methods:

getNextToken applies the regex and returns the number of the token (that is, the number of the match group).
getContent returns the content from the last match.
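A minimal sketch of such a class might look like the following. The constant names and the -1 return value at end of input are my own assumptions for illustration, and the pattern is the token regex from above with the braces escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsonScanner {
    // Token numbers are simply the capturing-group numbers of the regex.
    static final int STRING = 1, OPEN_BRACE = 2, CLOSE_BRACE = 3,
                     OPEN_BRACKET = 4, CLOSE_BRACKET = 5, COLON = 6;
    static final int END = -1; // assumed convention for "no more tokens"

    private static final Pattern TOKEN = Pattern.compile(
            "\"([^\"]*)\"|(\\{)|(\\})|(\\[)|(\\])|(:)|,");

    private final Matcher matcher;
    private String content;

    JsonScanner(String text) {
        // The Matcher remembers its position, so each find() continues
        // from the end of the previous token.
        this.matcher = TOKEN.matcher(text);
    }

    // Returns the group number of the next token, or END when input is exhausted.
    int getNextToken() {
        while (matcher.find()) {
            for (int g = 1; g <= matcher.groupCount(); g++) {
                if (matcher.group(g) != null) {
                    content = matcher.group(g);
                    return g;
                }
            }
            // No group matched: this was a comma, so keep looking.
        }
        content = null;
        return END;
    }

    // Returns the content of the most recently returned token.
    String getContent() {
        return content;
    }
}
```

Because find() skips over anything the pattern doesn't recognize, this sketch silently ignores whitespace, numbers, and malformed input rather than reporting them.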

Now you can repeatedly call these:

JsonScanner scanner = new JsonScanner(text);
scanner.getNextToken(); // should return the '[' token
scanner.getNextToken(); // should return the '{' token
scanner.getNextToken(); // should return the quoted-string token
scanner.getContent();   // will return the string FIRST (without the quotes)
// etc.

You should actually check that the return values are what you expect and throw an error if they aren't.

Once you read the first object (scanner.getNextToken() returns the '}' token), you can stop scanning.
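Applied to the original question, stopping at the first '}' could be sketched like this. To keep the example self-contained it uses a Matcher directly rather than a scanner class; the class name FirstObject and the method firstObjectStrings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstObject {
    // Same token regex as above, with the braces escaped for java.util.regex.
    static final Pattern TOKEN = Pattern.compile(
            "\"([^\"]*)\"|(\\{)|(\\})|(\\[)|(\\])|(:)|,");

    // Collects every quoted string that appears before the first closing brace.
    static List<String> firstObjectStrings(String json) {
        List<String> strings = new ArrayList<>();
        Matcher m = TOKEN.matcher(json);
        while (m.find()) {
            if (m.group(3) != null) {
                break; // group 3 is '}', so the first object has ended
            }
            if (m.group(1) != null) {
                strings.add(m.group(1)); // group 1 is the string without its quotes
            }
        }
        return strings;
    }

    public static void main(String[] args) {
        String json = "[{\"FIRST\":\"Tom\",\"LAST\":\"Hanks\",\"SUFFIX\":\"\"},"
                + "{\"FIRST\":\"Sammy\",\"LAST\":\"Davis\",\"SUFFIX\":\"Jr.\"}]";
        System.out.println(firstObjectStrings(json));
    }
}
```

Because the quoted-string alternative consumes a whole string in one match, a '}' that appears inside a string value does not end the loop early. This sketch returns keys and values together; telling them apart would mean also checking for the colon token.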

Note that this is a very simple implementation. The quoted string regex doesn't handle escaped characters inside the string, for example. And if you wanted to actually validate the JSON, you'd have to return the comma token too and make sure commas were used properly. But this is the general idea. I've written simple parsers using this exact strategy myself.
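As one possible way to address the escaped-character limitation, the quoted-string part of the regex can be changed to accept either an ordinary character or a backslash followed by any character. This is only a sketch: the class and method names are made up, and it returns the raw content with the backslashes still in place rather than decoding the escapes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedStrings {
    // Inside the quotes, repeat: any character other than a quote or backslash,
    // OR a backslash followed by any single character (an escape sequence).
    static final Pattern QUOTED = Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Hypothetical helper: returns the raw (still escaped) content of each string.
    static List<String> quotedStrings(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // The JSON text is {"NAME":"Dwayne \"The Rock\" Johnson"}
        String json = "{\"NAME\":\"Dwayne \\\"The Rock\\\" Johnson\"}";
        System.out.println(quotedStrings(json));
    }
}
```

With the simpler [^"]* version, the same input would be cut off at the first escaped quote. Turning sequences such as \" or \n into the characters they stand for is a separate step that this sketch leaves out.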

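Java EE (now Jakarta EE) actually contains an interface called JsonParser (javax.json.stream.JsonParser) that basically implements this strategy for you. Its next method returns an Event that identifies the token, and methods such as getString return the content. Note that it is not part of the JDK itself.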

Here's someone who's using a similar approach, except that instead of using a single regex with a bunch of capture groups, the person is using a bunch of separate regexes and applying each one in turn until one matches.

Hi, thank you for your answer. In your code, did you mean `Scanner scanner = new Scanner(string);`? With this, I am trying to call the `getNextToken` method but it seems like the scanner doesn't recognize this method — Sam333, Mar 13 '21 at 21:19
`Scanner` is a class that you have to write. :-) You construct it with the original JSON text. Every time you call `getNextToken` you apply the regex and return the number of the match group. Store the matched content in a field and return it from `getContent`. I'll change the name so it's obvious that it's not the built-in `java.util.Scanner`. — Willis Blackburn, Mar 13 '21 at 21:23
