How can I scan a file with no delimiter between tokens in Java?

Question

I have text input which looks like this:

!10#6#4!3#4

I have two patterns for the two types of data found in the input above:

Pattern totalPattern = Pattern.compile("![0-9]+");
Pattern valuePattern = Pattern.compile("#[0-9]+");

I wanted to get the following output from the input above:

total value value total value

So I wrote the following code to solve this problem:

try (Scanner scanner = new Scanner(inputFile)) {
    while (scanner.hasNext()) {
        if (scanner.hasNext(totalPattern)) {
            System.out.print("total ");
            scanner.next(totalPattern);
        } else if (scanner.hasNext(valuePattern)) {
            System.out.print("value ");
            scanner.next(valuePattern);
        }
    }
}

But this did not work: with no delimiters in the input, the scanner cannot separate it into tokens and sees the entire input as one single token. I know I could change the input to have spaces before the # and !, but I'm not allowed to change the input for this problem. I tried changing the scanner's delimiter but couldn't get it to work. I suspect that Scanner isn't the right tool for this job, but I'm not sure what is.

Edit

As g00se pointed out, I could use ! and # as delimiters. In the above example that wouldn't be a problem, but to change the problem slightly, suppose the input was:

!!!10#6###4!!3#4

The patterns would be:

Pattern totalPattern = Pattern.compile("[!]+[0-9]+");
Pattern valuePattern = Pattern.compile("[#]+[0-9]+");

If I wanted to use the number of ! or # as part of the information encoded in the data, using ! and # as delimiters doesn't work. I still think Scanner is not the right tool; I just don't know what is.

Edit 2

Okay, I realize that I didn't really write this question in the right way and I have wasted some time. Ultimately I wanted a more general-purpose solution. It should take in a series of patterns, consume the input from the front, and return the series of tokens that the input breaks into. The two patterns from above are just examples. I realize now that because I didn't specify the type of solution I wanted, I got some great answers which solved the problem given above in the simplest way possible.

`scanner.useDelimiter("[!#]");` springs to mind. Untested — g00se, Oct 15 '22 at 17:09
@g00se the problem with that is I think it will take away the `!` from the value of `scanner.next(totalPattern)` which in this toy example is not a problem, but will be a problem for me down the road. — redmoncoreyl, Oct 15 '22 at 17:17
Well I'm mystified ;) It would be the first token in the line and you know it's always there don't you - or maybe it isn't? — g00se, Oct 15 '22 at 17:19
I wouldn't be using a Scanner at all. Open the input file, read the whole content, and then parse it. — access violation, Oct 15 '22 at 17:28
The problem we're facing is that you're drip feeding the real problem which has the effect of changing the goalposts. Ultimately that's just wasting time — g00se, Oct 15 '22 at 17:32
It sounds like you just want to iterate over all matches of the regex `([!]+|[#]+)[0-9]+` in the input file. If that's the case, `Scanner` is the wrong tool, you can just do `Pattern.compile(...).matcher(text)` and call `.find` in a loop. — kaya3, Oct 15 '22 at 17:37
If you have a non-trivial lexical analysis problem, consider using a tool designed to help you with non-trivial lexical analysis problems. Like, for example, [JFlex](https://www.jflex.de/). I'm sure there are others; I'm not very familiar with the Java environment. — rici, Oct 16 '22 at 21:21
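
For reference, kaya3's suggestion above (one combined regex, then `.find` in a loop) could look roughly like the sketch below. The class name `FindLoopSketch` and the helper `classify` are made up for illustration, and the input is a hard-coded string rather than the contents of a file. Note that `find` silently skips any text that does not match the pattern, which may or may not be what you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names are illustrative, not from the thread.
class FindLoopSketch {
    // Group 1: a run of '!' or a run of '#'. Group 2: the digits that follow.
    private static final Pattern TOKEN = Pattern.compile("(!+|#+)([0-9]+)");

    // Returns "total" or "value" for each match, in input order.
    static List<String> classify(String text) {
        List<String> kinds = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // The marker run is kept intact in group 1, so its length
            // (the number of '!' or '#') is still available if needed.
            String marker = m.group(1);
            kinds.add(marker.charAt(0) == '!' ? "total" : "value");
        }
        return kinds;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", classify("!!!10#6###4!!3#4")));
        // prints: total value value total value
    }
}
```

Because the marker run and the digits are separate capture groups, nothing is lost the way it would be if `!` and `#` were used as Scanner delimiters.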

score 1 · Answer 1 · answered Oct 15 '22 at 18:15

To parse the example strings you provided, you could do something like this:

String s = "!!!10#6###4!!3#4";
    
StringBuilder sb = new StringBuilder("");
    
/* Split the string on the numbers, which leaves
   only the runs of tag characters (! or #).     */
for (String tag : s.split("\\d{1,}")) {
    /* Replace each run of one or more ! in the current
       tag with the word "total ", and each run of one
       or more # with the word "value ".              */
    sb.append(tag.replaceAll("[!]{1,}", "total ").replaceAll("[#]{1,}", "value "));
}
    
String output = sb.toString().trim();
System.out.println(output);

If s contains !10#6#4!3#4, the console shows total value value total value. If s contains !!!10#6###4!!3#4, the console again shows total value value total value.

score 0 · Answer 2 · answered Oct 16 '22 at 17:39

What I have described as my desired solution in the last edit is called lexical analysis.

I decided that the best choice was to take inspiration from this SO post for making a Java Lexer.

Additionally, as this SO answer describes, lexing large files could be a problem, so using a buffer that I can add on to when a match isn't found might be a good idea also.
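
To make the idea concrete, here is a minimal sketch of such a lexer. It is not the code from the linked posts. The names `SimpleLexer`, `Token`, `rule` and `tokenize` are my own, and I am assuming that the first rule that matches at the current position wins (not the longest match) and that the whole input fits in memory, so the buffering idea is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a rule-driven lexer; all names are illustrative.
class SimpleLexer {
    // A token: the name of the rule that matched plus the matched text.
    static final class Token {
        final String type;
        final String text;
        Token(String type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    // LinkedHashMap keeps the rules in the order they were added.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    SimpleLexer rule(String type, String regex) {
        rules.put(type, Pattern.compile(regex));
        return this;
    }

    // Consumes the input from the front, one token at a time.
    List<Token> tokenize(CharSequence input) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Token next = null;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // lookingAt() only succeeds if the match starts exactly at pos;
                // the end() check rejects empty matches so the loop always advances.
                if (m.lookingAt() && m.end() > pos) {
                    next = new Token(rule.getKey(), m.group());
                    pos = m.end();
                    break;
                }
            }
            if (next == null) {
                throw new IllegalArgumentException("No rule matches at offset " + pos);
            }
            tokens.add(next);
        }
        return tokens;
    }

    public static void main(String[] args) {
        SimpleLexer lexer = new SimpleLexer()
                .rule("total", "!+[0-9]+")
                .rule("value", "#+[0-9]+");
        for (Token t : lexer.tokenize("!!!10#6###4!!3#4")) {
            System.out.print(t.type + " ");
        }
        System.out.println();
    }
}
```

Unlike a `find` loop, this version throws as soon as it reaches text that no rule matches, so a trailing newline in the file would need its own rule (or be trimmed first).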
