
I am trying to use java.util.Scanner to tokenize an arithmetic expression, where the delimiters can be either:

  • Whitespace (\s+ or \p{Space}+), which should be discarded
  • Punctuation (\p{Punct}), which should be returned as tokens

Example

Given this expression:

12 + (ab-bc*3)

I would like Scanner to return these tokens:

  • 12
  • +
  • (
  • ab
  • -
  • bc
  • *
  • 3
  • )

Code

So far, I have only been able to:

  • Eat up all of the punctuation characters (not what I wanted):
    • new Scanner("12 + (ab-bc*3)").useDelimiter("\\p{Space}+|\\p{Punct}").tokens().collect(Collectors.toList())
    • Result: "12", "", "", "", "ab", "bc", "3"
  • Achieve partial success using a positive lookahead:
    • new Scanner("12 + (ab-bc*3)").useDelimiter("\\p{Space}+|(?=\\p{Punct})").tokens().collect(Collectors.toList())
    • Result: "12", "+", "(ab", "-bc", "*3", ")"

But now I am stuck.

Danilo Piazzalunga
  • You could match them all using `"\\p{Punct}|\\w+"` regex – Wiktor Stribiżew Oct 30 '19 at 10:59
  • Unfortunately, this regex ate all my tokens: `new Scanner("12 + (ab-bc*3)").useDelimiter("\\p{Punct}|\\w+").tokens()` returns only empty strings – Danilo Piazzalunga Oct 30 '19 at 11:08
  • I said *matching* them all; in Scanner, you *split* with the pattern. See [Java demo](https://ideone.com/ED5EY1). – Wiktor Stribiżew Oct 30 '19 at 11:11
  • I'd say you won't get what you want because you only specify the kind of delimiter you want, but you never say what kind of token you want. As far as Scanner is concerned, there are no delimiters in `"-bc"`, and I don't think there's any possible configuration to change that (technically, a delimiter in there is an "empty char", which isn't actually a thing). You need to say what kinds of tokens you want, by changing `.tokens()` to `.findAll` with a proper regex, like what Wiktor suggested. – M. Prokhorov Oct 30 '19 at 11:13
  • @WiktorStribiżew you're right, it worked! `Pattern.compile("\\p{Punct}|\\w+").matcher("12 + (ab-bc*3)").results().map(MatchResult::group).collect(Collectors.toList())` returns `"12", "+", "(", "ab", "-", "bc", "*", "3", ")"` – Danilo Piazzalunga Oct 30 '19 at 11:24
  • @DaniloPiazzalunga, you may want to save the compiled pattern somewhere in a real app, so you don't keep creating and parsing the regex for the same thing. – M. Prokhorov Oct 30 '19 at 12:16

1 Answer


A matching approach allows you to use a much simpler regex here:

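import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

String text = "12 + (ab-bc*3)";
// Matcher.results() requires Java 9 or later
List<String> results = Pattern.compile("\\p{Punct}|\\w+").matcher(text)
    .results()
    .map(MatchResult::group)
    .collect(Collectors.toList());
System.out.println(results);
// => [12, +, (, ab, -, bc, *, 3, )]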

See Java demo.
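
Since the question asks about java.util.Scanner specifically, the same matching idea can probably stay inside Scanner by calling findAll (Java 9+) instead of tokens(), as one of the comments suggests. A minimal sketch, in which the class name ScannerFindAllSketch and the method tokenize are made up for illustration:

```java
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ScannerFindAllSketch {
    // Compiled once and reused, following the advice in the comments.
    private static final Pattern TOKEN = Pattern.compile("\\p{Punct}|\\w+");

    // Hypothetical helper: returns every match of TOKEN, in input order.
    static List<String> tokenize(String text) {
        try (Scanner scanner = new Scanner(text)) {
            // findAll ignores the Scanner's delimiter and streams pattern matches.
            // The stream is fully collected before the Scanner is closed.
            return scanner.findAll(TOKEN)
                .map(MatchResult::group)
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("12 + (ab-bc*3)"));
        // prints [12, +, (, ab, -, bc, *, 3, )]
    }
}
```

Whitespace is never matched by either alternative, so it is skipped without being declared as a delimiter.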

The regex matches

  • \p{Punct} - punctuation and symbol chars
  • | - or
  • \w+ - 1+ letters, digits or _ chars.

See the regex demo (converted to PCRE for the demo purpose).
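
For comparison, the splitting approach from the question can also be completed, at the cost of a more involved regex: adding a lookbehind next to the lookahead turns the empty position on either side of a punctuation character into a delimiter. This is only a sketch under assumptions. The delimiter pattern, the class name ScannerDelimiterSketch and the method tokenize are not part of the original answer, and the code is intended for short, in-memory strings such as the example.

```java
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ScannerDelimiterSketch {
    // Assumed delimiter: a run of whitespace (discarded), or a zero-width
    // position just before or just after a punctuation character.
    private static final Pattern DELIMITER =
        Pattern.compile("\\p{Space}+|(?=\\p{Punct})|(?<=\\p{Punct})");

    // Hypothetical helper: splits the text with Scanner.tokens() (Java 9+).
    static List<String> tokenize(String text) {
        try (Scanner scanner = new Scanner(text)) {
            return scanner.useDelimiter(DELIMITER)
                .tokens()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("12 + (ab-bc*3)"));
        // prints [12, +, (, ab, -, bc, *, 3, )]
    }
}
```

The zero-width alternatives consume no characters, so each punctuation character survives as its own token, while the whitespace alternative still swallows the spaces.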

Wiktor Stribiżew