Apache Commons to the rescue!
import org.apache.commons.text.StringTokenizer
import org.apache.commons.text.matcher.StringMatcher
import org.apache.commons.text.matcher.StringMatcherFactory
@Grab(group='org.apache.commons', module='commons-text', version='1.3')
def str = /is this 'completely "impossible"' or """slightly"" impossible" to parse?/
StringTokenizer st = new StringTokenizer( str )
StringMatcher sm = StringMatcherFactory.INSTANCE.quoteMatcher()
st.setQuoteMatcher( sm )
println st.tokenList
Output:
[is, this, completely "impossible", or, "slightly" impossible, to, parse?]
A few notes:
- this is written in Groovy... it is in fact a Groovy script. The @Grab line gives a clue to the sort of dependency line you need (e.g. in build.gradle) ... or just include the .jar in your classpath of course
- StringTokenizer here is NOT java.util.StringTokenizer ... as the import line shows, it is org.apache.commons.text.StringTokenizer
- the def str = ... line is a way to produce a String in Groovy which contains both single quotes and double quotes without having to go in for escaping
- StringMatcherFactory in Apache commons-text 1.3 is worth looking up: as you will see, the INSTANCE can provide you with a bunch of different StringMatchers. You could even roll your own, but you'd need to examine the StringMatcherFactory source code to see how it's done.
- YES! Not only can you include the "other type of quote" and have it correctly interpreted as not being a token boundary ... you can even escape the actual quote which is being used to turn off tokenising, by doubling it within the tokenisation-protected bit of the String! Try implementing that with a few lines of code ... or rather don't!
PS why is it better to use Apache Commons than any other solution?
Apart from the fact that there is no point re-inventing the wheel, I can think of at least two reasons:
- The Apache engineers can be counted on to have anticipated all the gotchas and developed robust, comprehensively tested, reliable code
- It means you don't clutter up your beautiful code with stoopid utility methods - you just have a nice, clean bit of code which does exactly what it says on the tin, leaving you to get on with the, um, interesting stuff...
PPS Nothing obliges you to look on the Apache code as mysterious "black boxes". The source is open, and written in usually perfectly "accessible" Java. Consequently you are free to examine how things are done to your heart's content. It's often quite instructive to do so.
later
Sufficiently intrigued by ArtB's question I had a look at the source:
in StringMatcherFactory.java we see:
private static final AbstractStringMatcher.CharSetMatcher QUOTE_MATCHER =
        new AbstractStringMatcher.CharSetMatcher("'\"".toCharArray());
... rather dull ...
so that leads one to look at StringTokenizer.java:
public StringTokenizer setQuoteMatcher(final StringMatcher quote) {
    if (quote != null) {
        this.quoteMatcher = quote;
    }
    return this;
}
OK... and then, in the same java file:
private int readWithQuotes(final char[] srcChars ...
which contains the comment:
// If we've found a quote character, see if it's followed by a second quote. If so, then we need to actually put the quote character into the token rather than end the token.
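To make that comment concrete, here is a rough, dependency-free sketch of the doubled-quote rule in plain Java. It is emphatically not the Apache implementation: the class name QuoteAwareSplitter and its split method are made up for illustration, it assumes whitespace is the only delimiter and that ' and " are the only quote characters, and it ignores everything else the real StringTokenizer handles (trimmers, ignored characters, empty tokens and so on).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of commons-text: a simplified illustration only.
final class QuoteAwareSplitter {

    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inToken = false; // true once the current token has started
        char openQuote = 0;      // 0 means "not inside a quoted section"

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (openQuote != 0) {
                if (c == openQuote) {
                    boolean doubled = i + 1 < input.length()
                            && input.charAt(i + 1) == openQuote;
                    if (doubled) {
                        current.append(c); // doubled quote: keep one literal quote
                        i++;               // and skip its twin
                    } else {
                        openQuote = 0;     // single quote: the quoted section ends
                    }
                } else {
                    current.append(c);     // includes the "other type of quote"
                }
            } else if (c == '\'' || c == '"') {
                openQuote = c;             // start of a quoted section
                inToken = true;
            } else if (Character.isWhitespace(c)) {
                if (inToken) {             // a delimiter ends the current token
                    tokens.add(current.toString());
                    current.setLength(0);
                    inToken = false;
                }
            } else {
                current.append(c);
                inToken = true;
            }
        }
        if (inToken) {
            tokens.add(current.toString()); // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        String str = "is this 'completely \"impossible\"' or "
                + "\"\"\"slightly\"\" impossible\" to parse?";
        System.out.println(split(str));
    }
}
```

Run against the example string from the top of this answer, this sketch produces the same seven tokens as the Apache tokenizer. It is already a fair bit longer than "a few lines", which is rather the point.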
... I can't be bothered to follow the clues any further. You have a choice: either your "hackish" solution, where you systematically pre-process your strings before submitting them for tokenising, turning each \" into "" (i.e. replacing every backslash-escaped quote with a doubled quote)...
Or... you examine org.apache.commons.text.StringTokenizer.java to figure out how to tweak the code. It's a small file. I don't think it would be that difficult. Then you compile, essentially making a fork of the Apache code.
I don't think it can be configured. But if you found a code-tweak solution which made sense you might submit it to Apache and then it might be accepted for the next iteration of the code, and your name would figure at least in the "features request" part of Apache: this could be a form of kleos through which you achieve programming immortality...
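For what it's worth, the pre-processing option mentioned above is a one-liner. The following is only a sketch: the class and method names (BackslashQuoteRewriter, doubleEscapedQuotes) are invented for illustration, and it assumes that every backslash immediately followed by a double quote is meant as an escaped quote. A string containing an escaped backslash followed by a real closing quote would be rewritten wrongly.

```java
// Hypothetical helper, not part of commons-text.
final class BackslashQuoteRewriter {

    // Replace every backslash-escaped double quote (\") with a doubled quote (""),
    // which is the escape form the commons-text tokenizer understands.
    static String doubleEscapedQuotes(String input) {
        return input.replace("\\\"", "\"\"");
    }

    public static void main(String[] args) {
        // The raw text here is: say "he said \"hi\"" now
        String raw = "say \"he said \\\"hi\\\"\" now";
        System.out.println(doubleEscapedQuotes(raw));
    }
}
```

You would then hand the rewritten string to the StringTokenizer exactly as in the script at the top.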