How can I format my regex to allow this?
Here's the regular expression:
"\\b[(\\w'\\-)&&[^0-9]]{4,}\\b"
It matches any word that is 4 or more letters long.
If I split, say, an article, I want an array that includes all the delimiters, plus all the text between them, in the order in which they originally appeared. For example, if I split the sentence "I need to purchase a new vehicle. I would prefer a BMW.", my desired result would be the following, where the delimiters are the matched words (need, purchase, vehicle, would, prefer).
"I ", "need", " to ", "purchase", " a new ", "vehicle", ". I ", "would", " ", "prefer", " a BMW."
So, each word with 4 or more characters is one token, while everything between two delimiters is also a single token (even if it is multiple words with whitespace). I will only be modifying the delimiters and would like to keep everything else the same, including whitespace, newlines, etc.
I read in a different thread that I could use a lookaround to get this to work, but I can't seem to format it correctly. Is it even possible to get this to work the way I'd like?
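For comparison, here is a minimal sketch of one way to get this token list without lookarounds: rather than calling split, it walks the matches with a Matcher and collects the text between them. This is a swapped-in technique, not the lookaround split from the other thread. The class name WordTokenizer and the method tokenize are hypothetical and chosen only for illustration; the pattern is the one from above, unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class WordTokenizer {
    // The pattern from the question, unchanged.
    private static final Pattern WORD =
            Pattern.compile("\\b[(\\w'\\-)&&[^0-9]]{4,}\\b");

    // Returns the matched words and the text between them, in original order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        int last = 0; // end index of the previous match
        while (m.find()) {
            if (m.start() > last) {
                // Text between the previous match and this one, kept verbatim.
                tokens.add(text.substring(last, m.start()));
            }
            tokens.add(m.group()); // the matched word itself
            last = m.end();
        }
        if (last < text.length()) {
            // Trailing text after the final match.
            tokens.add(text.substring(last));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "I need to purchase a new vehicle. I would prefer a BMW.";
        System.out.println(tokenize(s));
    }
}
```

Because every character of the input ends up in exactly one token, the matched words can be modified and the tokens then concatenated in order to rebuild the text with whitespace and newlines intact.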