
How can I format my regex to allow this?

Here's the regular expression: "\\b[(\\w'\\-)&&[^0-9]]{4,}\\b"

It looks for any word that is four letters or longer.

If I want to split, say, an article, I want an array that includes all the delimiters, plus everything between them, in the order in which they originally appeared. For example, if I split the sentence "I need to purchase a new vehicle. I would prefer a BMW.", my desired result would be the following, where the delimiters are the words of four or more letters (need, purchase, vehicle, would, prefer).

"I ", "need", " to ", "purchase", " a new ", "vehicle", ". I ", "would", " ", "prefer", " a BMW."

So, each word with 4 or more characters is one token, while everything in between two such words is also a single token (even if it is multiple words with whitespace). I will only be modifying the delimiter words and would like to keep everything else the same, including whitespace, new lines, etc.

I read in a different thread that I could use a lookaround to get this to work, but I can't seem to format it correctly. Is it even possible to get this to work the way I'd like?

aakbari1024
  • Your requirement isn't self-consistent. Why are "a new", ". I", and "a BMW." each considered one token? Why is the space after "would" considered a separate token? And why do you need the whitespace? Why is the white space before some tokens, after others, and separate from still others? – user207421 Nov 13 '13 at 01:35
  • Everything between delimiters is one token. So " a new " is one token because it's between the delimiter words "purchase" and "vehicle". I want to keep all whitespace and punctuation because the only values I'm modifying in the resulting String array will be the delimiter words. – aakbari1024 Nov 13 '13 at 01:45
  • Just to confirm that I understand correctly: "In an article, your delimiter is any word of 4 or more letters, and you then want to split the article on all these words and store them in order?" – Mukul Goel Nov 13 '13 at 01:55
  • Sorry, but I am not sure what you are trying to do. What is a delimiter and what is a token? You say that `italicized values are the delimiters`, but then you say that `words with >4 characters are one token`. – Pshemo Nov 13 '13 at 01:58
  • Yes, I want to split the String so that any word that is four letters or longer is its own token, and everything (including whitespace, new lines, multiple words that are 3 letters or less) in between two occurrences of these is also its own token. And I would like the resulting String array to have all of these tokens, in the same order as in the original String. – aakbari1024 Nov 13 '13 at 01:58
  • @Pshemo, in the example array I posted, I've italicized the words that are the delimiters, to make it easier to see how the String should be split. I want to include those delimiters in the resulting array of the split. – aakbari1024 Nov 13 '13 at 02:02
  • What is the problem then? Just use `split()` and pass a regex for `a word with 4 or more letters`? – Mukul Goel Nov 13 '13 at 02:02
  • So in fact there are no delimiters at all, only the 4-character rule? You're not making much sense here, you keep changing your mind. I'm wondering what the purpose is here. A simpler definition of the problem would be simpler to implement as well. – user207421 Nov 13 '13 at 02:03
  • @MukulGoel: Because split() does not include delimiters in the resulting array. – aakbari1024 Nov 13 '13 at 02:03
  • @aakbari1024 : You would need to write your own implementation in such a case. A possible implementation could be to `1 – Mukul Goel Nov 13 '13 at 02:08

2 Answers


I am not sure what you are trying to do, but in case you want to modify words that have at least four letters, you can use something like this (it changes every word with >=4 letters to its upper-case version):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String data = "I need to purchase a new vehicle. I would prefer a BMW.";
// A "word" is a run of 4+ letters, hyphens, underscores or apostrophes
// that is not adjacent to another such character.
Pattern pattern = Pattern.compile("(?<![a-z\\-_'])[a-z\\-_']{4,}(?![a-z\\-_'])",
        Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(data);

StringBuffer sb = new StringBuffer(); // holds the new version of our data
while (matcher.find()) { // find every matching word
    // and replace it with its upper-case version
    matcher.appendReplacement(sb, matcher.group().toUpperCase());
}
matcher.appendTail(sb); // don't forget the part after the last match

System.out.println(sb);

Output:

I NEED to PURCHASE a new VEHICLE. I WOULD PREFER a BMW.

Or, if you change the replacement code to something like

matcher.appendReplacement(sb, "["+matcher.group()+"]");

you will get

I [need] to [purchase] a new [vehicle]. I [would] [prefer] a BMW.

Now you can just split the resulting string on every [ and ] to get your desired array.
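The wrap-then-split step described above could be sketched as follows. This is only an illustration under stated assumptions: the class name `BracketSplit` and the helper `tokenize` are made up for this example, and the input is assumed to contain no literal `[` or `]` of its own.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class BracketSplit {
    // Same word pattern as in the answer above.
    static final Pattern WORD = Pattern.compile(
            "(?<![a-z\\-_'])[a-z\\-_']{4,}(?![a-z\\-_'])", Pattern.CASE_INSENSITIVE);

    static String[] tokenize(String data) {
        Matcher matcher = WORD.matcher(data);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // Wrap every long word in [ ... ] markers.
            matcher.appendReplacement(sb,
                    Matcher.quoteReplacement("[" + matcher.group() + "]"));
        }
        matcher.appendTail(sb);
        // Split on either marker character; the markers themselves are dropped.
        return sb.toString().split("[\\[\\]]");
    }

    public static void main(String[] args) {
        String data = "I need to purchase a new vehicle. I would prefer a BMW.";
        System.out.println(Arrays.toString(tokenize(data)));
    }
}
```

If the text starts with a long word, `split` returns an empty first element, so callers may want to skip empty tokens.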

Pshemo
  • I like your last part "Now you can just split such string on every `[` and `]` to get your desired array" – justhalf Nov 13 '13 at 02:21
  • This actually helps me a lot. Thanks a bunch, really appreciate it. – aakbari1024 Nov 13 '13 at 02:24
  • I have mixed feelings about this approach, because if the original string already contains `[` or `]`, split will treat them as separators too. A better choice would probably be a marker that will never appear in the original string, such as `<<<>>>`. – Pshemo Nov 13 '13 at 02:26

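Assuming that a "word" is a run of [A-Za-z] characters, you can split on this regex. Both alternatives are zero-width lookarounds, so no characters are consumed by the split. The upper bound of 50 is there because Java requires a lookbehind to have a bounded length: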

(?<=(\\b[A-Za-z]{4,50}\\b))|(?=(\\b[A-Za-z]{4,50}\\b))

Full code:

class RegexSplit{
    public static void main(String[] args){
        String str = "I need to purchase a new vehicle. I would prefer a BMW.";
        String[] tokens = str.split("(?<=(\\b[A-Za-z]{4,50}\\b))|(?=(\\b[A-Za-z]{4,50}\\b))");
        for(String token: tokens){
            System.out.print("["+token+"]");
        }
        System.out.println();
    }
}

to get this output:

[I ][need][ to ][purchase][ a new ][vehicle][. I ][would][ ][prefer][ a BMW.]
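
Since the goal in the question is to modify only the long words and keep everything else untouched, the split above could be used roughly like this. It is a minimal sketch: the class name `ModifyLongWords`, the method `upperCaseLongWords`, and the choice of upper-casing as the modification are assumptions made for illustration.

```java
// Hypothetical example class, not part of the original answer.
public class ModifyLongWords {
    // Lookaround split from the answer above; zero-width, so nothing is lost.
    static final String SPLIT_REGEX =
            "(?<=(\\b[A-Za-z]{4,50}\\b))|(?=(\\b[A-Za-z]{4,50}\\b))";

    static String upperCaseLongWords(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split(SPLIT_REGEX)) {
            // A token made only of 4 to 50 letters is one of the long words.
            boolean isLongWord = token.matches("[A-Za-z]{4,50}");
            // Modify long words; copy whitespace, punctuation and short words as-is.
            out.append(isLongWord ? token.toUpperCase() : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "I need to purchase a new vehicle. I would prefer a BMW.";
        System.out.println(upperCaseLongWords(text));
    }
}
```

Because the split consumes no characters, concatenating the tokens in order reproduces the original text exactly, including whitespace and new lines.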
justhalf