Pattern, matcher in Java, REGEX help

Question

I'm trying to just get rid of duplicate consecutive words from a text file, and someone mentioned that I could do something like this:

Pattern p = Pattern.compile("(\\w+) \\1");
StringBuilder sb = new StringBuilder(1000);
int i = 0;
for (String s : lineOfWords) { // lineOfWords is a List<String> holding each line read from the text file
    Matcher m = p.matcher(s.toUpperCase());
    // and then do something like
    while (m.find()) {
        // do something here
    }
}

I tried looking at m.end() to see if I could create a new string, or remove the items where the matches are, but after reading the documentation I wasn't sure how it works. For example, as a test case to see how it worked, I did:

if (m.find()) {
    System.out.println(s.substring(i, m.end()));
}

I ran it on a text file that contains: This is an example example test test test.

Why is my output "This is"?

Edit:

Suppose I have an ArrayList lineOfWords that holds each line read from a .txt file, and I create a new ArrayList to hold the modified strings. For example:

List<String> newString = new ArrayList<String>();
for (String s : lineOfWords) {
    s = s.replaceAll( code from Kobi here);
    newString.add(s);
}

But then it gives me the original s rather than the new s. Is it because of shallow vs. deep copy?
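A minimal sketch of the loop described above, assuming the elided replaceAll argument is the regex from Kobi's answer; the class and method names (DedupLines, dedupe) are made up for illustration. Java strings are immutable, so replaceAll returns a new string and never changes the one stored in lineOfWords. The reassigned s is what gets added to the new list, so shallow vs. deep copying should not be involved. If the original text still shows up, one possibility is that the original list is being printed instead of the new one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DedupLines {
    // Assumption: this is the regex from the accepted answer.
    static final String DUPLICATES = "\\b(\\w+)\\b(\\s+\\1)+\\b";

    // Builds and returns a new list; the input list is left untouched.
    static List<String> dedupe(List<String> lineOfWords) {
        List<String> newString = new ArrayList<String>();
        for (String s : lineOfWords) {
            // replaceAll returns a new String; s now refers to that new value.
            s = s.replaceAll(DUPLICATES, "$1");
            newString.add(s);
        }
        return newString;
    }

    public static void main(String[] args) {
        List<String> lineOfWords =
                Arrays.asList("This is an example example test test test.");
        System.out.println(dedupe(lineOfWords)); // the modified lines
        System.out.println(lineOfWords);         // the original lines, unchanged
    }
}
```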

What's `i` in that second fragment? There is no trace of it anywhere else in the code you show... — Alex Martelli, Aug 04 '10 at 04:48
Hi, Crystal. It is best to ask a new question in that case, it really is another question on another subject. (on a relevant note - back when I studied Java it didn't have generics nor foreach loops `:P`) — Kobi, Aug 06 '10 at 09:26

Kobi · Accepted Answer · 2010-08-04T04:58:43.487

3

Try something like:

s = s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");

That regex is a bit stronger than yours - it checks for whole words (no partial matches), and gets rid of any number of consecutive repetitions.
The regex captures a first word: \b(\w+)\b, and then attempts to match spaces and repetitions of that word: (\s+\1)+. The final \b is to avoid partial matching of \1, as in "for formatting".
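As a rough illustration, here is that replaceAll call applied to the sample line from the question and to the "for formatting" case. The class name and the removeRepeats helper are invented for this example.

```java
class RemoveRepeats {
    // Collapses runs of the same whole word into a single occurrence.
    static String removeRepeats(String s) {
        return s.replaceAll("\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    public static void main(String[] args) {
        // "example example" and "test test test" are each collapsed;
        // "This is" is left alone because "is" inside "This" is not a whole word.
        System.out.println(removeRepeats("This is an example example test test test."));
        // prints: This is an example test.

        // The trailing \b rejects the partial repetition inside "formatting".
        System.out.println(removeRepeats("for formatting"));
        // prints: for formatting
    }
}
```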

edited Aug 04 '10 at 04:58

answered Aug 04 '10 at 04:52


That helped out a lot. Is there a way to check for things that are different case? Like "test Test"? – Crystal Aug 05 '10 at 04:03
@Crystal - Thanks! You can add `(?i)` at the beginning of the regex to make it case-insensitive, it seems like the standard solution for `replaceAll`. – Kobi Aug 05 '10 at 04:16
Another question Kobi if you have a second, if I am looping through an Arraylist that has my lines of words from a test file, and if I did a foreach loop to go through it, like for (String s: lineOfWords) { s = s.replaceAll..., then how would I add this new "s" to my new ArrayList to return. I think it has to do with shallow vs deep copy, but not sure. I tried pseudo-coding in my initial question above. Thx! – Crystal Aug 06 '10 at 01:09
You mustn’t use `\b` and such in Java. They are [super-broken](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261). For example, the string `élève` is not matched by the pattern `\b\w+\b` *anywhere whatsoever*. – tchrist Dec 02 '10 at 03:19
@tchrist - Hello! Yes, I've noticed you raise that unfortunate issue lately. I'll keep it in mind when Unicode support is necessary. I guess the best workaround here is not to use a monstrosity of a regex for every `\b` or `\w`, but to use a regex library that works `:P` – Kobi Dec 02 '10 at 05:14
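A small sketch of the two points raised in these comments, under stated assumptions. The (?i) prefix is the case-insensitive flag Kobi mentions. The (?U) prefix is the embedded form of Pattern.UNICODE_CHARACTER_CLASS, which exists from Java 7 onward and was not part of the original discussion; it is shown here only as one possible way to make \w and \b Unicode-aware. The class and method names are made up for illustration.

```java
class FlagDemo {
    // (?i): case-insensitive matching, which also applies to the \1 backreference.
    static String dedupeIgnoreCase(String s) {
        return s.replaceAll("(?i)\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    // (?U): Unicode-aware \w and \b (assumes Java 7 or later).
    static String dedupeUnicode(String s) {
        return s.replaceAll("(?U)\\b(\\w+)\\b(\\s+\\1)+\\b", "$1");
    }

    public static void main(String[] args) {
        // The first occurrence is the one that is kept.
        System.out.println(dedupeIgnoreCase("test Test")); // prints: test

        // "\u00e9l\u00e8ve" is the word from tchrist's comment, written with escapes.
        System.out.println(dedupeUnicode("\u00e9l\u00e8ve \u00e9l\u00e8ve"));
    }
}
```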

score 1 · Answer 2 · answered Aug 04 '10 at 04:51

The first match is "ThIS IS an example...", so m.end() points to the end of the second "is". I'm not sure why you use i for the start index; try m.start() instead.
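To see this concretely, the following sketch runs the question's pattern over the upper-cased sample line and reports where the first match starts and ends. The class name and the firstMatch helper are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstMatchDemo {
    // The pattern from the question, without word boundaries.
    static final Pattern P = Pattern.compile("(\\w+) \\1");

    // Returns {start, end} of the first match, or null if there is none.
    static int[] firstMatch(String s) {
        Matcher m = P.matcher(s.toUpperCase());
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        String s = "This is an example example test test test.";
        int[] span = firstMatch(s);
        System.out.println(span[0] + ".." + span[1]);      // prints: 2..7
        // Starting at 0 (the question's i) includes the leading "Th".
        System.out.println(s.substring(0, span[1]));       // prints: This is
        // Starting at m.start() gives only the matched text.
        System.out.println(s.substring(span[0], span[1])); // prints: is is
    }
}
```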

To improve your regex, use \b before and after the word to indicate that there should be word boundaries: (\\b\\w+\\b). Otherwise, as you're seeing, you'll get matches inside of words.
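A brief sketch of the effect, assuming the same sample line: with \b around the captured word, the match inside "THIS IS" goes away and the first match becomes the repeated "EXAMPLE". Class and method names are hypothetical. The second call shows that this pattern, which has no \b after \1, can still match part of a following word, which is the "for formatting" case the accepted answer handles with a final \b.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // The captured word must be a whole word; \1 is not bounded on the right.
    static final Pattern BOUNDED = Pattern.compile("(\\b\\w+\\b) \\1");

    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String s) {
        Matcher m = BOUNDED.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("THIS IS AN EXAMPLE EXAMPLE TEST TEST TEST."));
        // prints: EXAMPLE EXAMPLE

        System.out.println(firstMatch("for formatting"));
        // prints: for for
    }
}
```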
