
My input:

 1. end 
 2. end of the day or end of the week 
 3. endline
 4. something 
 5. "something" end

Based on the above discussions, if I replace a single string using this snippet, it successfully removes the matching words from each line:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

public class DeleteTest {

    public static void main(String[] args) {
        try {
            File file = new File("C:/Java samples/myfile.txt");
            // The cleaned lines are written to a temp file next to the original
            File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
            String delete = "end";
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
            PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));

            for (String line; (line = reader.readLine()) != null;) {
                // Remove every whole-word occurrence of the target string
                line = line.replaceAll("\\b" + delete + "\\b", "");
                writer.println(line);
            }
            reader.close();
            writer.close();
        } catch (Exception e) {
            System.out.println("Something went Wrong");
        }
    }
}

My output when I use the above snippet (this is also my expected output):

 1.  
 2. of the day or of the week
 3. endline
 4. something
 5. "something"

But when I want to delete more than one word and use a Set for that purpose, as in the snippet below:

public static void main(String[] args) {
    try {
        File file = new File("C:/Java samples/myfile.txt");
        File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));

        Set<String> toDelete = new HashSet<>();
        toDelete.add("end");
        toDelete.add("something");

        for (String line; (line = reader.readLine()) != null;) {
            line = line.replaceAll("\\b" + toDelete + "\\b", "");
            writer.println(line);
        }
        reader.close();
        writer.close();
    } catch (Exception e) {
        System.out.println("Something went Wrong");
    }
}

I get this output instead (it only removes the spaces between words):

 1. end
 2. endofthedayorendoftheweek
 3. endline
 4. something
 5. "something" end 

Can you help me with this?


Kohei TAMURA
venk

2 Answers

1

You need to create an alternation group out of the set with

String.join("|", toDelete)

and use it like this:

line = line.replaceAll("\\b(?:"+String.join("|", toDelete)+")\\b", "");

The pattern will look like

\b(?:end|something)\b

Here, (?:...) is a non-capturing group: it groups several alternatives without storing the matched text in a capture buffer, which you do not need because the matches are simply removed.

Or, better, compile the regex before entering the loop:

Pattern pat = Pattern.compile("\\b(?:" + String.join("|", toDelete) + ")\\b");
...
    line = pat.matcher(line).replaceAll("");
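
As a rough, self-contained sketch of the precompiled alternation applied to the sample lines from the question: the class name AlternationDemo and the helpers buildPattern and removeWords are hypothetical names chosen for illustration, and a LinkedHashSet is used only to keep the order of the alternatives predictable. It assumes every word consists of letters, digits, or underscores.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class AlternationDemo {

    // Builds \b(?:w1|w2|...)\b once so the same Pattern can be reused for every line.
    // Assumption: the words contain no regex metacharacters.
    static Pattern buildPattern(Set<String> words) {
        return Pattern.compile("\\b(?:" + String.join("|", words) + ")\\b");
    }

    // Removes every whole-word match of the pattern from the line.
    static String removeWords(String line, Pattern pattern) {
        return pattern.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        Set<String> toDelete = new LinkedHashSet<>(Arrays.asList("end", "something"));
        Pattern pattern = buildPattern(toDelete);

        List<String> lines = Arrays.asList(
                "end",
                "end of the day or end of the week",
                "endline",
                "something",
                "\"something\" end");

        for (String line : lines) {
            // Brackets make leftover whitespace visible in the console
            System.out.println("[" + removeWords(line, pattern) + "]");
        }
    }
}
```

Note that only the words are removed, so the spaces that surrounded them stay in the line, and "endline" is left alone because there is no word boundary after "end" there.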

UPDATE:

To match whole "words" that may contain special characters, apply Pattern.quote to each word so that those characters are escaped. You also need unambiguous word boundaries: use the negative lookbehind (?<!\w) instead of the leading \b to make sure there is no word character before the match, and the negative lookahead (?!\w) instead of the trailing \b to make sure there is no word character after it.

In Java 8, you may use this code:

Set<String> nToDel = toDelete.stream()
    .map(Pattern::quote)
    .collect(Collectors.toCollection(HashSet::new));
String pattern = "(?<!\\w)(?:" + String.join("|", nToDel) + ")(?!\\w)";

With items such as +end and something- in the set, the regex will look like (?<!\w)(?:\Q+end\E|\Qsomething-\E)(?!\w). Note that the characters between \Q and \E are parsed as literal characters.
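
A minimal sketch of the quoted variant follows, under the assumption that the items are +end and something- as in the pattern above. The class name QuotedWordRemover and its helper methods are hypothetical, and Collectors.joining is used here in place of the intermediate set purely for brevity.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuotedWordRemover {

    // Quotes each item so characters such as + or - are matched literally,
    // then wraps the alternation in lookarounds that act as word boundaries.
    static Pattern buildPattern(Set<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?<!\\w)(?:" + alternation + ")(?!\\w)");
    }

    static String removeWords(String line, Pattern pattern) {
        return pattern.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical items containing non-word characters
        Set<String> toDelete = new LinkedHashSet<>(Arrays.asList("+end", "something-"));
        Pattern pattern = buildPattern(toDelete);

        System.out.println(pattern.pattern());
        System.out.println("[" + removeWords("a +end b", pattern) + "]");
        // Not removed: a word character directly precedes "+end"
        System.out.println("[" + removeWords("x+end", pattern) + "]");
        // Not removed: a word character directly follows "something-"
        System.out.println("[" + removeWords("something-x", pattern) + "]");
    }
}
```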

Wiktor Stribiżew
  • Thanks Wiktor. Can you suggest a way to include things such as parentheses and special characters inside the regex pattern? – venk May 08 '17 at 09:11
  • @venk: Do you mean you will have non-word (non-letter/digit/`_`) in the `toDelete` items? Then your word boundary based approach might fail to find the matches (if those non-word chars appear at the start/end as in `\b(?:+end|something-)\b`). You'd need to run `Pattern.quote` on all the items and then use `(?<!\w)` and `(?!\w)` instead of the `\b`s. – Wiktor Stribiżew May 08 '17 at 09:25
  • @venk: See the updated answer. Please consider accepting if it works for you. – Wiktor Stribiżew May 08 '17 at 09:55
  • Thanks for the solution. Worked as expected. :) – venk May 09 '17 at 04:06
0

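The problem is that you're not creating the correct regex for replacing the words in the set.

"\\b"+toDelete+"\\b" concatenates the set's toString() output, so it produces the String \b[end, something]\b, which is not what you need. The regex engine reads [end, something] as a character class, so the pattern matches any single listed character (including the space) that has a word boundary on both sides. That is why only the spaces between words were removed.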

To fix that you can do something like this:

for(String del : toDelete){
    line = line.replaceAll("\\b"+del+"\\b", "");
}

This iterates over the set, builds a regex from each word, and removes that word from the line.
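
To make the difference concrete, here is a small sketch that prints the regex produced by concatenating the set directly and compares its effect with the per-word loop. The class name SetRegexDemo and the helpers concatenatedRegex and removeEach are hypothetical, and a LinkedHashSet is assumed only so that the element order is predictable.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class SetRegexDemo {

    // Concatenating the Set calls its toString(), which yields "[end, something]".
    static String concatenatedRegex(Set<String> toDelete) {
        return "\\b" + toDelete + "\\b";
    }

    // Per-word loop: one replaceAll call per entry in the set.
    // Assumption: every word is made of letters, digits, or underscores only.
    static String removeEach(String line, Set<String> toDelete) {
        for (String del : toDelete) {
            line = line.replaceAll("\\b" + del + "\\b", "");
        }
        return line;
    }

    public static void main(String[] args) {
        Set<String> toDelete = new LinkedHashSet<>(Arrays.asList("end", "something"));
        String line = "end of the day or end of the week";

        // prints \b[end, something]\b
        System.out.println(concatenatedRegex(toDelete));
        // prints endofthedayorendoftheweek (only the spaces are matched)
        System.out.println(line.replaceAll(concatenatedRegex(toDelete), ""));
        // the loop removes the word "end" and keeps the surrounding spaces
        System.out.println("[" + removeEach(line, toDelete) + "]");
    }
}
```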

Another approach would be to build a single regex from all the words in the set, e.g.:

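String regex = "";
for (String word : toDelete) {
    regex += (regex.isEmpty() ? "" : "|") + "(\\b" + word + "\\b)";
}
....
// replaceAll interprets its first argument as a regex; replace would treat it as literal text
line = line.replaceAll(regex, "");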

This should produce a regex that looks something like this: (\bend\b)|(\bsomething\b)
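
One detail worth checking when applying such a pattern: String.replace treats its first argument as literal text, while String.replaceAll interprets it as a regex. The sketch below illustrates the difference with the pattern shown above; the class and method names are hypothetical.

```java
class ReplaceVsReplaceAll {

    // The combined pattern for the words "end" and "something"
    static final String REGEX = "(\\bend\\b)|(\\bsomething\\b)";

    // Literal replacement: looks for the pattern text itself inside the line
    static String withReplace(String line) {
        return line.replace(REGEX, "");
    }

    // Regex replacement: removes every whole-word match
    static String withReplaceAll(String line) {
        return line.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        String line = "something to end";
        // The line is unchanged because it does not contain the pattern text literally
        System.out.println("[" + withReplace(line) + "]");
        // Both words are removed; the spaces around "to" remain
        System.out.println("[" + withReplaceAll(line) + "]");
    }
}
```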

Titus
  • Thanks Titus. Can you suggest a way to include things such as parentheses and special characters inside the regex? – venk May 08 '17 at 09:27
  • @venk You will need to escape those characters using `\`. You can find more details about that [HERE](http://stackoverflow.com/a/10665057/1552587) – Titus May 08 '17 at 18:10