How to remove duplicate words (not necessarily consecutive) from a file using regex?

Question

I want to remove all duplicate words from a file using regex.

For example:

 The university of Hawaii university began using began radio.

Output:

 The university of Hawaii began using radio.

I wrote this regex:

 String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";

But it only removes duplicates that directly follow the first occurrence of the word.

For example: The university university of Hawaii Hawaii began using radio.

Output: The university of Hawaii began using radio.

My code with regex:

File dir = new File("C:/Users/Arnoldas/workspace/uplo/");

            String source = dir.getCanonicalPath() + File.separator + "Output.txt";
            String dest = dir.getCanonicalPath() + File.separator + "Final.txt";

            File fin = new File(source);
            FileInputStream fis = new FileInputStream(fin);
            BufferedReader in = new BufferedReader(new InputStreamReader(fis, "UTF-8"));

            //FileWriter fstream = new FileWriter(dest, true);
            OutputStreamWriter fstream = new OutputStreamWriter(new FileOutputStream(dest, true), "UTF-8");

            BufferedWriter out = new BufferedWriter(fstream);

            String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";

            //String regex = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
            Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

            String aLine;
            while ((aLine = in.readLine()) != null) {

                Matcher m = p.matcher(aLine);
                while (m.find()) {
                    aLine = aLine.replaceAll(m.group(), m.group(1));
                }

                //Process each line and add output to *.txt file
                out.write(aLine);
                out.newLine();
                out.flush();
            }

You cannot do this with regex. – Andy Turner May 23 '18 at 12:46

score 0 · Answer 1 · answered May 23 '18 at 12:56

You could use Streams instead:

String s = "The university university of Hawaii Hawaii began using radio.";
System.out.println(Arrays.asList(s.split(" ")).stream().distinct().collect(Collectors.joining(" ")));

In this example the String is split on the blanks, then transformed into a stream. Duplicates are removed with distinct(), and at the end everything is joined back together with spaces in between.

But this approach has a problem with the dot at the end. "radio" and "radio." are different words.
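One possible way around the punctuation problem, sketched here under assumptions that are not part of the answer: compare tokens by their letters only (lower-cased) while keeping the original token in the output. The class name `DistinctIgnoringPunctuation` and the method `dedupe` are hypothetical, and the stateful `filter` is only safe because the stream is sequential.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class DistinctIgnoringPunctuation {
    // Keeps the first token for each distinct word; later repeats are dropped.
    static String dedupe(String text) {
        Set<String> seen = new HashSet<>();
        return Arrays.stream(text.split(" "))
                // The comparison key is the token stripped of non-letters and lower-cased,
                // so "radio" and "radio." count as the same word.
                .filter(token -> seen.add(
                        token.replaceAll("\\P{IsAlphabetic}", "").toLowerCase(Locale.ROOT)))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String s = "The university of Hawaii university began using began radio.";
        // prints: The university of Hawaii began using radio.
        System.out.println(dedupe(s));
    }
}
```

A limitation of this sketch: when the repeated token is the one carrying the punctuation (as in "radio and radio."), the punctuation is dropped together with the token.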

Joop Eggen · Answer 2 · 2018-05-23T14:03:42.120

You were on the right track, but if there can be text between the repetitions, the replacement must be applied in a loop (for "began ... began ... began").

String s = "The university of Hawaii university began using began radio.";
for (;;) {
    String t = s.replaceAll("(?i)\\b(\\p{IsAlphabetic}+)\\b(.*?)\\s*\\b\\1\\b",
                            "$1$2");
    if (t.equals(s)) {
        break;
    }
    s = t;
}

For case-insensitive replace: use (?i).

This is very inefficient, as the regex must backtrack.

It is simpler to put all the words into a Set.

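// Java 9's Set.of(...) does not work here: it throws IllegalArgumentException
// on duplicate elements and returns an immutable set.
Set<String> corpus = new TreeSet<>(); // sorted; a LinkedHashSet would keep input order
Collections.addAll(corpus, s.split("\\P{IsAlphabetic}+"));

corpus.remove("");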
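To see what the Set approach yields for the example sentence, here is a minimal sketch. It assumes a `LinkedHashSet` so that the words stay in first-occurrence order; the class name `WordSet` and the method `distinctWords` are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

class WordSet {
    // Returns the distinct words of the text in the order they first appear.
    static Set<String> distinctWords(String text) {
        Set<String> corpus = new LinkedHashSet<>();
        // Split on runs of non-letters; punctuation and whitespace are discarded.
        Collections.addAll(corpus, text.split("\\P{IsAlphabetic}+"));
        // split() yields a leading "" when the text starts with a non-letter.
        corpus.remove("");
        return corpus;
    }

    public static void main(String[] args) {
        String s = "The university of Hawaii university began using began radio.";
        // prints: The university of Hawaii began using radio
        System.out.println(String.join(" ", distinctWords(s)));
    }
}
```

Note that the final dot is lost and the comparison is case-sensitive, so this only fits when the list of words matters rather than the original text.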

After comment

Correction of the original code: new-style I/O using Files and Path (still no streams, though), and try-with-resources to close in and out automatically.

The regex is only used to find a word with optional leading whitespace; a Set is used to check for duplicates.

    Path dir = Paths.get("C:/Users/Arnoldas/workspace/uplo");
    Path source = dir.resolve("Output.txt");
    Path dest = dir.resolve("Final.txt");

    // Group 1: optional leading whitespace, group 2: the word itself.
    String regex = "(\\s*)\\b(\\p{IsAlphabetic}+)\\b";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

    // Both Files methods use UTF-8 by default.
    try (BufferedReader in = Files.newBufferedReader(source);
            BufferedWriter out = Files.newBufferedWriter(dest)) {
        String line;
        while ((line = in.readLine()) != null) {
            Set<String> words = new HashSet<>();
            Matcher m = p.matcher(line);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                boolean added = words.add(m.group(2).toLowerCase());
                m.appendReplacement(sb, added ? m.group() : "");
            }
            m.appendTail(sb);
            out.write(sb.toString());
            out.newLine();
        }
    }

It seems nice, but could you take a look at my code? I am confused about how I could change it so that the set would work for me. @JoopEggen – Arnas Arnelis May 23 '18 at 13:38
The search is already case-insensitive. The Set is not, but I add lower-cased words to it. (One could also make a case-insensitive set by passing a Comparator to the constructor.) – Joop Eggen May 23 '18 at 14:01
@ArnasArnelis sorry, I misunderstood the last comment; case-insensitive matching with `String.replaceAll` works with `"(?i)"`, see the answer. – Joop Eggen May 23 '18 at 14:05
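The case-insensitive set mentioned in the comment could look roughly like the sketch below. It assumes a `TreeSet` built with `String.CASE_INSENSITIVE_ORDER` and works on a single line of text; the class name `CaseInsensitiveDedupe` and the method `dedupeLine` are hypothetical, and the test words are plain ASCII.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseInsensitiveDedupe {
    // Group 1: optional leading whitespace, group 2: the word itself.
    private static final Pattern WORD =
            Pattern.compile("(\\s*)\\b(\\p{IsAlphabetic}+)\\b");

    static String dedupeLine(String line) {
        // The comparator makes "Began" and "began" count as the same element,
        // so no toLowerCase() call is needed before add().
        Set<String> seen = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        Matcher m = WORD.matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            boolean firstOccurrence = seen.add(m.group(2));
            // Keep the first occurrence; drop repeats with their leading whitespace.
            m.appendReplacement(sb,
                    firstOccurrence ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: The university of Hawaii began using radio.
        System.out.println(dedupeLine(
                "The university of Hawaii University began using Began radio."));
    }
}
```

Compared with lower-casing every word, the comparator keeps the normalisation in one place, at the cost of a sorted set instead of a hash set.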

score 0 · Answer 3 · answered May 23 '18 at 12:57

Try this regular expression:

\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.

Source: Regular Expression For Consecutive Duplicate Words

Adya

My regex string is a regular expression too, but I needed to change \w+ so that it could read Lithuanian symbols – Arnas Arnelis May 23 '18 at 13:15
What do you mean by Lithuanian symbols? – Adya May 23 '18 at 13:26
Ąčęėįšųūž are symbols of the Lithuanian alphabet – Arnas Arnelis May 23 '18 at 13:28
Then I think you need to add checks for all these symbols specifically – Adya May 23 '18 at 14:43
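An alternative to listing the Lithuanian letters one by one might be Java's `Pattern.UNICODE_CHARACTER_CLASS` flag (inline form `(?U)`), which makes `\w` match Unicode letters. The sketch below is not taken from the thread and only handles consecutive duplicates, as the regex in this answer does; the class name `UnicodeWordDemo` and the method `collapse` are hypothetical.

```java
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // (?U) turns on Pattern.UNICODE_CHARACTER_CLASS, so \w also matches
    // non-ASCII letters; the rest is the consecutive-duplicate regex.
    private static final Pattern CONSECUTIVE_DUPLICATE =
            Pattern.compile("(?U)\\b(\\w+)(\\s+\\1\\b)+");

    // Collapses each run of an immediately repeated word into one occurrence.
    static String collapse(String text) {
        return CONSECUTIVE_DUPLICATE.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only; \u0161 is the letter s with caron.
        System.out.println(collapse("\u0161uo \u0161uo eina"));
    }
}
```

The duplicates in this sketch must match in case; add `(?i)` to the pattern if repeats such as "Šuo šuo" should also be collapsed.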
