I want to remove all duplicate words from a file using a regex.
For example:
The university of Hawaii university began using began radio.
Output:
The university of Hawaii began using radio.
I wrote this regex:
String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";
It only removes duplicates that appear consecutively, immediately after the first occurrence of the word.
For example:
The university university of Hawaii Hawaii began using radio.
Output: The university of Hawaii began using radio.
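To reproduce this behaviour in isolation, here is a minimal sketch that applies the same pattern to both sample sentences. The class and method names (`AdjacentDuplicateDemo`, `collapseAdjacent`) are made up for this illustration, and the replacement is done with `Matcher.replaceAll("$1")`, which keeps only the first word of each matched run.

```java
import java.util.regex.Pattern;

class AdjacentDuplicateDemo {
    // Same pattern as above: a word followed by one or more repeats of itself
    private static final Pattern ADJACENT = Pattern.compile(
            "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Replaces each run of adjacent duplicates with its first word (group 1)
    static String collapseAdjacent(String line) {
        return ADJACENT.matcher(line).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Adjacent duplicates: collapsed to a single word
        System.out.println(collapseAdjacent(
                "The university university of Hawaii Hawaii began using radio."));
        // Non-adjacent duplicates: the pattern does not match, line is unchanged
        System.out.println(collapseAdjacent(
                "The university of Hawaii university began using began radio."));
    }
}
```

The first call prints the cleaned sentence, while the second prints its input unchanged, which is exactly the problem.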
My code using this regex:
File dir = new File("C:/Users/Arnoldas/workspace/uplo/");
String source = dir.getCanonicalPath() + File.separator + "Output.txt";
String dest = dir.getCanonicalPath() + File.separator + "Final.txt";
File fin = new File(source);
FileInputStream fis = new FileInputStream(fin);
BufferedReader in = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
//FileWriter fstream = new FileWriter(dest, true);
OutputStreamWriter fstream = new OutputStreamWriter(new FileOutputStream(dest, true), "UTF-8");
BufferedWriter out = new BufferedWriter(fstream);
String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";
//String regex = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
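String aLine;
while ((aLine = in.readLine()) != null) {
    // Collapse each run of adjacent duplicates to its first word;
    // "$1" in the replacement refers to capture group 1 of the pattern
    aLine = p.matcher(aLine).replaceAll("$1");
    // Process each line and add output to *.txt file
    out.write(aLine);
    out.newLine();
}
// Close the streams; closing the writer also flushes it
in.close();
out.close();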
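One possible direction for the non-adjacent case, as a minimal sketch and not a verified solution: let the pattern allow arbitrary text between a word and its later repeat, keep the word and the text in between, drop the repeat, and re-apply the replacement until the line stops changing. The names `RepeatedWordsSketch`, `REPEATED` and `removeRepeatedWords` are hypothetical. The sketch assumes that words are separated by whitespace and that each line is processed on its own, so a repeat preceded directly by punctuation such as `(` would not be caught.

```java
import java.util.regex.Pattern;

class RepeatedWordsSketch {
    // Group 1: a word. Group 2: the shortest text up to a later repeat of it.
    // The trailing "\\s+\\1\\b" is the repeat that gets dropped.
    private static final Pattern REPEATED = Pattern.compile(
            "\\b(\\p{IsAlphabetic}+)\\b(.*?)\\s+\\1\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Re-applies the replacement until no repeated word is left.
    // Every replacement shortens the line, so the loop terminates.
    static String removeRepeatedWords(String line) {
        String previous;
        do {
            previous = line;
            line = REPEATED.matcher(line).replaceAll("$1$2");
        } while (!line.equals(previous));
        return line;
    }

    public static void main(String[] args) {
        System.out.println(removeRepeatedWords(
                "The university of Hawaii university began using began radio."));
    }
}
```

The loop is needed because a single `replaceAll` pass never rescans text it has already consumed: for the input `a b a b`, the first pass yields `a b b` and only the second pass reduces it to `a b`.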