I want to filter a text, leaving only letters (a-z and A-Z). It seemed easy, following the approach from the question "How to filter a Java String to get only alphabet characters?":
String cleanedText = text.toString().toLowerCase().replaceAll("[^a-zA-Z]", "");
System.out.println(cleanedText);
The problem is that the output is empty unless I change the regex by adding another character, for example a colon:
--> [^:a-zA-Z]
I already tried the plain regex classes (java.util.regex, instead of the replaceAll method of the String object), but I had exactly the same problem.
Any idea what could be the source of this strange behavior?
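To separate the regex from the file handling, here is a minimal sketch that applies the same pattern through java.util.regex to the extract quoted at the end of this post. The class name RegexCheck and the helper lettersOnly are made up for illustration. Matcher.replaceAll with an empty replacement should behave the same as String.replaceAll with the same pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original program.
final class RegexCheck {
    // Matches every character that is not an ASCII letter.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z]");

    // Removes all non-letters; equivalent to s.replaceAll("[^a-zA-Z]", "").
    static String lettersOnly(String s) {
        return NON_LETTERS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "[16]The Free Software Foundation (FSF), started in 1985, "
                + "intended the word \"free\" to mean freedom to distribute";
        String cleaned = lettersOnly(sample.toLowerCase());
        // Printing the length as well shows whether the string is really empty
        // or only displayed as empty by the console.
        System.out.println(cleaned);
        System.out.println(cleaned.length());
    }
}
```

If this prints a non-empty string for the hard-coded sample, the regex itself is probably not the culprit, and the file contents or the console output would be the next things to check.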
I have a txt file which I read using a BufferedReader. I append each line to one long string and apply the code posted above to it. The whole code is as follows:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.lang.StringBuffer;
import java.util.regex.*;

public class Loader {
    public static void main(String[] args) {
        BufferedReader file = null;
        StringBuffer text = new StringBuffer();
        String str;
        try {
            file = new BufferedReader(new FileReader("text.txt"));
        } catch (FileNotFoundException ex) {
            // Report the missing file instead of failing later with a NullPointerException.
            ex.printStackTrace();
            return;
        }
        try {
            // Append every line to one long string (line breaks are dropped by readLine).
            while ((str = file.readLine()) != null) {
                text.append(str);
            }
            String cleanedText = text.toString().toLowerCase().replaceAll("[^:a-z]", "");
            System.out.println(cleanedText);
        } catch (IOException ex) {
            // Do not swallow read errors silently.
            ex.printStackTrace();
        }
    }
}
The text file is a normal article from which I want to delete everything (including whitespace) that is not a letter. An extract: "[16]The Free Software Foundation (FSF), started in 1985, intended the word "free" to mean freedom to distribute"
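For anyone trying to reproduce this, the loader above can be sketched in a more diagnostic form. This is only a sketch under assumptions: the class name LoaderSketch and its helpers are hypothetical, UTF-8 is a guess at the file's encoding, and Locale.ROOT is used so that lowercasing does not depend on the default locale. It prints the lengths before and after filtering, which would show whether the cleaned string is actually empty.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical diagnostic variant of the Loader class.
final class LoaderSketch {
    // Reads all lines and joins them without separators, like the append loop above.
    static String readAll(Reader source) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line);
            }
        } catch (IOException ex) {
            // Surface read errors instead of ignoring them.
            throw new UncheckedIOException(ex);
        }
        return text.toString();
    }

    // Lowercases the text and removes everything that is not a-z.
    static String lettersOnly(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
    }

    public static void main(String[] args) throws IOException {
        // Assumption: text.txt is in the working directory and is UTF-8 encoded.
        Reader source = Files.newBufferedReader(Paths.get("text.txt"), StandardCharsets.UTF_8);
        String raw = readAll(source);
        String cleaned = lettersOnly(raw);
        System.out.println("raw length:     " + raw.length());
        System.out.println("cleaned length: " + cleaned.length());
        System.out.println(cleaned);
    }
}
```

If the cleaned length is greater than zero but nothing visible is printed, the problem would more likely lie in how the console displays one very long line than in the regex.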