Strange behavior with Regex in Java

Question

I want to filter a text, leaving only letters (a-z and A-Z). This seemed easy, following the approach in "How to filter a Java String to get only alphabet characters?":

String cleanedText = text.toString().toLowerCase().replaceAll("[^a-zA-Z]", "");         
System.out.println(cleanedText);

The problem is that the output of this is empty unless I change the regex by adding another character, e.g. ':' --> [^:a-zA-Z]

I already tried the same thing with the regular regex classes (instead of the replaceAll method of Java's String), but I had exactly the same problem.

Any idea what could be the source of this strange behavior?

I have a txt file which I read using a BufferedReader. I append each line to one long string and apply the code posted above to it. The whole code is as follows:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class Loader {

    public static void main(String[] args) {

        StringBuilder text = new StringBuilder();
        String str;

        // try-with-resources closes the reader; a missing file is reported instead of silently ignored
        try (BufferedReader file = new BufferedReader(new FileReader("text.txt"))) {
            // Concatenate all lines into one long string (line breaks are dropped by readLine)
            while ((str = file.readLine()) != null) {
                text.append(str);
            }

            // Workaround pattern with the extra ':'; the text is already lower-cased, so A-Z is not needed
            String cleanedText = text.toString().toLowerCase().replaceAll("[^:a-z]", "");
            System.out.println(cleanedText);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}

The text file is a normal article from which I want to delete everything (including whitespace) that is not a letter. An extract is as follows: "[16]The Free Software Foundation (FSF), started in 1985, intended the word "free" to mean freedom to distribute"
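For reference, the pattern can be checked against this extract in isolation. The following is a minimal sketch, not the original program: the class name ExtractCheck and the helper lettersOnly are hypothetical, and Locale.ROOT is an added assumption to keep the lower-casing independent of the machine's default locale.

```java
import java.util.Locale;

class ExtractCheck {

    // Lower-case the text, then remove every character that is not a-z.
    static String lettersOnly(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
    }

    public static void main(String[] args) {
        String extract = "[16]The Free Software Foundation (FSF), started in 1985, "
                + "intended the word \"free\" to mean freedom to distribute";

        // prints: thefreesoftwarefoundationfsfstartedinintendedthewordfreetomeanfreedomtodistribute
        System.out.println(lettersOnly(extract));
    }
}
```

On a short input like this the result is not empty, which suggests the regex itself behaves as intended.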

Please add some examples, btw: you do not need A-Z when you called toLowerCase before ;-) — Betlista, Jul 27 '17 at 10:39
It works for me. Program: class RegexSample { public static void main(String args[]) { String text = "fdsfsdfsd fg 3443#$@fvc3G##DVD"; String cleanedText = text.toString().toLowerCase().replaceAll("[^a-zA-Z]", ""); System.out.println(cleanedText); } } Output: fdsfsdfsdfgfvcgdvd — padippist, Jul 27 '17 at 10:54

score 1 · Answer 1 · answered Jul 27 '17 at 10:44

As I wrote in a comment, please specify more precisely what's wrong...

What I tried

public class Regexp45348303 {

    public static void main(String[] args) {
        String[] tests = { "abc01", "01DEF34", "abc 01 def.", "a0101\n0202\n0303x" };
        for (String text : tests) {
            String cleanedText = text.toLowerCase().replaceAll("[^a-z]", ""); // A-Z removed too     
            System.out.println(text + " -> " + cleanedText);
        }
    }
}

and the output is:

abc01 -> abc
01DEF34 -> def
abc 01 def. -> abcdef
a0101
0202
0303x -> ax

which is correct based on my understanding...

I tried some more and came to the conclusion that it has to be the length of the text: if I split the text into parts it works perfectly, but if I try to process it in one go the result is empty. — Felix, Jul 28 '17 at 06:52

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

In the end the problem was neither the regex nor the program itself. Eclipse simply does not show the output in the console if it exceeds a certain length (the string is still there and you can keep working with it). To solve this, tick the "Fixed width console" checkbox under Window -> Preferences -> Run/Debug -> Console, as described in http://code2care.org/2015/how-to-word-wrap-eclipse-console-logs-width/

Image of where to check fixed width console checkbox
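One way to confirm this diagnosis without relying on how the console renders long lines is to print the length of the result and to break the output into short lines. The sketch below is only an illustration under stated assumptions: the class name ConsoleOutputCheck, the helper chunk, the width of 80 and the repeated sample sentence are all hypothetical and not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;

class ConsoleOutputCheck {

    // Split a string into consecutive pieces of at most `width` characters.
    static List<String> chunk(String s, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < s.length(); i += width) {
            parts.add(s.substring(i, Math.min(s.length(), i + width)));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Stand-in for the long article text read from the file (hypothetical input).
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            text.append("[16]The Free Software Foundation (FSF), started in 1985 ");
        }
        String cleanedText = text.toString().toLowerCase().replaceAll("[^a-z]", "");

        // A non-zero length shows the regex produced output, whatever the console displays.
        System.out.println("length = " + cleanedText.length());

        // Printing short lines avoids one very long console line.
        for (String line : chunk(cleanedText, 80)) {
            System.out.println(line);
        }
    }
}
```

If the printed length is greater than zero while the single-line output looks empty, the string is fine and only the console display is affected.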
