Efficient way to replace all special characters and numbers in a large text file in Java

Question

I'm currently working on a program that creates a pie chart based on frequencies of letters in a text file, my test file is relatively large and although my program works great on smaller files it is very slow for large files. I want to cut down the time it takes by figuring out a more efficient way to search through the text file and remove special characters and numbers. This is the code I have right now for this portion:

public class readFile extends JPanel {
protected static String stringOfChar = "";
    public static String openFile(){
    String s = "";
            try {
                BufferedReader reader = new BufferedReader(new FileReader("xWords.txt"));
                while((s = reader.readLine()) != null){
                    String newstr = s.replaceAll("[^a-z A-Z]"," ");
                    stringOfChar+=newstr;
                }
                reader.close();
                return stringOfChar;
            }
            catch (Exception e) {
                System.out.println("File not found.");
            }
            return stringOfChar;
    }

The code reads through the text file character by character, replacing all special characters with a space, after this is done I sort the string into a hashmap for characters and frequencies.

I know from testing that this portion of the code is what is causing the bulk of extra time to process the file, but I'm not sure how I could replace all the characters in an efficient manner.

`stringOfChar+=newstr;` don't build result string by concatenating new parts to it in loop. Use `StringBuilder` and its `append` method (more info - or possible duplicate: http://stackoverflow.com/questions/5234147/why-stringbuilder-when-there-is-string). If you want to add delimiter you can use `StringJoiner` and its `add` method. — Pshemo, Nov 28 '16 at 16:10
In fact, don't build a string result at all, since all you would do with it is split it into characters again. Feed the characters straight into the counting mechanism instead. (And please don't put your file-reading logic into a `JPanel`. Separating UI from business logic isn't much effort but well worth it.) — biziclop, Nov 28 '16 at 16:12

Sergey Kalinichenko · Accepted Answer · 2016-11-28T16:33:15.227

3

Your code has two inefficiencies:

It constructs throw-away strings with special characters replaced by space in s.replaceAll
It builds large strings by concatenating String objects with +=

Both of these operations create a lot of unnecessary objects. On top of this, the final String object is thrown away as well as soon as the final result, the map of counts, is constructed.

You should be able to fix both these deficiencies by constructing the map as you read through the file, avoiding both the replacements and concatenations:

public static Map<Character,Integer> openFileAndCount() {
    Map<Character,Integer> res = new HashMap<Character,Integer>();
    BufferedReader reader = new BufferedReader(new FileReader("xWords.txt"));
    String s;
    while((s = reader.readLine()) != null) {
        for (int i = 0 ; i != s.length() ; i++) {
            char c = s.charAt(i);
            // The check below lets through all letters, not only Latin ones.
            // Use a different check to get rid of accented letters
            // e.g. è, à, ì and other characters that you do not want.
            if (!Character.isLetter(c)) {
                c = ' ';
            }
            res.put(c, res.containsKey(c) ? res.get(c).intValue()+1 : 1);
        }
    }
    return res;
}

edited Nov 28 '16 at 16:33

answered Nov 28 '16 at 16:14

Sergey Kalinichenko

714,442
84
1,110
1,523

StringBuilder solved the problem but I agree that this implementation definitely gets rid of unnecessary operations! This is my first project implementing a map of any kind so this didn't come to mind! Thanks a lot!! – Jeni Nov 28 '16 at 16:30
Character.isLetter return true also for special letters like è, à, ì and so on. SO can't be used as an equivalent of the regular expression [a-z A-Z] – Davide Lorenzo MARINO Nov 28 '16 at 16:31
@DavideLorenzoMARINO That's a fair observation, I added a comment to mention that some potentially unwanted characters may get through. Thanks! – Sergey Kalinichenko Nov 28 '16 at 16:34
If you need to change the code of the op to return a different data structure probably an array of int is a better solution here than a Map.You can return an array of 27 (26 letters + 1 space) or 53 (26 uppercase letters, 26 lowercase letters and 1 space) integers. You don't have any the overhead of the Map and the code doesn't need a retrieve and a ternary operator – Davide Lorenzo MARINO Nov 28 '16 at 16:40
@DavideLorenzoMARINO That is also true. I went for hash map because OP mentioned that he converts the string "into a hashmap for characters and frequencies," but given his restrictions a simple array of 53 integers would be even more efficient. – Sergey Kalinichenko Nov 28 '16 at 16:47

Davide Lorenzo MARINO · Answer 2 · 2016-11-28T16:27:18.820

Instead of using the operator + use the class StringBuilder to concatenate strings:

A mutable sequence of characters.

It is a lot more efficient.

Concatenate strings generate a new string for each concatenation. So if you need to that for many times you have a lot of string creations for intermediate strings that are never used because you need only the final result.

A StringBuilder use a different internal representation so it is not necessary to create new objects for every concatenation.

Also replaceAll is very unefficient creating a new String every time.

Here a more efficient code using StringBuilder:

...
StringBuilder build = new StringBuilder();
while((s = reader.readLine()) != null){
    for (char ch : s) {
        if (!(ch >= 'a' && ch <= 'z') 
              && !(ch >= 'A' && ch <= 'Z')
              && ch != ' ') {
            build.append(" ");
        } else {
            build.append(ch);
        }
    }
}
... 
return build.toString();
...

StringBuilder made it run beautifully! Thanks for the recommendation!! — Jeni, Nov 28 '16 at 16:31

Efficient way to replace all special characters and numbers in a large text file in Java

2 Answers2