3

I have a large text file (about 20 million lines) whose lines have the following format:

<string1>, <string2>

These strings may have leading or trailing whitespace, which I want to remove while reading the file.

I am currently using trim() for this, but since String in Java is immutable, trim() creates a new object per trim operation. This wastes a lot of memory.

How can I do it better?

Wiktor Stribiżew
  • 3
    Please show how you are reading the file and then splitting the strings. – Andy Turner Feb 03 '17 at 11:11
  • 1
    You do realize that any unused Strings are collected, so there's no real *waste* of memory, just new created objects (which are efficiently collected by the GC). – Kayaman Feb 03 '17 at 11:16
  • I am not quite sure but I think using [sed](http://www.grymoire.com/Unix/Sed.html) could solve the problem – Imran Ali Feb 03 '17 at 11:16
  • 1
    Show the code that you are using to read in the file; with almost complete certainty, trim() will turn out not to be the main memory bottleneck. – tucuxi Feb 03 '17 at 11:38
  • Split your String on the comma separator, then append each part to a StringBuilder, so that a new String is not created each time, as you said. – Chetan Joshi Feb 03 '17 at 11:48
  • I looked at the logs, and full garbage collection was happening frequently, which led to the application not running at all. Also, a singleton object pattern is being used. That's why I wanted to narrow down the probable reasons for too much memory being used. – Abhishek Kaushik Feb 03 '17 at 11:49

7 Answers

2

I would be surprised if the immutable String class is causing problems; the JVM is very efficient and the result of many years of engineering work.

That said, Java does provide a mutable class for manipulating strings, called StringBuilder; see its API documentation.

If you are working across threads, consider using StringBuffer.
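
To illustrate the mutable-buffer idea, here is a minimal sketch that reuses a single StringBuilder as a scratch buffer and trims it in place. The class name ReusableBuffer and the helper trimInPlace are made up for this example; only the StringBuilder methods are real API. Note that calling toString() on the buffer still allocates a String, so this only helps if the trimmed text can be consumed as a CharSequence.

```java
class ReusableBuffer {
    /** Removes whitespace from both ends of sb in place, without creating a String. */
    static void trimInPlace(StringBuilder sb) {
        // Cut trailing whitespace first: truncating is cheaper than shifting.
        int end = sb.length();
        while (end > 0 && Character.isWhitespace(sb.charAt(end - 1))) {
            end--;
        }
        sb.setLength(end);

        // Then drop leading whitespace (delete's end index is exclusive).
        int start = 0;
        while (start < sb.length() && Character.isWhitespace(sb.charAt(start))) {
            start++;
        }
        sb.delete(0, start);
    }

    public static void main(String[] args) {
        String[] fields = {"  foo ", "bar   "}; // hypothetical sample input
        StringBuilder scratch = new StringBuilder();
        for (String field : fields) {
            scratch.setLength(0);       // reuse the same buffer for every field
            scratch.append(field);
            trimInPlace(scratch);
            System.out.println("[" + scratch + "]");
        }
    }
}
```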

Community
sdgfsdh
0

You can read your string as a stream of characters, and record the start and end position of each token you want to parse.

This still creates an object per token, but if your tokens are relatively long, the two int fields your object will contain are much smaller than the corresponding string would be.

But before you embark on that journey, you should probably just make sure you don't keep your trimmed strings around for longer than needed.
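
A rough sketch of the start/end idea, assuming the line is already available as a char[]. The TokenSpan class and its trimmed method are invented names for illustration, not an existing API.

```java
class TokenSpan {
    final int start; // inclusive index of the first non-whitespace char
    final int end;   // exclusive index just past the last non-whitespace char

    TokenSpan(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /** Computes the trimmed bounds of the region [from, to) of buf without copying any chars. */
    static TokenSpan trimmed(char[] buf, int from, int to) {
        while (from < to && Character.isWhitespace(buf[from])) {
            from++;
        }
        while (to > from && Character.isWhitespace(buf[to - 1])) {
            to--;
        }
        return new TokenSpan(from, to);
    }

    public static void main(String[] args) {
        char[] line = "  alpha ,  beta  ".toCharArray(); // hypothetical sample line
        int comma = 0;
        while (comma < line.length && line[comma] != ',') {
            comma++;
        }
        TokenSpan first = trimmed(line, 0, comma);
        TokenSpan second = trimmed(line, Math.min(comma + 1, line.length), line.length);
        // Strings are created here only to print the result.
        System.out.println(new String(line, first.start, first.end - first.start));
        System.out.println(new String(line, second.start, second.end - second.start));
    }
}
```

The spans are only valid while the underlying buffer is left untouched, so this suits processing one line at a time.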

biziclop
0

Assuming you have a String containing <string1>, <string2>, and you want to split it without first creating untrimmed parts and then trimming them separately:

String trimmedBetween(String str, int start, int end) {
  // Advance start past leading whitespace
  while (start < end && Character.isWhitespace(str.charAt(start))) {
    ++start;
  }

  // Move end back past trailing whitespace
  while (start < end && Character.isWhitespace(str.charAt(end - 1))) {
    --end;
  }

  return str.substring(start, end);
}

(Note this is basically how String.trim() is implemented, just with start and end instead of 0 and length)

Then call like:

int commaPos = str.indexOf(',');
String firstString = trimmedBetween(str, 0, commaPos);
String secondString = trimmedBetween(str, commaPos + 1, str.length());
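
Putting the pieces together, a minimal sketch of how this might be used while reading lines. The class name LineSplitter, the parse helper, and the in-memory sample input are assumptions made for the example; a real program would wrap a file reader in the BufferedReader instead. The String[] return value is only there to keep the sample short and can be avoided by using the two strings directly.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class LineSplitter {
    static String trimmedBetween(String str, int start, int end) {
        while (start < end && Character.isWhitespace(str.charAt(start))) {
            ++start;
        }
        while (start < end && Character.isWhitespace(str.charAt(end - 1))) {
            --end;
        }
        return str.substring(start, end);
    }

    /** Returns the two trimmed fields, or null if the line contains no comma. */
    static String[] parse(String line) {
        int commaPos = line.indexOf(',');
        if (commaPos < 0) {
            return null;
        }
        return new String[] {
            trimmedBetween(line, 0, commaPos),
            trimmedBetween(line, commaPos + 1, line.length())
        };
    }

    public static void main(String[] args) throws IOException {
        // Sample input held in memory; swap in a file reader for real data.
        try (BufferedReader reader = new BufferedReader(new StringReader("  a , b  \nc,d\n"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = parse(line);
                if (fields != null) {
                    System.out.println(fields[0] + "|" + fields[1]);
                }
            }
        }
    }
}
```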
Andy Turner
  • I do want to trim the parts i.e. the individual strings. – Abhishek Kaushik Feb 03 '17 at 11:36
  • Why would I ever want to use this trim instead of the default one? The goal was to avoid memory waste, but you use the same extra memory (= you return a new string) as the built-in `trim()` – tucuxi Feb 03 '17 at 11:47
  • Because `String.trim()` only trims from the beginning and end of the string. To use that you have to split the string (creates an array, and two strings), then trim them (up to two more strings). This approach creates exactly two Strings, instead of 4 Strings and an array. – Andy Turner Feb 03 '17 at 12:26
0

As you already noticed, Strings are immutable. So the solution is to not use String, but rather something that is mutable. StringBuffer is a suitable class.

However, StringBuffer does not include a trim method, so you can use something like:

void trim(StringBuffer sb) {
    // Count the leading whitespace, then remove it in one call (end index is exclusive)
    int start = 0;
    while (start < sb.length() && Character.isWhitespace(sb.charAt(start))) {
        start++;
    }
    sb.delete(0, start);

    // Find the position just past the last non-whitespace character and truncate there
    int end = sb.length();
    while (end > 0 && Character.isWhitespace(sb.charAt(end - 1))) {
        end--;
    }
    sb.setLength(end);
}
Simon Farshid
0

If you want to avoid String then you have to handle it yourself using char and StringBuilder, like this:

import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Test {
    public static void main(String... args) throws Exception {
        // try-with-resources closes the reader even if reading fails
        try (InputStreamReader in = new InputStreamReader(new FileInputStream("<testfile>"), "UTF-8")) {
            char[] buffer = new char[32768];
            int read;
            StringBuilder content = new StringBuilder();
            while ((read = in.read(buffer)) > -1) {
                content.append(buffer, 0, read);
                // Hand off every complete line currently held in the builder
                int index;
                while ((index = content.indexOf("\n")) > -1) {
                    char[] temp = new char[index];
                    content.getChars(0, index, temp, 0);
                    handleLine(temp);
                    content.delete(0, index + 1);
                }
            }

            // The last line may not end with a newline
            if (content.length() > 0) {
                char[] temp = new char[content.length()];
                content.getChars(0, content.length(), temp, 0);
                handleLine(temp);
            }
        }
    }

    private static void handleLine(char[] line) {
        int start = 0;
        int end = line.length;
        // Skip leading whitespace
        while (start < end && Character.isWhitespace(line[start])) {
            start++;
        }
        // Skip trailing whitespace
        while (end > start && Character.isWhitespace(line[end - 1])) {
            end--;
        }

        System.out.println("***" + new String(line, start, end - start) + "***");
    }
}
markbernard
0

We could handle this with a regex.

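   {
    // requires: import java.util.Arrays;
    String str = "abcd, efgh";
    // \\s*,\\s* consumes the comma together with any whitespace around it;
    // trim() removes whitespace at the very start and end of the line
    String[] result = str.trim().split("\\s*,\\s*");
    Arrays.asList(result).forEach(s -> System.out.println(s));
   }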
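
With 20 million lines it may be worth compiling the pattern once instead of letting String.split build it for every line. A small sketch under that assumption; RegexSplit and COMMA are hypothetical names, and the approach still allocates an array and two strings per line.

```java
import java.util.regex.Pattern;

class RegexSplit {
    // Compiled once and reused for every line.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    /** Splits a line into at most two fields with surrounding whitespace removed. */
    static String[] split(String line) {
        // The limit of 2 keeps any further commas inside the second field.
        return COMMA.split(line.trim(), 2);
    }

    public static void main(String[] args) {
        for (String s : split("  abcd ,   efgh  ")) {
            System.out.println("[" + s + "]");
        }
    }
}
```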
Vinod
-1

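I think you can write the result directly to a new file.

String originStr = "   xxxxyyyy";
// NewFileOutPutStream is assumed to be an already-open (ideally buffered) output stream
for (int i = 0; i < originStr.length(); i++) {
    if (' ' == originStr.charAt(i)) {
        continue; // note: this skips every space, not only leading and trailing ones
    }
    NewFileOutPutStream.write(originStr.charAt(i));
}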
Axl
  • If you are using a multithreaded model, you can split your file into a few logical chunks, and the method above also works well. – Axl Feb 03 '17 at 11:32
  • Writing a single char at a time will take forever. You need to buffer it. – markbernard Feb 03 '17 at 11:38
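
As a sketch of the buffering suggested in the comment above, the following copies lines from a Reader to a Writer through BufferedReader and BufferedWriter, trimming both fields on the way. The class TrimToFile and the method copyTrimmed are invented names, and the in-memory reader and writer stand in for real files.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

class TrimToFile {
    /** Copies "<string1>, <string2>" lines from in to out with both fields trimmed. */
    static void copyTrimmed(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                // No comma: just trim the whole line.
                writer.write(line.trim());
            } else {
                writer.write(line.substring(0, comma).trim());
                writer.write(',');
                writer.write(line.substring(comma + 1).trim());
            }
            writer.write('\n');
        }
        // Push any buffered output through to the underlying writer.
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        copyTrimmed(new StringReader("  a , b \nc,d"), out);
        System.out.print(out);
    }
}
```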