5

I'm trying to load large CSV formatted files (typically 200-600mb) efficiently with Java (less memory and as fast as possible access). Currently, the program is utilizing a List of String Arrays. This operation was previously handled with a Lua program using a table for each CSV row and a table to hold each "row" table.

Below is an example of the memory differences and load times:

  • CSV File - 232mb
  • Lua - 549mb in memory - 157 seconds to load
  • Java - 1,378mb in memory - 12 seconds to load

If I remember correctly, duplicate items in a Lua table exist as a reference to the actual value. I suspect in the Java example, the List is holding separate copies of each duplicate value and that may be related to the larger memory usage.

Below is some background on the data within the CSV files:

  • Each field consists of a String
  • Specific fields within each row may include one of a set of Strings (E.g. field 3 could be "red", "green", or "blue").
  • There are many duplicate Strings within the content.

Below are some examples of what may be required of the loaded data:

  • Search through all Strings attempting to match with a given String and return the matching Strings
  • Display matches in a GUI table (sort able via fields).
  • Alter or replace Strings.

My question - Is there a collection that will require less memory to hold the data yet still offer features to easily and quickly search/sort the data?

user1816198
  • 245
  • 1
  • 3
  • 7
  • 1
    if you know that column 3 only holds a few possible values, you could [intern them](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#intern%28%29) to reduce the memory usage. See also: http://stackoverflow.com/a/1855195/829571 – assylias Nov 11 '12 at 15:46
  • Thanks assylias I will run some tests using that. Do you know if it's efficient for short Strings - E.g. "To" or "Go". Most of the fields contain strings that are 45 characters+, however, some are quite short (4 or less). – user1816198 Nov 11 '12 at 15:52
  • 2
    Have look at http://stackoverflow.com/questions/12792942/alternatives-to-java-string-interning – Peter Lawrey Nov 11 '12 at 15:52
  • @PeterLawrey Nice one - how does it perform vs. `intern()`? – assylias Nov 11 '12 at 15:56
  • 1
    @assylias It is faster and scales better, but it only works on a best effort basis, you would get all the duplicates if your size is smaller than the number of unique objects. – Peter Lawrey Nov 11 '12 at 16:05
  • I just attempted to load the same data using intern(). This already resulted in substantial memory usage improvement at a small cost to load time. 596mb in memory - 26 seconds to load. There are a lot of great suggestions here... going to run tests with some of those. – user1816198 Nov 11 '12 at 17:07

4 Answers4

1

One easy solution. You can have some HashMap were you will put references to all unique strings. And in ArrayList you will just have reference to existing unique strings in HashMap.

Something like :

private HashMap<String, String> hashMap = new HashMap<String, String>();

public String getUniqueString(String ns) {
   String oldValue = hashMap.get(ns);
   if (oldValue != null) { //I suppose there will be no null strings inside csv
    return oldValue;
   }        
   hashMap.put(ns, ns);
   return ns;
}

Simple usage:

List<String> s = Arrays.asList("Pera", "Zdera", "Pera", "Kobac", "Pera", "Zdera", "rus");
List<String> finS = new ArrayList<String>();
for (String er : s) {
   String ns = a.getUniqueString(er);
   finS.add(ns);
}
Igor
  • 1,835
  • 17
  • 15
  • 1
    sound's like you're trying to optimize things already optimized by java (saving memory for dupplicate strings in memory), no need for such an implementation, see my answer – Peter Butkovic Nov 11 '12 at 18:10
0

To optimise your your Memory problem i advice to use the Flyweight pattern, specially for fields that have a lot of duplicates.

As a Collection you can use a TreeSet or TreeMap.

If you give a good implementation to your LineItem class (implement equals, hashcode and Comparable) you can optimise the memory use a lot.

Frank
  • 16,476
  • 7
  • 38
  • 51
0

DAWG

A directed acyclic word graph is the most efficient way to store words (best for memory consumption anyway).

But probably overkill here, as others have said don't create duplicates just make multiple references to the same instance.

NimChimpsky
  • 46,453
  • 60
  • 198
  • 311
  • Thanks I will look into this option some more. I wouldn't consider anything overkill yet - the more efficient this is the more data can be loaded per session and that's better for the end user. – user1816198 Nov 11 '12 at 16:05
0

just as a side note.

For the duplicate string data you doubt, you don't need to worry about that, as java itself cares of that as all strings are final, and all references target the same object in memory.

so not sure how lua does the job, but in java it should be also quite efficient

Peter Butkovic
  • 11,143
  • 10
  • 57
  • 81
  • But if this is true than equals is unnecessary at all and == will do job for comparasion – Igor Nov 11 '12 at 16:56
  • well, equals is the correct way, as it's the way you should compare objects in java, == would work too, but it's only kind of side effect, due to way JVM internally handles strings – Peter Butkovic Nov 11 '12 at 18:07
  • Well, I'm not sure how much memory java vm internally holds for keeping string references, but i'm pretty sure that in enough large program == will not work – Igor Nov 11 '12 at 18:10
  • you're joking, right? see: http://stackoverflow.com/questions/767372/java-string-equals-versus (Michal Bernhard's reply), why would JVM reference only some strings (not all) in such an optimized way? – Peter Butkovic Nov 11 '12 at 18:12
  • Yup, but in this problem example, I'm pretty sure @user1816198 will not have bunch of static "Some String" strings but full load of dynamic strings(I suppose he will be using StringBuilders or something to parse csv). Try for example this simple program String a = "Helo,what s up,baby,Hello,baby"; String[] b = a.split(","); for (String c : b) { System.out.println(c); } if (b[0] == b[3]) { System.out.println("Equals Hello"); } if (b[2] == b[4]) { System.out.println("Equals baby"); } Last two ifs are false on my machine. – Igor Nov 11 '12 at 18:21
  • correct, one string is one object reference, no objection on that – Peter Butkovic Nov 11 '12 at 18:29
  • Ok we agree on that :). Example on my answer is really bad I agree, but its only there to show how to use getUniqueString. Parsing csv file will surely create multiple string objects and will not have static "Pera", "Zdera", "Pera", "Kobac", "Pera", "Zdera", "rus"... – Igor Nov 11 '12 at 18:32
  • Two String literals `a="Zebra"` and `b="Zebra"` will point to the same object in memory, two Strings `a=new String("Zebra"); b=new String("Zebra")` will not. In a CSV parser, the latter is the relevant case (unless it uses String.intern or an equivalent). – Jannis Froese May 15 '14 at 16:28