I'm trying to load large CSV files (typically 200-600 MB) efficiently in Java, using as little memory as possible while keeping access fast. The program currently stores the data as a List of String arrays. This job was previously handled by a Lua program that used a table for each CSV row plus an outer table holding each "row" table.
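For reference, the loading code is roughly equivalent to the sketch below (the class and method names are placeholders, and the real parser handles quoting rather than a plain split):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CsvLoader {

    // Loads every row as a String[] into a single List.
    // split(",") is just a stand-in; the real parser deals with quoted fields.
    public static List<String[]> load(String path) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split(",", -1));
            }
        }
        return rows;
    }
}
```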
Below is an example of the memory differences and load times:
- CSV file: 232 MB
- Lua: 549 MB in memory, 157 seconds to load
- Java: 1,378 MB in memory, 12 seconds to load
If I remember correctly, duplicate strings in a Lua table are stored as references to a single value. I suspect that in the Java version the List is holding a separate copy of every duplicate value, which may explain much of the extra memory usage.
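One idea I've been toying with, shown as a rough sketch below (the StringPool class is only an illustration, not my actual code), is deduplicating fields at load time through a shared pool so that repeated values reference a single String object, similar to Lua:

```java
import java.util.HashMap;
import java.util.Map;

public class StringPool {

    private final Map<String, String> pool = new HashMap<>();

    // Returns a canonical instance so duplicate values share one String object,
    // much like Lua keeping one copy of each distinct string.
    public String canonical(String value) {
        String existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }
}
```

Each field would be run through canonical(...) while parsing. String.intern() looks like it would achieve something similar, but I don't know whether either approach is the right trade-off here.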
Below is some background on the data within the CSV files:
- Each field consists of a String
- Certain fields in each row are drawn from a small set of possible Strings (e.g. field 3 could be "red", "green", or "blue").
- There are many duplicate Strings within the content.
Below are some examples of what may be required of the loaded data (see the sketch after this list):
- Search all Strings for matches against a given String and return the matching Strings.
- Display matches in a GUI table (sortable by field).
- Alter or replace Strings.
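To illustrate the kind of access I mean, here is a rough sketch (the field index and exact match rule are made up for the example) of searching and sorting the current List&lt;String[]&gt;:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RowQueries {

    // Returns every row in which any field equals the given String.
    public static List<String[]> findMatches(List<String[]> rows, String target) {
        List<String[]> matches = new ArrayList<>();
        for (String[] row : rows) {
            for (String field : row) {
                if (field.equals(target)) {
                    matches.add(row);
                    break;
                }
            }
        }
        return matches;
    }

    // Sorts rows in place by a single field, e.g. to back a sortable GUI table column.
    public static void sortByField(List<String[]> rows, int fieldIndex) {
        rows.sort(Comparator.comparing((String[] row) -> row[fieldIndex]));
    }
}
```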
My question: is there a collection that would hold this data in less memory while still making it easy and fast to search and sort?