1

I have a data structure:

ArrayList<String>[] a = new ArrayList[100000];

each list has about 1000 strings with about 100 characters.

I'm doing an one-off job with it, and it cost a little more memory than I can bear.

I think I can change less code if I can find ways to reduce some memory cost , as the cost is not too much , and it's just an one-off job. So, please tell me all possible ways you know.

add some info: the reason I;m using a array of arraylists is that the size 100000 is what I can know now. But I don't know the size of each arraylist before I work through all the data.

And the problem is indeed too much data, so I want to find ways to compress it. It's not a allocation problem. There will finally be too much data to exceed the memory.

Mobility
  • 3,117
  • 18
  • 31
  • 1
    First, do you need all the list at the same time ? – AxelH May 26 '17 at 11:36
  • You don't have to define any length if you use `List a = new ArrayList<>();`. Using a defined list size which isn't fully filled (`null` values) might take more memory then you need. – steven May 26 '17 at 11:38
  • @AxelH. I think so. What I need to do is to print it in order at the end. And the data come as a random order. – Mobility May 26 '17 at 11:38
  • @AxelH Oh, never saw such initialization before. But my point remains the same, no need to initialize such a big array size. – steven May 26 '17 at 11:40
  • @AxelH ofcourse I have, just not an `ArrayList` with brackets. Why would you set a size for such lists? Your example here is a basic initialization of an arraylist and it requires a size. Just don't see the point of providing a size in OP's example, that's all :) – steven May 26 '17 at 11:47
  • @steve Because `ArrayList[10]` is just defining an array for `ArrayList`. It has nothing to do with the lists, it is about the `ArrayList` array that is declared. There is no instance of `ArrayList` yet, as every cell will be null. (PS : I remove the previous comment as it is off-topic) ;) – AxelH May 26 '17 at 11:49
  • 1
    Do you need to have all the data available through the entire algorithm? Or even at the end? Do you have a way to break down the data in independent parts? Because as of now the problem you're proposing seems impossible to solve from a general standpoint (i.e i have way too much data, but i need it all throughout all steps of my calculation) – Roc Aràjol May 26 '17 at 11:51
  • Just store the strings in files, one file for each list. – Steve Smith May 26 '17 at 11:53
  • 2
    Possible duplicate of [How do I sort very large files](https://stackoverflow.com/questions/7918060/how-do-i-sort-very-large-files) – AxelH May 26 '17 at 11:53
  • You will need to uses files if the data are not in file already. This way, you can used algorithm (see the duplicate) to sort those files without have to load everything at once. – AxelH May 26 '17 at 11:54
  • @AxelH year, what you and some others said are all correct. I'm just trying to find a shortcut here. – Mobility May 26 '17 at 12:05
  • @MallowFox "_I'm just trying to find a shortcut here._" well you won't find a shortcut in compression. Just shunk the data in files with acceptable size, sort the files then read the files to create a sorted one. Not that hard. This should be possible to write in a couple of hours, maybe less. – AxelH May 26 '17 at 12:11
  • You did not yet answer any of the questions regarding whether or not you need all that data in memory at the same time. Besides, you did not state the kind of operation you want to perform, so any advice to use some compression may just not work for you in the first place. – JimmyB May 26 '17 at 12:21
  • @JimmyB It was really basicly define in comment that OP need to "_What I need to do is to print it in order at the end_", this is the reason of the duplicate I proposed ;) – AxelH May 26 '17 at 12:22
  • @AxelH. year, so seems finally I have to do it in the tradition way. thank you and thanks everyone. – Mobility May 26 '17 at 12:23
  • Oh, ok. Overlooked that comment. – JimmyB May 26 '17 at 12:33
  • @MallowFox if your problem is strictly of sorting, i suggest looking into external sorting algorithms as described in this question https://stackoverflow.com/questions/2087469/sort-a-file-with-huge-volume-of-data-given-memory-constraint – Roc Aràjol May 26 '17 at 12:35
  • One more question: Do you need UTF-16 or can you get away with pure ASCII? – JimmyB May 26 '17 at 12:40
  • @JimmyB I need some UTF characters. – Mobility May 26 '17 at 12:42

2 Answers2

1

it cost a little more memory than I can bear

So, how much is "a little"?

Some quick estimates:

You have collections of string of 1000x100 characters. That should be about 1000x100x2 = 200kb of string data.

If you have 100000 of those, you'll need almost 20Gb for the data alone.

Compared to the 200kb of each collection's data the overhead of your data structures is miniscule, even if it was 100 bytes for each collection (0.05%).

So, not much to be gained here.

Hence, the only viable ways are:

  • Data compression of some kind to reduce the size of the 20Gb payload

  • Use of external storage, e.g. by only reading in strings which are needed at the moment and then discarding them

To me, it is not clear if your memory problem really comes from the data structure you showed (did you profile the program?) or from the total memory usage of the program. As I commented on another answer, resizing an array(list) for instance temporarily requires at least 2x the size of the array(list) for the copying operation. Then notice that you can create memory leaks in Java - or just be holding on to data you actually won't need again.

Edit:

A String in Java consists of an array of chars. Every char occupies two bytes.

You can convert a String to a byte[], where any ASCII character should need one byte only (non-ASCII characters will still need 2 (or more) bytes):

str.getBytes(Charset.forName("UTF-8"))

Then you make a Comparator for byte[] and you're good to go. (Notice though that byte has a range of [-128,127] which makes comparing non-intuitive in this case; you may want to compare (((int)byteValue) & 0xff).)

JimmyB
  • 12,101
  • 2
  • 28
  • 44
  • A compression of 20Gb of data would be complicated to reach at least 8Gb (maybe less). As it would need to load an important amount of data to be able to have a good compression rate. (depending on the algorithm I know ;) ) – AxelH May 26 '17 at 11:58
  • year. The data is around 20G. And I can get about 26 or 27GB memory. So some little compression can be helpful. – Mobility May 26 '17 at 12:03
  • @AxelH. it's a yes with quite big probability. And an error and a restart is also acceptable to me, as I just need less than 10 minutes to run the program. – Mobility May 26 '17 at 12:11
  • @MallowFox I am more afraid that you use the available memory for 10minutes (that's a lot) for the rest of the running application, those might not be that easy to restart after an `OutOfMemoryError`... (but I stop spamming JimmyB answers). – AxelH May 26 '17 at 12:13
  • @AxelH Depending on the nature of the strings (natural language?) you can get decent compression with a comparably small lookup window of a few kb or mb. Even just converting from `char[]` (UTF-16) to `byte[]` (ASCII) can potentially save 50%. – JimmyB May 26 '17 at 12:18
  • @JimmyB mmmh, interesting. I might need to document myself on the subject then! Indeed, if a byte is enough for the character sets in the data, this could be quite effective! – AxelH May 26 '17 at 12:21
  • @AxelH [LZSS](https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Storer%E2%80%93Szymanski) for example can be configured with virtually any size of lookup window (degenerating to RLE for window size 1). – JimmyB May 26 '17 at 12:26
  • This is indeed quite effective. Thanks @JimmyB for that part of documentation :) This would still need to be decompressed to sort the data (or at least decompressed to create the sorted data) since you would change the value and then the chunk position. – AxelH May 26 '17 at 12:34
0

Why are you using Arrays when you don't know the size at compile time itself, Size is the main concern why Linked lists are preferable over arrays

ArrayList< String>[] a = new ArrayList[100000];

Why are you allocating so much memory at once initially, ArrayList will resize itself whenever required you need not do it, manually.

I think below structure will suffice your requirement

List<List<String> yourListOfStringList = new ArrayList<>();
Neeraj Jain
  • 7,643
  • 6
  • 34
  • 62
  • but the problem is not at the beginning. Finally there will be too much data stored. – Mobility May 26 '17 at 11:41
  • 1
    Note though that peak memory use may increase during the resizing operation to at least 2x list.size() for the copying of the underlying array. – JimmyB May 26 '17 at 11:41
  • OP is creating an array of lists, not a giant arraylist. We don't know that he might want 100,000 lists. – Steve Smith May 26 '17 at 11:41
  • @JimmyB Sir I think A Vector defaults to doubling the size of its array, while the ArrayList increases its array size by 50 percent. – Neeraj Jain May 26 '17 at 11:44
  • @NeerajJain it's about the memory usage needed to copy the array to a bigger one Not the actual resizing. – AxelH May 26 '17 at 11:46
  • @SteveSmith that's the point OP doesn't know what size he should use, that's why he has given a big number to initialize the array. I agree with Jimmy in peak cases this solution is not optimal but it will definitely help in other cases – Neeraj Jain May 26 '17 at 11:46
  • @MallowFox Loading this big data in memory is not recommended you should use a chunk of data which is actually crucial to the data processing. – Neeraj Jain May 26 '17 at 11:48
  • @NeerajJain OP knows that he will have 100 000 lists of strings, each of the lists of unknown size – Roc Aràjol May 26 '17 at 11:48
  • @RocAràjol and where is that written sir. he only states `each list has about 1000 strings with about 100 characters.` – Neeraj Jain May 26 '17 at 11:49
  • 1
    @NeerajJain OP just edited this to add this information ;) "_add some info: the reason I;m using a array of arraylists is that the size 100000 is what I can know now_" – AxelH May 26 '17 at 11:50
  • @NeerajJain. That's correct. I'm just trying to find a shortcut here, if possible. – Mobility May 26 '17 at 11:55