Java HashMap that takes to much of the memory

Question

The problem is that my hashmap is taking too much space. I wanna know if the code can be done in a more efficient way for not taking that much memory. I have an huge array and the reason why im using HashMap is because I want a fast way to print out the first occurence of where key = 3 as shown in the code. But the problem is now the memory. I still want it to be relatively fast O(n log n)


ArrayList<String> str = new ArrayList<>();
Map<String, Long> counts2 = new LinkedHashMap<String, Long>();
for(String val : str){
    long count = counts2.getOrDefault(val, 0L);
    counts2.put(val, ++count);
}
for(String key: counts2.keySet()){
    if(counts2.get(key)==3){
        System.out.println(list.indexOf(key));
        break;
    }
}

I had a similar need and ended up building a hash map that was backed by a fast in process, key value, disk storage system. For example https://code.google.com/archive/p/jdbm2/ — bhspencer, Sep 09 '22 at 18:47
@user16320675 `or memory, I would try removing entries with count > 3` - I doubt if this can be efficient, since these entries might repair. And it's error-prone. Consider that there's a string that occur `7` times. The entry gets removed when the count is `4`, and then it reappears with the count of `3` - bingo, now it can affect the result. — Alexander Ivanchenko, Sep 09 '22 at 18:55

OscarRyz · Answer 1 · 2022-09-09T19:29:27.277

0

Update: You should not use the following

I'm just going to leave it here for a bit as a learning of what not to do. Today I learned using a hashcode for sole comparison is not enough. I think the idea of short-circuiting is good, but doesn't seem to be a concern. HashMap already does a great job resolving collisions, and a implementation that replicates that might end up using as much memory as the initial version.

Related questions:

Java: Use hashCode() inside of equals() for convenience?
Two strings: same hashcode
What is a good 64bit hash function in Java for textual strings?

Original answer follows:

One way could be to store the hash instead of the whole string:

...
var count = new HashMap<Integer, Long>();
for(String val: list) {
   count.put(val.hashCode(), count.getOrDefault(val.hashCode(), 0L)+1);
}

Expanding on @Alexander's idea, I think you can save space and computation by saving the hash and the index instead of the plain string and recounting ( + short circuiting )

So:

Iterate the list
Search in the map, if seen for the first time save index and count = 1
If seen before increment count
If count is 3 finish.

import java.util.*;

class SpaceTime {

  public static void main(String ... args) {

    var input = Arrays.asList("one", "two", "three", "two", "three", "two");
    var map = new HashMap<Integer, CountAndIndex>();

    for (int i = 0 ; i < input.size(); i++ ) {
      var s = input.get(i);
      var hc = s.hashCode();
      var cai = map.getOrDefault(hc, startAt(i));
      cai.count++;
      if (cai.count == 3) {
        System.out.printf("We've got it!!. Item: '%s' appears for the first time at index: %d%n", s, cai.index);
        break;
      }
      map.put(hc, cai);
    }
  }
  static CountAndIndex startAt(int index) {
    var cai = new CountAndIndex();
    cai.count = 0;
    cai.index = index;
    return cai;
  }
}

class CountAndIndex {
  long count;
  long index;
}
// output: 

We've got it!!. Item: 'two' appears for the first time at index: 1

edited Sep 09 '22 at 19:29

answered Sep 09 '22 at 17:32

OscarRyz

196,001
113
385
569

The values returned by hashCode, for any class, are explicitly and deliberately not required to be unique. Two different strings could have the same hash code. (Though I haven’t actually looked into how to make this happen.) – VGR Sep 09 '22 at 17:50
@VGR They are not unique as in a `String` and `Point` objects can return the same hash, but: _" the hashCode method must consistently return the same integer"_ [source](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html#hashCode--). That's the value used by `HashMap` after all. – OscarRyz Sep 09 '22 at 17:59
2

Right, but if the hash code is used as a key, you are counting occurrences of each hash code. If two strings happen to generate the same hash code, you are counting both of them together. – VGR Sep 09 '22 at 18:00
1

yes, it "*must consistently return the same integer*" but that must and cannot be unique for different strings! There are more different strings than integers - e.g `"0O".hasCode() == "10".hashCode()` – user16320675 Sep 09 '22 at 18:10
So is the solution not valid? – ZedORYasuo Sep 09 '22 at 18:13
1

(obviously I meant `"0O".hashCode() == "10".hashCode()`) @Zed - that depends on your requirements, probably you don't want to risk having different strings to be counted together (this can be used as an approximation or first guess) – user16320675 Sep 09 '22 at 18:18
@ZedORYasuo is up to you and your use case. The points brought by VGR and user123 are valid: two different Strings can compute the same hashcode (and your original code would have exactly the same problem) I personally (and gazzllions of Java code) happily use `String#hashCode()` all the time. You can add an extra computation (e.g. int x = hash^31) but then you'll land on a realm better solve by the core libraries. – OscarRyz Sep 09 '22 at 18:19
@user16320675 I didn't know it was so common. I think this solution is still valid **IF** a better computation is made. I'll update my answer. – OscarRyz Sep 09 '22 at 18:26
Okay guys, does anyone of u got any better & easier solution to show for this problem? – ZedORYasuo Sep 09 '22 at 18:27
The obvious, feasible solution is to add more memory. Alternatively you can try to compute a unique hash, or store a list for collisions, then, check first for the hash, if it's already there check the collisions list etc.This would increase the code complexity and might introduce more bugs. – OscarRyz Sep 09 '22 at 18:34
The code doesnt seem to work when I change cai.count to another n. So for example when I put "if (cai.count == 1)" it gives me weird output – ZedORYasuo Sep 09 '22 at 18:53
@ZedORYasuo That's weird, I don't know what are you seeing, but I ran it again and works with 1 as expected. That being said, I after reading the comments and testing more I think using the hashcode is a bad idea. There are very interesting information in other links like [this one](https://stackoverflow.com/questions/1660501/what-is-a-good-64bit-hash-function-in-java-for-textual-strings) but at the end end up using more space. I think you have to find a compromise, only you know how big your data is and when you can trade time for space. – OscarRyz Sep 09 '22 at 19:20
@OscarRyz add a "one" to the list and try again. Do you have any good way of solving the problem in an easy and effective way? (that maybe doesnt involve hash map) – ZedORYasuo Sep 09 '22 at 19:22
I added another `"one"` and change the test to `1` I still see the result as `0` anyway, I've looked a lot into different ways to implement a good hash and almost all of them either need more time or space, sometimes both, so I'll delete this answer later today, in the meantime I learned I should not use the `hashCode` method like this. Thank you @user16320675 and @VGR. @ZedOrYassuo, I would just stick to your original implementation that is super fast and take the memory price. Good Luck and have fun. – OscarRyz Sep 09 '22 at 19:27
it's not an issue to use String as key, because hash of string is cached inside ``` public int hashCode() { int h = hash; if (h == 0 && value.length > 0) { hash = h = isLatin1() ? StringLatin1.hashCode(value) : StringUTF16.hashCode(value); } return h; } ``` – Arthur Gazizov Sep 09 '22 at 19:32

Alexander Ivanchenko · Answer 2 · 2022-09-09T18:26:06.840

0

Since your primary concern is a space, you might consider the following performance trade off, which doesn't require allocation of additional memory.

for (int i = 1; i < strings.size(); i++) {
    String next = strings.get(i);
    if (Collections.frequency(strings,next) == 3) {
        System.out.println(i);
        break;
    }
}

edited Sep 09 '22 at 18:26

answered Sep 09 '22 at 18:07

Alexander Ivanchenko

25,667
5
22
46

I tried it its too slow. Im concerned about both – ZedORYasuo Sep 09 '22 at 18:27
2

@ZedORYasuo You need to **specify** that in the question. And note, that in most cases it's a trade off. The most performant solution is not always also the most space efficient, and vice versa. – Alexander Ivanchenko Sep 09 '22 at 18:30

score 0 · Answer 3 · answered Sep 16 '22 at 20:53

There are a couple of optimizations you can try on your current implementation. These are low-cost, quick wins:

Use explicit initial capacity and load factor

By using the appropriate constructor you can specify both initial capacity and load factor for your LinkedHashMap.

Load factor

Higher values decrease the space overhead but increase the lookup cost, according to the docs. You would have to experiment with values between 0.75 (the default) and 0.99 in order to find a sweet spot.

Initial capacity

By using a high value, you can minimize re-hashing due to buckets getting full. Since you are using LinkedHashMap, the impact of a large initial capacity is less critical since iteration time is unaffected. If your use case allows it, you can even eliminate re-hashing by choosing a large enough value to cover all distinct entries (i.e. if you have historical data of how many distinct keys you count or your dataset has finite elements anyway). If you can minimize/eliminate re-hashing, you also minimize any drawbacks cause by larger load factor values.

Only keep interesting entries

It seems that you only need to find one key per frequency. If this is true, you can reduce the data retained in memory once you are done and keep only one key per frequency (count).

Sample code


        Map<String, Long> counts2 = new LinkedHashMap<String, Long>(10_000, 0.95f); //Using the appropriate constructor
        for (String val : str) {
            long count = counts2.getOrDefault(val, 0L);
            counts2.put(val, ++count);
        }

        // Clean up unneeded (?) entries
        final HashMap<Long, Integer> data = new HashMap<>();
        for (Iterator<Map.Entry<String, Long>> it = counts2.entrySet().iterator(); it.hasNext();) {
            Map.Entry<String, Long> entry = it.next();
            if (data.containsKey(entry.getValue())) {
                it.remove();//Already exists; this will save space
            } else {
                data.put(entry.getValue(), str.indexOf(entry.getKey()));
            }
        }

        //You can now remove original counts2 now
        Integer indexOf3 = data.get(Long.valueOf(3));
        System.out.println(str.get(indexOf3) + " @ " + data.get(Long.valueOf(3)));

        //Original code
        for (String key : counts2.keySet()) {
            if (counts2.get(key) == 3) {
                System.out.println(key + " @ " + str.indexOf(key));
                break;
            }
        }

Bonus note:

Your use case reminded me of how Redis optimizes memory usage for hashes. This is an interesting approach, should you consider adding Redis to your stack.

John Gilmer · Answer 4 · 2022-10-30T12:09:04.987

I could think of 2 things to reduce your memory usage:

Use Byte instead of Long
Drop elements that you know occur more than 3 times. This one probably won't save that much memory since I had to create a HashSet to implement that mechanism. Also, doesn't really improve worst-case scenario, but should help on average.

    public static void main(String[] args) {
        var str = List.of("blue", "green", "red", "red", "orange", "green",
                "turquoise", "red", "green", "green");
        var exceeds3Count = new HashSet<String>();
        var counts = new HashMap<String, Byte>();
        for (String val : str) {
            if (exceeds3Count.contains(val)) continue;
            Byte count = counts.compute(val, (k, v) -> v == null ? 1 : (byte) (v + 1));
            if (count > 3) {
                counts.remove(val);
                exceeds3Count.add(val);
            }
        }
        System.out.println(counts);
        var answer = str.stream().filter(s -> counts.getOrDefault(s, (byte) 0) == 3).findFirst().get();
        System.out.println(answer + " is the first occurrence of a string that occurs 3 times");
    }