
I have quite a large project to accomplish and I'm running into some dead ends. I wanted to see if the great community here had any suggestions.

I have a large data set and I'm attempting to build a social graph. The data contains over 9.5 million mappings of coordinates to a Short value. For the keys in the ConcurrentHashMap I am using a String: the two coordinates concatenated with a ',' in between.

Essentially, I'm finding the number of groups in common between users. I have an initial HashMap, built quite easily, that maps a GroupID to a Vector of AvatarIDs. This part runs fine. Then I have 12 threads, each responsible for its own set of GroupIDs; for each group, a thread adds 1 to the count between each pair of users in that group, with all the accesses going through the ConcurrentHashMap.
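To make the structure concrete, the counting step can be sketched with plain JDK collections (all names and the long-key encoding below are illustrative, not the actual code):

public class PairCounts {
  // Pack an ordered user-ID pair (a < b) into a single long key.
  static long pairKey(int a, int b) {
    return ((long) Math.min(a, b) << 32) | (Math.max(a, b) & 0xFFFFFFFFL);
  }

  // Count, for every pair of users, how many groups they share.
  static java.util.Map<Long, Integer> countSharedGroups(
      java.util.Map<Integer, java.util.List<Integer>> preMap) {
    java.util.Map<Long, Integer> counts = new java.util.HashMap<>();
    for (java.util.List<Integer> members : preMap.values())
      for (int i = 0; i < members.size(); i++)
        for (int j = i + 1; j < members.size(); j++)
          counts.merge(pairKey(members.get(i), members.get(j)), 1, Integer::sum);
    return counts;
  }

  public static void main(String[] args) {
    // Toy preMap: GroupID -> member user IDs.
    java.util.Map<Integer, java.util.List<Integer>> preMap = java.util.Map.of(
        100, java.util.List.of(1, 2, 3),
        200, java.util.List.of(1, 2));
    // Users 1 and 2 share two groups; the other pairs share one.
    System.out.println(countSharedGroups(preMap).get(pairKey(1, 2))); // prints 2
  }
}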

After about 8000 groups have been processed, a problem with the accessing occurs: only one thread at a time seems active, and I'm unsure whether this is because of the map's massive size or some other factor. This is a problem, as I have 300,000 groups to process in total (and in a timely manner).

Is there any advice on how I'm implementing this, or any shortcuts I can use? Reads and writes are, I believe, equally important: I have to read a coordinate's value if it exists (and create it if not), then add one to the value and write it back.

I am willing to provide code as needed, I just don't know which portions will be relevant to the discussion yet.

Thanks for your time, -mojavestorm

Further explanation:

Two Implementations and their limits:

1) I have a HashMap<Integer, Vector<Integer>> preMap that contains a GroupID as key and a Vector of UserIDs as value. The threads split the GroupIDs between them; using each Vector<Integer> returned, each thread stores a short value per coordinate (saying UserID x and UserID y belong to (short) n groups together) in a TLongShortHashMap threadMap, and each thread owns its own threadMap. The coordinates are mapped to long values. After every thread has completed, the values of corresponding keys in each of the threadMaps are added to the same key in a combinedMap, which then shows how many groups UserID x and UserID y belong to together across the whole system.
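The combine step might look like this in outline (a minimal sketch, with a plain HashMap<Long, Short> standing in for Trove's TLongShortHashMap):

public class CombineMaps {
  // Add each per-thread map's counts into one combined map.
  static java.util.Map<Long, Short> combine(
      java.util.List<java.util.Map<Long, Short>> threadMaps) {
    java.util.Map<Long, Short> combined = new java.util.HashMap<>();
    for (java.util.Map<Long, Short> threadMap : threadMaps)
      for (java.util.Map.Entry<Long, Short> e : threadMap.entrySet())
        combined.merge(e.getKey(), e.getValue(), (a, b) -> (short) (a + b));
    return combined;
  }

  public static void main(String[] args) {
    // Thread A saw pair 7 twice; thread B saw pair 7 three times and pair 9 once.
    java.util.Map<Long, Short> a = java.util.Map.of(7L, (short) 2);
    java.util.Map<Long, Short> b = java.util.Map.of(7L, (short) 3, 9L, (short) 1);
    System.out.println(combine(java.util.List.of(a, b)).get(7L)); // prints 5
  }
}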

The problem with this implementation is the high overlap between threads, so excessive short values are created. For example, User 1 and User 2 belong to various groups together. Thread A and Thread B are each responsible for their own range of groups, including ones User 1 and User 2 both belong to, so both Thread A and Thread B store, in their copies of the threadMap, a long key for coordinate (1, 2) along with a short value. If excessive overlap occurs, the memory requirement can be enormous. In my case, all 46GB of RAM I allocate to Java gets used up, and quite quickly too.

2) Using the same preMap in this implementation, each thread is given a range of user coordinates it is responsible for. Each thread runs, takes each of its coordinates, and iterates through preMap, checking each GroupID to see whether both UserID x and UserID y appear in the Vector returned from preMap. This implementation eliminates the overlap that occurs between threadMaps.

The problem with this is time. Right now the program is running at a stunning rate: roughly 1,400 years to complete. Memory use wavers between 4GB and 15GB but seems to stay 'low'. I'm not completely sure it will stay within the limit, but I imagine it will. No further improvements are apparent to me.

Hopefully these descriptions are clear and will help give insight to my problem. Thanks.

Mojave Storm

1 Answer


I would have each thread process its own Map. This means each thread can work independently. Once the threads have finished you can combine all the results. (Or possibly combine the results as they complete, but this may add complexity without much advantage.)
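A rough sketch of this pattern, assuming an ExecutorService and plain JDK maps (the keys and workload here are toy stand-ins, not your real data):

import java.util.*;
import java.util.concurrent.*;

public class PerThreadMaps {
  // Each task counts into its own private map (no shared state, no
  // contention); the results are merged once all tasks finish.
  static Map<Long, Integer> countInParallel(int threads, int items) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<Map<Long, Integer>>> futures = new ArrayList<>();
    for (int t = 0; t < threads; t++) {
      final int slice = t;
      futures.add(pool.submit(() -> {
        Map<Long, Integer> local = new HashMap<>();
        // This thread handles items slice, slice+threads, slice+2*threads, ...
        for (long i = slice; i < items; i += threads)
          local.merge(i % 10, 1, Integer::sum); // toy "coordinate" key
        return local;
      }));
    }
    // Combine the per-thread results into one map.
    Map<Long, Integer> combined = new HashMap<>();
    for (Future<Map<Long, Integer>> f : futures)
      f.get().forEach((k, v) -> combined.merge(k, v, Integer::sum));
    pool.shutdown();
    return combined;
  }

  public static void main(String[] args) throws Exception {
    // 1000 items spread over 10 keys -> each key counted 100 times.
    System.out.println(countInParallel(4, 1000).get(0L)); // prints 100
  }
}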

If you are using a short, I would use a collection like TObjectIntHashMap, which is more efficient for handling primitives.


In the simple case, where you have short (16-bit) co-ordinates:

import java.io.IOException;
import java.util.Random;
// plus Trove's TIntIntHashMap (package name depends on the Trove version)

public static void main(String... args) throws IOException {
  int length = 10 * 1000 * 1000;
  int[] x = new int[length];
  int[] y = new int[length];

  Random rand = new Random();
  for (int i = 0; i < length; i++) {
    x[i] = rand.nextInt(10000) - rand.nextInt(10000);
    y[i] = rand.nextInt(10000) - rand.nextInt(10000);
  }

  countPointsWithLongIntMap(x, y);
  countPointsWithMap(x, y);

}

private static Map<String, Short> countPointsWithMap(int[] x, int[] y) {
  long start = System.nanoTime();
  Map<String, Short> counts = new LinkedHashMap<String, Short>();
  for (int i = 0; i < x.length; i++) {
    String key = x[i] + "," + y[i];
    Short s = counts.get(key);
    if (s == null)
      counts.put(key, (short) 1);
    else
      counts.put(key, (short) (s + 1));
  }
  long time = System.nanoTime() - start;
  System.out.printf("Took %.3f seconds to use Map<String, Short>%n", time/1e9);

  return counts;
}

private static TIntIntHashMap countPointsWithLongIntMap(int[] x, int[] y) {
  long start = System.nanoTime();
  TIntIntHashMap counts = new TIntIntHashMap();
  for (int i = 0; i < x.length; i++) {
    int key = (x[i] << 16) | (y[i] & 0xFFFF); // pack the two 16-bit values into one int key
    counts.adjustOrPutValue(key, 1, 1);
  }
  long time = System.nanoTime() - start;
  System.out.printf("Took %.3f seconds to use TIntIntHashMap%n", time/1e9);
  return counts;
}

prints

Took 1.592 seconds to use TIntIntHashMap
Took 4.889 seconds to use Map<String, Short>
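If the co-ordinates don't fit in 16 bits each, the same packing trick extends to a long key (a sketch; the mask on the low half matters so negative y values don't smear sign bits into the x half):

public class LongKey {
  // Pack two full-range ints into one long key.
  static long key(int x, int y) {
    return ((long) x << 32) | (y & 0xFFFFFFFFL);
  }

  static int x(long key) { return (int) (key >> 32); } // high half, sign preserved
  static int y(long key) { return (int) key; }         // low half, sign preserved

  public static void main(String[] args) {
    long k = key(-12345, 67890);
    System.out.println(x(k) + "," + y(k)); // prints -12345,67890
  }
}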

If you have double co-ordinates, you need to use a two-tier map.

public static void main(String... args) throws IOException {
  int length = 10 * 1000 * 1000;
  double[] x = new double[length];
  double[] y = new double[length];

  Random rand = new Random();
  for (int i = 0; i < length; i++) {
    x[i] = (rand.nextInt(10000) - rand.nextInt(10000)) / 1e4;
    y[i] = (rand.nextInt(10000) - rand.nextInt(10000)) / 1e4;
  }

  countPointsWithLongIntMap(x, y);
  countPointsWithMap(x, y);

}

private static Map<String, Short> countPointsWithMap(double[] x, double[] y) {
  long start = System.nanoTime();
  Map<String, Short> counts = new LinkedHashMap<String, Short>();
  for (int i = 0; i < x.length; i++) {
    String key = x[i] + "," + y[i];
    Short s = counts.get(key);
    if (s == null)
      counts.put(key, (short) 1);
    else
      counts.put(key, (short) (s + 1));
  }
  long time = System.nanoTime() - start;
  System.out.printf("Took %.3f seconds to use Map<String, Short>%n", time / 1e9);

  return counts;
}

private static TDoubleObjectHashMap<TDoubleIntHashMap> countPointsWithLongIntMap(double[] x, double[] y) {
  long start = System.nanoTime();
  TDoubleObjectHashMap<TDoubleIntHashMap> counts = new TDoubleObjectHashMap<TDoubleIntHashMap>();
  for (int i = 0; i < x.length; i++) {
    TDoubleIntHashMap map = counts.get(x[i]);
    if (map == null)
      counts.put(x[i], map = new TDoubleIntHashMap());
    map.adjustOrPutValue(y[i], 1, 1);
  }
  long time = System.nanoTime() - start;
  System.out.printf("Took %.3f seconds to use TDoubleObjectHashMap<TDoubleIntHashMap>%n", time / 1e9);
  return counts;
}

prints

Took 3.023 seconds to use TDoubleObjectHashMap<TDoubleIntHashMap>
Took 7.970 seconds to use Map<String, Short>
Peter Lawrey
  • Great, thanks for the advice. I will implement that now and get back to you. I guess I had initially thought that you'd be doing just as much work when trying to combine all the HashMaps together, but it may work better when dealing with threads, which I am no professional with. – Mojave Storm Aug 27 '11 at 20:38
  • The number of operations is the same; however, each CPU has its own cache, and a local Map can use that cache. If you have one shared Map it will be in the slowest cache (at best). – Peter Lawrey Aug 27 '11 at 20:41
  • I would also see if you can avoid using String as the key, as this is likely to be 10-100x slower than using a long with a TLongIntHashMap, for example. – Peter Lawrey Aug 27 '11 at 20:42
  • I see! Makes sense, and hopefully we'll see the results soon. The problem is I'm not quite sure how to take coordinates and map them uniquely to a long value easily. I guess you could number them left to right, top to bottom and then take the division as the row and remainder as the column value. Not sure how much overhead this would create. Probably less than using a string. Might try it! – Mojave Storm Aug 27 '11 at 20:53
  • The reason for using an integer as the key is that for a String a hash has to be computed, but for an integer the value is the hash code. – Rostislav Matl Aug 27 '11 at 22:04
  • Using a long avoids the need for an object (creating a String makes several). Say you have two int values for co-ordinates; you can turn this into a long with `(long) x << 32 | y` – Peter Lawrey Aug 28 '11 at 06:56
  • Thanks for your help and the comparison on the run times of the maps. I didn't realize how much of an improvement the TObjectHashMaps were. I've implemented the program as mentioned before, and the problem I'm running into is the amount of memory each individual thread's hashmap is taking up ( > 46GB of memory because my data is so large). It's kind of blowing my mind how much memory this is taking, even if I use TLongByteHashMaps in each thread. I might just need to take a subset of my data I guess. Marked your answer as correct because it did improve aspects of my program. Thanks! – Mojave Storm Aug 28 '11 at 14:32
  • Sounds like you have too many threads if it's using up 46 GB. I would start with one thread per core, e.g. 4 - 12. If you are using this much memory, do you have 9.5 billion points? A few million shouldn't use up a GB no matter how you do it. – Peter Lawrey Aug 28 '11 at 15:06
  • Peter, please review my edited post, as I have included further explanation of my problem. Thanks again for your time, I think I'm missing something obvious, but I'm just ignorant to it. – Mojave Storm Aug 28 '11 at 17:59
  • Can you provide any sample code which reproduces your problem? e.g. with some mock data like I have. – Peter Lawrey Aug 28 '11 at 22:01