-1

I tried to find out why a part of my application runs very slow. I used 'jmc' over 5 Minutes and ran that part of my application which takes so long. Analysing the methods-Section, I found out that 66% of the time were due to one function (no method-call inside showed up).

The method looks like this and is called about 4 million times:

public DataCell getKNIMECell(int rowIdx) {

    if(m_missingFlags.contains(rowIdx))
        return DataType.getMissingCell();

    switch(m_type) {
    case R_LOGICAL:
        return BooleanCellFactory.create((boolean)m_data[rowIdx]);
    case R_INT:
        return IntCellFactory.create((int) m_data[rowIdx]);
    case R_DOUBLE:
        return DoubleCellFactory.create((double) m_data[rowIdx]);
    case R_FACTOR:
    case R_STRING:
        return StringCellFactory.create((String) m_data[rowIdx]);
    default:
    }
    return null;
}

m_type is a class member and an enum defined within another class like this:

public enum RType { R_DOUBLE, R_LOGICAL, R_INT, R_STRING, R_FACTOR };

The array m_data is of type 'Object' and has around 4 million entries. m_missingFlag is a ArrayList<Integer>. I really don't know how to speed up that part of the code. Any ideas? As I said, none of the calls within that method seems to take a lot of time.

Antje Janosch
  • 1,154
  • 5
  • 19
  • 37
  • 1
    whats m_missingFlags ? – Sleiman Jneidi Nov 20 '15 at 13:44
  • And is there something special done by these factories? – Tom Nov 20 '15 at 13:45
  • 1. use if else instead of switch case 2. try reuse objects when possible instead of creating new ones 3. rework contains check – AdamSkywalker Nov 20 '15 at 13:45
  • Which concrete implementation of [Collection](http://docs.oracle.com/javase/8/docs/api/java/util/Collection.html) is `m_missingFlags`? – A4L Nov 20 '15 at 13:46
  • @Tom: as I said these factories do not pop up to consume a lot of time. – Antje Janosch Nov 20 '15 at 13:47
  • m_missingFlags is an ArrayList – Antje Janosch Nov 20 '15 at 13:49
  • @AdamSkywalker: can you be a bit more concrete concerning 2) and 3) ? – Antje Janosch Nov 20 '15 at 13:49
  • 1
    If this list does not have to contain duplicate entries (I highly doubt it should), then use a [HashSet](http://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html) instead. – A4L Nov 20 '15 at 13:50
  • @July AdamSkywalker is trying to find out if you can reuse DataCell objects since that appears to be what is actually slow... ie the construction of `DataCell`. – Adam Gent Nov 20 '15 at 13:51
  • 1
    2. there are two possible vals for boolean, you can create two cells instances for each one and reuse them, same for other data types if you know the input data 3. contains is O(N) for array list, try hash set instead or even more clever data structure (depends on the input data) – AdamSkywalker Nov 20 '15 at 13:51
  • If the method calls invoked within this method aren't showing up as taking a lot of time, then the load must be spread rather uniformly among them. The method itself _can't_ take a lot of time aside from the method calls it invokes. There just isn't anything there. – Erick G. Hagstrom Nov 20 '15 at 13:52
  • 66% of the time in the method does not mean in _one_ call. I guess the method is called 4 million times ... Maybe you should consider using a datastructure that does not have to iterate 4 million entries of which most will be "missing" ... – Fildor Nov 20 '15 at 13:55
  • @AdamSkywalker: okay, it should be fine to reuse two different BooleanCell objects. In case of the others it is pretty likely that the content changes and I only can set the content within the constructor. – Antje Janosch Nov 20 '15 at 13:56
  • 2
    static casting can be quite expensive, you should change your data structure -- casting in itself is the worst possible approach. Instead you should store and retrieve the objects in their original type. Also : Iteration over large sets of similar objects can be parallelized, simply use a [parallel stream](https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html) and you will see massive performance improvements – specializt Nov 20 '15 at 13:57
  • Is this method invoked for every element in m_data? If so, does the rowIdx parameter increment sequentially, or is it used for random access? – Matt Stephenson Nov 20 '15 at 13:58
  • @MattStephenson: it increments sequentially, yes and it is for every element in the data – Antje Janosch Nov 20 '15 at 14:01
  • 1
    Wow! To everybody proposing to use HashSet instead of ArrayList: THANK YOU! That speeds up enormously. :-) – Antje Janosch Nov 20 '15 at 14:04
  • 1
    beware : `HashSet` relies on `hashCode()` of every object in the set - if you dont explicitly define the body of this method a standard algorithm is used - which _may_ produce collisions once you get past a few thousand entries ... but that _usually_ that doesnt happen. Its basically a game of luck with luck being on your side --- yet its still a game of chance / uncertainty. – specializt Nov 20 '15 at 14:18
  • apparently its a random number : http://stackoverflow.com/a/32454673/351861. These _usually_ produce collisions after millions of invocations. Beware. – specializt Nov 20 '15 at 14:25

2 Answers2

1

m_missingFlags is an ArrayList<>

This may be your bottleneck - if the list is big. Try using a HashSet.

OldCurmudgeon
  • 64,482
  • 16
  • 119
  • 213
0

My guess (and I just decided to write an answer because the comments were getting overwhelming) is that the performance issue is because of the large object array and call contains List<Boolean> missing flags (also we have no idea what list implementation that is as well).

My approaches to fix this would be to

  1. cache DataCell (I hope its immutable) (ie particularly for boolean)
  2. use a different data structure for m_missingFlags (ie bloom filter or some tree, or hash).
  3. create an array per data type (this avoids some casting issues but costs more memory).

That is roughly the order I would try things but your mileage may vary as I have no idea how or what DataCell is composed of.

Adam Gent
  • 47,843
  • 23
  • 153
  • 203