
I have been thinking about it but have run out of ideas. I have 10 arrays, each of length 18, holding 18 double values. These 18 values are features of an image. Now I have to apply k-means clustering on them.

For implementing k-means clustering I need a unique computational value for each array. Is there any mathematical, statistical, or other logic that would help me create a computational value for each array that is unique to it, based upon the values inside it? Thanks in advance.

Here is my example array. I have 10 more like it.

[0.07518284315321135,
 0.002987851573676068,
 0.002963866526639678,
 0.002526139418225552,
 0.07444872939213325,
 0.0037219653347541617,
 0.0036979802877177715,
 0.0017920256571474585,
 0.07499695903867931,
 0.003477831820276616,
 0.003477831820276616,
 0.002036159171625004,
 0.07383539747505984,
 0.004311312204791184,
 0.0043352972518275745,
 0.0011786937400740452,
 0.07353130134299131,
 0.004339580295941216]
DarkHorse
  • Check this out - http://stackoverflow.com/questions/21111070/implementation-of-k-means-clustering-algorithm – Keerthivasan Mar 12 '14 at 05:46
  • @Octopus Checked; it works on single values. I have 10 such arrays that I have to use for clustering. 1 array = one image's features. In short, I have to create clusters of similar images – DarkHorse Mar 12 '14 at 05:52
  • The first approach could be based on `Arrays.hashCode(doubleArray)`. This is **not** unique for the array, but ... there is no unique representation of 18 double values that is much smaller than the 18 double values themselves, anyhow. For 10 arrays, the chance that two arrays have the same hashCode value should already be very, very low, and in doubt, could be verified and handled manually. But all this might not help you much: If you intend to perform a clustering on these "IDs", then the IDs must preserve *similarities* - is this correct? – Marco13 Mar 12 '14 at 09:30
  • @Marco13 Yes, I intend to preserve similarities in that unique computed value. Will hashCode preserve them? I have just taken 10 images for now; the database will contain thousands of images – DarkHorse Mar 12 '14 at 09:39
  • No, hashCode will **not** preserve similarities. In fact, a similarity-preserving reduction of dimensionality is a common problem, but there are some fairly mature tools available. You might want to have a look at http://en.wikipedia.org/wiki/Dimensionality_reduction#Dimension_reduction , and there are also some Java libraries available, e.g. for a PCA – Marco13 Mar 12 '14 at 09:45
  • Ok, I am having a look at it; they probably reduce the values to manageable ones, but not to a single one. Will have to explore... – DarkHorse Mar 12 '14 at 10:07
  • @DarkHorse why do you need a single value per se? k-means works on d-dimensional real vectors which is what you have with your 10 arrays. The values of your arrays would give it a unique position in that 18-dimensional space. – xlm Mar 14 '14 at 10:15
  • Wouldn't a merkle tree work for this. It seems to meet your requirements of a computational value along with an ability to check for similarity. http://en.wikipedia.org/wiki/Merkle_tree – mikea Mar 14 '14 at 10:18
  • @xlm I am keen to make it work in a single dimension. Won't it be better if I can represent each array uniquely by a single value? It would also work where we can check similarity between arrays.. – DarkHorse Mar 15 '14 at 07:02
  • @mikea I don't think a Merkle tree would work; suppose my array1 is [2 3 4 5] and array2 is [3 2 5 4]. Merkle will probably generate the same hash for both arrays, though they are different. – DarkHorse Mar 15 '14 at 07:09
  • @DarkHorse You can already calculate similarity between d-dimension vectors. Reduction from 18D to 1D is very ambitious, my concern is if you reduce 18D vectors into 1D, calculating similarity between them in 1D could be very misleading. – xlm Mar 15 '14 at 07:29
  • Develop an algorithm that will convert those numbers into base-18 or base-36 or base-72 (or more) `char` representations. Does it need to be numeric, and if yes, why? As stated above, you cannot represent those doubles ('Reals') uniquely with just the 10 (0-9) integer digits in fewer digits than the doubles themselves. But as I see your data set, you can safely remove the first 0, the dot, and the following 0 from your reals and represent them as integers, but take care of the leading zero(es) while converting them into integer number representations. – Mar 19 '14 at 21:26
  • @Marco13, why is Arrays.hashCode(doubles) not unique for an array? Java 7 also uses the order of elements while computing the hash code. I'm confused.. – Mani Mar 20 '14 at 17:27
  • @Mani An `int` value like the hash code has 32 bits, and thus there are 4294967296 possible hash codes. But there are definitely more than 4294967296 possible `double[]` arrays. (In fact, there already are more than 4294967296 different `double` *values*). So there must be at least two `double[]` arrays that have the same hash code. – Marco13 Mar 20 '14 at 17:51
  • @DarkHorse - is there a range for the double values? Like in your sample it is 0 < x < 1. Is it always true? – Mani Mar 20 '14 at 21:03
  • @Mani .. Yup, there is a range for the double values: 0 < x < 10. This is always true; the values are never above 10 and never 0.000000000000000000.... – DarkHorse Mar 21 '14 at 04:20
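
As @xlm's comments point out, k-means only needs a distance between d-dimensional points, so the 18-value arrays can be clustered directly without reducing them to a single number first. A minimal sketch of that idea (class and method names are my own, not from the discussion):

```java
public class FeatureDistance {

    // Euclidean distance between two equal-length feature vectors
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Index of the centroid closest to the given feature vector;
    // this is the assignment step of k-means, applied to whole arrays
    public static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (distance(point, centroids[c]) < distance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centroids = { { 0.0, 0.0 }, { 1.0, 1.0 } };
        double[] image = { 0.9, 0.8 };
        System.out.println("Nearest centroid: " + nearestCentroid(image, centroids));
    }
}
```

With 18-element arrays the code is identical; only the vector length changes.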

7 Answers


Did you check Arrays.hashCode in Java 7?

 /**
 * Returns a hash code based on the contents of the specified array.
 * For any two <tt>double</tt> arrays <tt>a</tt> and <tt>b</tt>
 * such that <tt>Arrays.equals(a, b)</tt>, it is also the case that
 * <tt>Arrays.hashCode(a) == Arrays.hashCode(b)</tt>.
 *
 * <p>The value returned by this method is the same value that would be
 * obtained by invoking the {@link List#hashCode() <tt>hashCode</tt>}
 * method on a {@link List} containing a sequence of {@link Double}
 * instances representing the elements of <tt>a</tt> in the same order.
 * If <tt>a</tt> is <tt>null</tt>, this method returns 0.
 *
 * @param a the array whose hash value to compute
 * @return a content-based hash code for <tt>a</tt>
 * @since 1.5
 */
public static int hashCode(double a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (double element : a) {
        long bits = Double.doubleToLongBits(element);
        result = 31 * result + (int)(bits ^ (bits >>> 32));
    }
    return result;
}

I don't understand why @Marco13 mentioned "this is not returning unique values for arrays".

UPDATE

See @Marco13's comment for the reason why it cannot be unique.


UPDATE

If we draw a graph using your input points, the 18 elements show one spike followed by 3 low values, and the pattern repeats. If that is true, you can find the mean of your peaks (positions 1, 4, 8, 12, 16) and the low mean from the remaining values.

So you will have a peak mean and a low mean, and you can find a unique number that represents these two, while also preserving the values, using the bijective algorithm described in here.

This algorithm also provides formulas to reverse it, i.e. take the peak and low mean back out of the unique value.

To find the unique pair: <x, y> = x + (y + (x+1)/2)^2

Also refer to Exercise 1 on page 2 of the pdf to reverse x and y.

To find the means and the pairing value:

public static double mean(double[] array) {
    double peakMean = 0;
    double lowMean = 0;
    for (int i = 0; i < array.length; i++) {
        // positions 1, 4, 8, 12, 16 (1-based) are the peaks
        if ((i + 1) % 4 == 0 || i == 0) {
            peakMean = peakMean + array[i];
        } else {
            lowMean = lowMean + array[i];
        }
    }
    peakMean = peakMean / 5;   // 5 peak values
    lowMean = lowMean / 13;    // 13 low values
    return bijective(lowMean, peakMean);
}

public static double bijective(double x, double y) {
    // pairing function: x + (y + (x+1)/2)^2
    double tmp = (y + ((x + 1) / 2));
    return x + (tmp * tmp);
}

for test

public static void main(String[] args) {
    double[] arrays = {0.07518284315321135, 0.002987851573676068, 0.002963866526639678,
            0.002526139418225552, 0.07444872939213325, 0.0037219653347541617,
            0.0036979802877177715, 0.0017920256571474585, 0.07499695903867931,
            0.003477831820276616, 0.003477831820276616, 0.002036159171625004,
            0.07383539747505984, 0.004311312204791184, 0.0043352972518275745,
            0.0011786937400740452, 0.07353130134299131, 0.004339580295941216};
    System.out.println(mean(arrays));
}

You can use these peak and low values to find similar images.

Mani
  • Your updated answer is stuck on a pattern, but the arrays don't follow a pattern like 1 spike and 3 low values; I get some arrays with 2 spikes or 3 spikes.. it's random.. But the use of bijective is really an addition to your answer.. The array can be sorted in descending order and a mid-value can be calculated to find the peak mean and low mean, but the positioning of values reflects similarity; that will be affected. But if I had a similar pattern, this would have been a perfect answer :) – DarkHorse Mar 22 '14 at 05:04

You can simply sum the values; using double precision, the resulting value will be unique most of the time. On the other hand, if the position of each value is relevant, then you can compute a sum using the index as a multiplier.

The code could be as simple as:

public static double sum(double[] values) {
    double val = 0.0;
    for (double d : values) {
        val += d;
    }
    return val;
}

public static double hash_w_order(double[] values) {
    double val = 0.0;
    for (int i = 0; i < values.length; i++) {
        val += values[i] * (i + 1);
    }
    return val;
}

public static void main(String[] args) {
    double[] myvals =
        { 0.07518284315321135, 0.002987851573676068, 0.002963866526639678, 0.002526139418225552, 0.07444872939213325, 0.0037219653347541617, 0.0036979802877177715, 0.0017920256571474585, 0.07499695903867931, 0.003477831820276616,
                0.003477831820276616, 0.002036159171625004, 0.07383539747505984, 0.004311312204791184, 0.0043352972518275745, 0.0011786937400740452, 0.07353130134299131, 0.004339580295941216 };

    System.out.println("Computed value based on sum: " + sum(myvals));
    System.out.println("Computed value based on values and its position: " + hash_w_order(myvals));
}

The output for that code, using your list of values is:

Computed value based on sum: 0.41284176550504803
Computed value based on values and its position: 3.7396448842464496
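
A caveat worth noting (my own sketch, not part of the answer): the plain sum collides for any permutation of the same values, which is exactly what the index multiplier mitigates:

```java
public class SumCollision {

    // Plain sum: order-insensitive, so permutations of the same values collide
    public static double sum(double[] values) {
        double val = 0.0;
        for (double d : values) {
            val += d;
        }
        return val;
    }

    // Index-weighted sum: order-sensitive, so permutations usually differ
    public static double hashWithOrder(double[] values) {
        double val = 0.0;
        for (int i = 0; i < values.length; i++) {
            val += values[i] * (i + 1);
        }
        return val;
    }

    public static void main(String[] args) {
        double[] a = { 1.0, 2.0, 3.0 };
        double[] b = { 3.0, 2.0, 1.0 }; // same values, different order
        System.out.println(sum(a) == sum(b));                     // true: collision
        System.out.println(hashWithOrder(a) == hashWithOrder(b)); // false
    }
}
```

Even the weighted form is not guaranteed unique, though; different value/position combinations can still produce the same total.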
Roberto
  • Sum won't work, and nor will mean.. but the idea of multiplying index positions with values seems to be good. Will have to check how they affect the results for this idea to be quite useful.. – DarkHorse Mar 17 '14 at 05:30

Well, here's a method that works for any number of doubles.

public BigInteger uniqueID(double[] array) {
    // 2^64, the number of distinct bit patterns of a double
    // (note: Long.MAX_VALUE + 1 is only 2^63, which would not be enough)
    final BigInteger twoToTheSixtyFour = BigInteger.ONE.shiftLeft(64);

    BigInteger count = BigInteger.ZERO;
    for (double d : array) {
        long bitRepresentation = Double.doubleToRawLongBits(d);
        count = count.multiply(twoToTheSixtyFour);
        count = count.add(BigInteger.valueOf(bitRepresentation));
    }
    return count;
}

Explanation

Each double is a 64-bit value, which means there are 2^64 different possible double values. Since a long is easier to work with for this sort of thing, and it's the same number of bits, we can get a 1-to-1 mapping from doubles to longs using Double.doubleToRawLongBits(double).

This is awesome, because now we can treat this like a simple positional-numbering problem. How do you know that 1234 is a unique number? There's no other number with the same value. This is because we can break it up by its digits like so:

1234 = 1 * 10^3 + 2 * 10^2 + 3 * 10^1 + 4 * 10^0

The powers of 10 would be "basis" elements of the base-10 numbering system, if you know linear algebra. In this way, base-10 numbers are like arrays consisting of only values from 0 to 9 inclusive.

If we want something similar for double arrays, we can discuss the base-(2^64) numbering system. Each double value would be a digit in a base-(2^64) representation of a value. If there are 18 digits, there are (2^64)^18 unique values for a double[] of length 18.

That number is gigantic, so we're going to need to represent it with a BigInteger data-structure instead of a primitive number. How big is that number?

(2^64)^18 = 61172327492847069472032393719205726809135813743440799050195397570919697796091958321786863938157971792315844506873509046544459008355036150650333616890210625686064472971480622053109783197015954399612052812141827922088117778074833698589048132156300022844899841969874763871624802603515651998113045708569927237462546233168834543264678118409417047146496

There are that many unique configurations of 18-length double arrays and this code lets you uniquely describe them.
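
Because the mapping is positional, it can also be inverted to recover the original doubles. A rough sketch of both directions (my own class and method names, using a base of 2^64):

```java
import java.math.BigInteger;
import java.util.Arrays;

public class UniqueIdCodec {

    private static final BigInteger BASE = BigInteger.ONE.shiftLeft(64); // 2^64

    // Treat each double's 64-bit pattern as one digit in base 2^64
    public static BigInteger encode(double[] array) {
        BigInteger count = BigInteger.ZERO;
        for (double d : array) {
            count = count.multiply(BASE)
                         .add(BigInteger.valueOf(Double.doubleToRawLongBits(d)));
        }
        return count;
    }

    // Peel the digits back off, last element first
    public static double[] decode(BigInteger id, int length) {
        double[] result = new double[length];
        for (int i = length - 1; i >= 0; i--) {
            // mod() is always non-negative, so longValue() restores the
            // original two's-complement bit pattern of the double
            long bits = id.mod(BASE).longValue();
            result[i] = Double.longBitsToDouble(bits);
            id = id.subtract(BigInteger.valueOf(bits)).divide(BASE);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] original = { 0.07518284315321135, -2.5, 0.0 };
        BigInteger id = encode(original);
        System.out.println(Arrays.equals(original, decode(id, original.length)));
    }
}
```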

Axoren

I'm going to suggest three methods, with different pros and cons which I will outline.

  1. Hash Code This is the obvious "solution", though it has been correctly pointed out that it will not be unique. However, it will be very unlikely that any two arrays will have the same value.

  2. Weighted Sum Your elements appear to be bounded; perhaps they range from a minimum of 0 to a maximum of 1. If this is the case, you can multiply the first number by N^0, the second by N^1, the third by N^2 and so on, where N is some large number (ideally the inverse of your precision). This is easily implemented, particularly if you use a matrix package, and very fast. We can make this unique if we choose.

  3. Euclidean Distance from Mean Subtract the mean of your arrays from each array, square the results, sum the squares. If you have an expected mean, you can use that. Again, not unique, there will be collisions, but you (almost) can't avoid that.

The difficulty of uniqueness

It has already been explained that hashing will not give you a unique solution. A unique number is possible in theory, using the Weighted Sum, but we have to use numbers of a very large size. Let's say your numbers are 64 bits in memory. That means that there are 2^64 possible numbers they can represent (slightly less using floating point). Eighteen such numbers in an array could represent 2^(64*18) different numbers. That's huge. If you use anything less, you will not be able to guarantee uniqueness due to the pigeonhole principle.

Let's look at a trivial example. If you have four letters, a, b, c and d, and you have to number them each uniquely using the numbers 1 to 3, you can't. That's the pigeonhole principle. You have 2^(18*64) possible numbers. You can't number them uniquely with less than 2^(18*64) numbers, and hashing doesn't give you that.

If you use BigDecimal, you can represent (almost) arbitrarily large numbers. If the largest element you can get is 1 and the smallest 0, then you can set N = 1/(precision) and apply the Weighted Sum mentioned above. This will guarantee uniqueness. The precision for doubles in Java is Double.MIN_VALUE. Note that the array of weights needs to be stored in `BigDecimal`s!

That satisfies this part of your question:

create a computational value for each array, which is unique to it based upon values inside it

However, there is a problem:

1 and 2 suck for K Means

I am assuming from your discussion with Marco13 that you are performing the clustering on the single values, not the length-18 arrays. As Marco has already mentioned, hashing sucks for K Means. The whole idea is that the smallest change in the data will result in a large change in hash values. That means that two images which are similar produce two very similar arrays, which produce two very different "unique" numbers. Similarity is not preserved. The result will be pseudo-random!

Weighted Sums are better, but still bad. It will basically ignore all the elements except for the last one, unless the last element is the same. Only then will it look at the next to last, and so on. Similarity is not really preserved.

Euclidean distance from the mean (or at least some point) will at least group things together in a sort of sensible way. Direction will be ignored, but at least things that are far from the mean won't be grouped with things that are close. Similarity of one feature is preserved, the other features are lost.

In summary

1 is very easy, but is not unique and doesn't preserve similarity.

2 is easy, can be unique and doesn't preserve similarity.

3 is easy, but is not unique and preserves some similarity.
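
For completeness, option 3 can be sketched like this (my own code, not part of the original answer; it assumes the mean is taken component-wise over the whole data set):

```java
public class DistanceFromMean {

    // Component-wise mean of a set of equal-length arrays
    public static double[] mean(double[][] arrays) {
        double[] m = new double[arrays[0].length];
        for (double[] a : arrays) {
            for (int i = 0; i < m.length; i++) {
                m[i] += a[i] / arrays.length;
            }
        }
        return m;
    }

    // Option 3: subtract the mean, square, and sum - one scalar per array
    public static double squaredDistanceFromMean(double[] array, double[] mean) {
        double sum = 0.0;
        for (int i = 0; i < array.length; i++) {
            double diff = array[i] - mean[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] data = { { 0.0, 0.0 }, { 2.0, 2.0 } };
        double[] m = mean(data); // {1.0, 1.0}
        System.out.println(squaredDistanceFromMean(data[0], m)); // 2.0
    }
}
```

Note how direction is lost: any array at the same distance from the mean collapses to the same scalar.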

Implementation of Weighted Sum. Not really tested.

import java.math.BigDecimal;

public class Array2UniqueID {

    private final double min;
    private final double max;
    private final double prec;
    private final int length;

    /**
     * Used to provide a {@code BigInteger} that is unique to the given array.
     * <p>
     * This uses weighted sum to guarantee that two IDs match if and only if
     * every element of the array also matches. Similarity is not preserved.
     *
     * @param min smallest value an array element can possibly take
     * @param max largest value an array element can possibly take
     * @param prec smallest difference possible between two array elements
     * @param length length of each array
     */
    public Array2UniqueID(double min, double max, double prec, int length) {
        this.min = min;
        this.max = max;
        this.prec = prec;
        this.length = length;
    }

    /**
     * A convenience constructor which assumes the array consists of doubles of
     * full range.
     * <p>
     * This will result in very large IDs being returned.
     *
     * @see Array2UniqueID#Array2UniqueID(double, double, double, int)
     * @param length length of each array
     */
    public Array2UniqueID(int length) {
        this(-Double.MAX_VALUE, Double.MAX_VALUE, Double.MIN_VALUE, length);
    }

    public BigDecimal createUniqueID(double[] array) {
        // Validate the data
        if (array.length != length) {
            throw new IllegalArgumentException("Array length must be "
                    + length + " but was " + array.length);
        }
        for (double d : array) {
            if (d < min || d > max) {
                throw new IllegalArgumentException("Each element of the array"
                        + " must be in the range [" + min + ", " + max + "]");
            }
        }

        double range = max - min;

        /* maxNums is the maximum number of numbers that could possibly exist
         * between max and min.
         * The ID will be in the range 0 to maxNums^length.
         * maxNums = range / prec + 1
         * Stored as a BigDecimal for convenience, but is an integer.
         * Assumes the range is an exact multiple of prec, so the division
         * terminates.
         */
        BigDecimal maxNums = BigDecimal.valueOf(range)
                .divide(BigDecimal.valueOf(prec))
                .add(BigDecimal.ONE);

        BigDecimal id = BigDecimal.ZERO;

        // id = sum over i of ((array[i] - min) / prec) * maxNums^i,
        // i.e. a positional base-maxNums representation of the array
        for (int i = 0; i < array.length; i++) {
            BigDecimal num = BigDecimal.valueOf(array[i] - min)
                    .divide(BigDecimal.valueOf(prec))
                    .multiply(maxNums.pow(i));

            id = id.add(num);
        }

        return id;
    }
}
timbo
  • That's quite a sound explanation. Thank you ;) – DarkHorse Mar 22 '14 at 04:35
  • Considering Euclidean distance from the mean, what if I calculate the mean by multiplying values by index positions? Will it preserve similarity? I think it will .. right? – DarkHorse Mar 22 '14 at 07:08
  • Multiplying by index positions will give you a result like the Weighted Sum. In fact, it is a kind of Weighted Sum (using the index position as a weight). This doesn't guarantee uniqueness, nor does it preserve similarity, though it doesn't destroy all similarity either. – timbo Mar 22 '14 at 09:45
  • There is another option, a compromise between 2 and 3 that will preserve some similarity and guarantee uniqueness; using interleaving. I will try and edit and maybe put in some code to demonstrate. – timbo Mar 22 '14 at 09:46
  • OK, I've looked into using interleaving, and it's not easy. While the maths is fine on paper, the code has a problem; you have to raise a BigInteger to the power of another BigInteger. There is no built-in method for that. – timbo Mar 22 '14 at 11:38
  • Ok, I will try this. The problem is I was too damn busy with work to try out all suggestions programmatically. I have got this weekend to try, find, and accept an answer; sadly that bounty period proved too short for me and half the bounty sadly got allocated to the top upvoted answer :(. Ok, I will accept the right answer soon.. :) Thank you once again :) – DarkHorse Mar 22 '14 at 12:06
  • Wanted to add, to avoid confusion, that the Weighted Sum _can_ guarantee uniqueness, but _only_ if you multiply by a big enough number. Multiplying by the index doesn't necessarily work. Take the array [0 1] -> 0*1 + 1*2 = 2 and compare to [2 0] -> 2*1 + 0*2 = 2. The same applies to small numbers; [0 0.01] -> 0.02, [0.02 0] -> 0.02. The number you have to use is dependent on your precision. – timbo Mar 22 '14 at 12:07
  • That's rightly pointed out. I too tried that on paper for small int arrays and I got the same result.. a high index number changes the mean drastically – DarkHorse Mar 22 '14 at 12:09
  • Honestly though DarkHorse, I cannot see anything good coming of reducing 18 dimensions to 1 before you apply KMeans. Also, the only feasible means I can think of to ensure you get a unique number will destroy (or almost destroy) the benefits of KMeans. If you can explain further why you need this, there might be "fudges", like reserving a noise term, that could be used. – timbo Mar 22 '14 at 12:11
  • K-means can be applied in n-D, that's right, but thinking of representation (thinking a little futuristically), if I had to show it on a graph or chart then I am bound to at most 1D, 2D and 3D, not beyond. So I thought that reducing to 1D would give some scope for representation. Also, I haven't seen any implementations so far for similarity of arrays based on the values in them... So this question popped into my mind: why can't a unique value represent an array based on the values in it? :) – DarkHorse Mar 22 '14 at 12:17
  • Sorry, was called away. Well, the Weighted Sum will give unique id for an array. As to the representation, it is common to use PCA for dimensionality reduction. Are you familiar with PCA? – timbo Mar 22 '14 at 14:31

As I understand it, you are going to do k-means clustering based on the double values.

Why not just wrap double value in an object, with array and position identifier, so you would know in which cluster it ended up?

Something like:

 public class Element {
     final public double value;
     final public int array;
     final public int position;
     public Element(double value, int array, int position) {
         this.value = value;
         this.array = array;
         this.position = position;
     }
 }

If you need to cluster array as a whole,

  1. You can transform the original arrays of length 18 into arrays of length 19, with the last or first element being a unique id that you ignore during clustering but can refer to after clustering finishes. This has a small memory footprint of 8 additional bytes per array, and easy association with the original values.
  2. If space is absolutely a problem, and all values of an array are less than 1, you can add a unique id greater than or equal to 1 to each array and cluster based on the remainder of division by 1: 0.07518284315321135 stays 0.07518284315321135 for the 1st, and 0.07518284315321135 becomes 1.07518284315321135 for the 2nd, although this increases the complexity of computation during clustering.
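
The first option can be sketched as follows (hypothetical helper names, my own):

```java
public class TaggedArray {

    // Option 1: prepend a unique id to each feature array; the clustering
    // code should skip index 0 when computing distances.
    public static double[] withId(double[] features, int id) {
        double[] tagged = new double[features.length + 1];
        tagged[0] = id;
        System.arraycopy(features, 0, tagged, 1, features.length);
        return tagged;
    }

    public static void main(String[] args) {
        double[] features = { 0.075, 0.0029 };
        double[] tagged = withId(features, 7);
        System.out.println((int) tagged[0]); // 7
    }
}
```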
mavarazy
  • I want to perform k-clustering on arrays of double values, not on all the double values in the arrays.. And this would add extra object creation for k-clustering, not feasible for thousands of arrays.. – DarkHorse Mar 17 '14 at 05:33
  • Add an extra element with a unique id as the head of the array that would be ignored by the clustering; that way it's only 1 additional element of overhead. – mavarazy Mar 17 '14 at 05:39
  • Can you elaborate your idea a bit more in answer? – DarkHorse Mar 17 '14 at 05:42
  • Ok, your answer focuses more on how to cluster arrays, whereas my question focuses more on how to get a unique single value for an array, so that I can perform k-means in 1D. As @xlm mentioned, k-means works in n-space, but I want to make it work in single space for n arrays with n values.. – DarkHorse Mar 17 '14 at 06:16
  • If you are going to cluster based on a single unique value, that does not make much sense, since your results will depend on the function you choose to generate this unique value, so they can't be relied upon. If you already have a function that transforms your array into a single value, then you already have your unique identifier. – mavarazy Mar 17 '14 at 06:26
  • Yup, for now I don't have any such function, hence I need idea's for it. I will test its results for efficiency. If not, then I will have to move to D-space only.... – DarkHorse Mar 17 '14 at 06:32

First of all, let's try to understand what you need mathematically:

Uniquely mapping an array of m real numbers to a single number is in fact a bijection between R^m and R, or at least N.

Since floating-point numbers are in fact rational numbers, your problem is to find a bijection between Q^m and N, which can be transformed into a bijection between N^m and N, because you know your values will always be greater than 0 (just multiply your values by the precision).

Thus you need to map N^m to N. Take a look at the Cantor Pairing Function for some ideas.
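
A sketch of the Cantor pairing function, folded to handle m values (my own Java code, not from the answer; `BigInteger` is used because iterated pairing grows quickly):

```java
import java.math.BigInteger;

public class CantorPairing {

    // pi(x, y) = (x + y)(x + y + 1)/2 + y, a bijection N x N -> N
    public static BigInteger pair(BigInteger x, BigInteger y) {
        BigInteger s = x.add(y);
        return s.multiply(s.add(BigInteger.ONE))
                .divide(BigInteger.valueOf(2))
                .add(y);
    }

    // Fold an array of naturals into one natural: N^m -> N
    public static BigInteger pairAll(long[] values) {
        BigInteger acc = BigInteger.valueOf(values[0]);
        for (int i = 1; i < values.length; i++) {
            acc = pair(acc, BigInteger.valueOf(values[i]));
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(pair(BigInteger.ONE, BigInteger.ZERO)); // 1
        System.out.println(pair(BigInteger.ZERO, BigInteger.ONE)); // 2
        System.out.println(pairAll(new long[] { 1, 2, 3 }));       // 69
    }
}
```

Each step is invertible, so the original values can in principle be recovered from the single number.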

aviggiano

A guaranteed way to generate a unique result based on the array is to convert it to one big string, and use that for your computational value.

  • It may be slow, but it will be unique based on the array's values.

Implementation examples: Best way to convert an ArrayList to a string
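
For example (my own sketch; `Arrays.toString` is one of the conversions discussed in the linked question):

```java
import java.util.Arrays;

public class ArrayKey {

    // The string includes every value in order, so equal strings
    // imply equal array contents
    public static String key(double[] values) {
        return Arrays.toString(values);
    }

    public static void main(String[] args) {
        double[] a = { 0.075, 0.0029 };
        double[] b = { 0.0029, 0.075 };
        System.out.println(key(a));                // [0.075, 0.0029]
        System.out.println(key(a).equals(key(b))); // false: order matters
    }
}
```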

Matt Woelk