I am currently trying to join two DataSets (part of the Flink 0.10-SNAPSHOT API). Both DataSets have the same form:

predictions:
6.932018685453303E155 DenseVector(0.0, 1.4, 1437.0)

org:
2.0 DenseVector(0.0, 1.4, 1437.0)

general form:
LabeledVector(Double, DenseVector(Double,Double,Double))

What I want to create is a new DataSet[(Double,Double)] containing only the labels of the two DataSets i.e.:

join:
6.932018685453303E155 2.0

Therefore I tried the following command:

val join = org.join(predictions).where(0).equalTo(0){
  (l, r) => (l.label, r.label)
}

But as a result 'join' is empty. Am I missing something?

Till Rohrmann
Flow

1 Answer

You are joining on the label field (index 0) of the LabeledVector type, i.e., building all pairs of elements with matching labels. Your example indicates that you want to join on the vector field instead.

However, joining on the vector field, for example by calling:

org.join(predictions).where("vector").equalTo("vector"){
  (l, r) => (l.label, r.label)
}

will not work, because DenseVector, the type of the vector field, is not recognized as a key type by Flink (the same is true for all kinds of arrays).

Till describes how to compare prediction and label values in a comment below.
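To make the empty result concrete, the equi-join semantics can be modeled with plain Scala collections. This is only an illustration and not Flink code: `Labeled` is a hypothetical stand-in for `LabeledVector`, and the comparison on the vector works here only because plain `Seq` values compare structurally, which is exactly what Flink does not offer for `DenseVector` keys.

```scala
// Hypothetical stand-in for FlinkML's LabeledVector, for illustration only.
case class Labeled(label: Double, vector: Seq[Double])

object JoinSemantics {
  // The two sample rows from the question.
  val original    = Seq(Labeled(2.0, Seq(0.0, 1.4, 1437.0)))
  val predictions = Seq(Labeled(6.932018685453303E155, Seq(0.0, 1.4, 1437.0)))

  // Equi-join on the label (what where(0).equalTo(0) asks for):
  // only pairs with equal labels survive, so nothing matches here.
  val byLabel: Seq[(Double, Double)] =
    for (l <- original; r <- predictions if l.label == r.label)
      yield (l.label, r.label)

  // Equi-join on the vector (what the question actually intends):
  // pairs with equal feature vectors survive.
  val byVector: Seq[(Double, Double)] =
    for (l <- original; r <- predictions if l.vector == r.vector)
      yield (l.label, r.label)
}
```

With the sample data, `byLabel` is empty because 2.0 and 6.932018685453303E155 differ, while `byVector` contains the single pair of labels.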

Fabian Hueske
  • Ah, it seems that I understand now how "join" really works. I thought that the datasets do not need any matching labels and join simply "joins" the datasets. Thank you. But the command does not work. I will try again now. – Flow Aug 13 '15 at 13:14
  • @Flow, the problem is that `Vector` is not a key type. Therefore you cannot join on it. If you want to compare the prediction values with the original values produced by a `Predictor`, then I recommend using the `evaluate` method. This method gives you a `DataSet[(OriginalValue, PredictionValue)]`. Alternatively, you can create tuples of `LabeledVector` and an ID on which you can then join. – Till Rohrmann Aug 13 '15 at 15:29
  • Yeah, I've managed it by your alternative idea. Thank you guys. – Flow Aug 13 '15 at 16:03
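
The ID-based alternative from the comments could look roughly like the sketch below. It is not the code Flow ended up with, which was not posted. It assumes the Scala DataSet API and the FlinkML classes `LabeledVector` and `DenseVector`, and it assumes that every record already carries a `Long` ID that was assigned before prediction and carried through it. The object name `JoinById`, the IDs, and the second sample row are made up for illustration.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector

object JoinById {
  def joinLabels(): Seq[(Double, Double)] = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Assumption: IDs were attached to the records before prediction.
    // The rows with ID 1 are made-up sample data.
    val original: DataSet[(Long, LabeledVector)] = env.fromElements(
      (0L, LabeledVector(2.0, DenseVector(0.0, 1.4, 1437.0))),
      (1L, LabeledVector(3.0, DenseVector(1.0, 2.0, 3.0))))

    val predictions: DataSet[(Long, LabeledVector)] = env.fromElements(
      (0L, LabeledVector(6.932018685453303E155, DenseVector(0.0, 1.4, 1437.0))),
      (1L, LabeledVector(2.5, DenseVector(1.0, 2.0, 3.0))))

    // Join on the Long ID (tuple field 0), which is a valid key type,
    // and keep only the two labels of each matched pair.
    original.join(predictions).where(0).equalTo(0) {
      (l, r) => (l._2.label, r._2.label)
    }.collect()
  }

  def main(args: Array[String]): Unit =
    joinLabels().foreach(println)
}
```

The first data set is named `original` rather than `org` here so that the value does not shadow the `org` package prefix inside the method.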