5

I am trying to implement the K-nearest neighbor algorithm in Spark. I was wondering if it is possible to work with nested RDDs, which would make my life a lot easier. Consider the following code snippet.

public static void main(String[] args) {
    //blah blah code
    JavaRDD<Double> temp1 = testData.map(
        new Function<Vector, Double>() {
            public Double call(final Vector z) throws Exception {
                // nested RDD: mapping over trainData inside the function
                // that is applied to each element of testData
                JavaRDD<Double> temp2 = trainData.map(
                    new Function<Vector, Double>() {
                        public Double call(Vector vector) throws Exception {
                            return (double) vector.length();
                        }
                    }
                );
                return (double) z.length();
            }
        }
    );
}

I am currently getting an error with this nested setup (I can post the full log here). Is it allowed in the first place? Thanks

Hleb
Rajiur Rahman

2 Answers

6

No, it is not possible, because the items of an RDD must be serializable and an RDD is not serializable. And this makes sense: otherwise you might transfer a whole RDD over the network, which is a problem if it contains a lot of data. And if it does not contain a lot of data, you can and should use an array or something like it.
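For example, here is a minimal sketch of that "array" approach (for illustration only: I am assuming testData and trainData are JavaRDD<double[]> of feature vectors, that the training set fits in driver memory, and that squared Euclidean distance is wanted; the class and method names are made up):

import java.util.List;
import org.apache.spark.api.java.JavaRDD;

public class NearestWithCollect {
    // Hypothetical helper: distance from each test point to its nearest training point.
    public static JavaRDD<Double> nearestDistances(JavaRDD<double[]> testData,
                                                   JavaRDD<double[]> trainData) {
        // Collect the (small) training set to the driver; the resulting List is
        // an ordinary Java object, so the closure below can capture it and ship
        // it to the executors with the task, and no nested RDD is needed.
        final List<double[]> train = trainData.collect();

        return testData.map(z -> {
            double best = Double.POSITIVE_INFINITY;
            for (double[] t : train) {
                double d = 0.0;
                for (int i = 0; i < t.length; i++) {
                    double diff = t[i] - z[i];
                    d += diff * diff;
                }
                best = Math.min(best, d);
            }
            return best; // squared distance to the nearest training point
        });
    }
}

The collected list travels with every task closure; for a larger (but still driver-sized) training set, wrapping it in a broadcast variable sends it to each executor only once, as sketched under the second answer below.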

However, I don't know how you are implementing the K-nearest neighbor... but be careful: if you do something like calculating the distance between each pair of points, this is actually not scalable in the dataset size, because it is O(n²).

mgaido
  • Thanks Mark. Your input makes sense to me. However, I was thinking of reducing the nearest neighbors of each test instance with this method. As I also suspected nested RDDs might not be possible, I have started implementing it in a different, old-fashioned way. – Rajiur Rahman Apr 21 '15 at 18:17
  • There are smarter ways to do that, maybe. For instance, in the DBSCAN implementation you find in the Spark machine learning library, the whole dataset space is divided into boxes, so that the complexity of computing neighbors goes down. If you are interested, you can find the code on GitHub, and that may be a good way to improve performance (actually what it does is much more complex than what I told you, but this is the underlying idea). – mgaido Apr 21 '15 at 20:34
  • Just for the record - RDD is serializable. It is really not a serialization issue. – zero323 Mar 24 '16 at 14:25
  • So this is a negative point when comparing Spark with a database. – gtzinos May 12 '18 at 04:57
  • @gtzinos it is just not the way you should do it with Spark... you can perfectly well do a cartesian product in Spark, just as you do in a database, but not in this way (see the sketch after these comments). And if both RDDs are big, this is not a good idea, just as it is not with two big tables in a DB. – mgaido May 13 '18 at 07:37
  • Sure, but sometimes I find that using a DataFrame to join two RDDs is easier than Spark's join method. – gtzinos May 14 '18 at 07:43
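To illustrate the cartesian-product route mentioned in the comments above, here is a hedged sketch (the JavaRDD<double[]> element type, the class name, and the use of zipWithIndex to tag test points are my assumptions, not code from the thread):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class NearestWithCartesian {
    // Distance of each test point (identified by its index) to its nearest
    // training point, computed with an explicit cartesian product.
    public static JavaPairRDD<Long, Double> nearestDistances(JavaRDD<double[]> testData,
                                                             JavaRDD<double[]> trainData) {
        // Give every test point a stable id so its pairs can be reduced together.
        JavaPairRDD<Long, double[]> indexedTest =
                testData.zipWithIndex().mapToPair(t -> new Tuple2<>(t._2(), t._1()));

        return indexedTest.cartesian(trainData)
                .mapToPair(pair -> {
                    Long id = pair._1()._1();
                    double[] z = pair._1()._2();
                    double[] t = pair._2();
                    double d = 0.0;
                    for (int i = 0; i < t.length; i++) {
                        double diff = t[i] - z[i];
                        d += diff * diff;
                    }
                    return new Tuple2<>(id, d); // (test point id, squared distance)
                })
                .reduceByKey(Math::min); // keep the smallest distance per test point
    }
}

If both RDDs are large, this shuffles a quadratic number of pairs, which is exactly the O(n²) caveat mentioned in the answer above.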
2

I ran into a NullPointerException while trying something of this sort, as we can't perform operations on RDDs within an RDD.

Spark doesn't support nesting of RDDs, the reason being that to perform an operation or create a new RDD, the Spark runtime requires access to the SparkContext object, which is available only on the driver machine.

Hence, if you want to operate on nested RDDs, you can collect the parent RDD on the driver node and then iterate over its items using an array or something similar.
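A minimal sketch of that pattern (the countSmaller example, the variable names, and the use of a broadcast variable are my additions, not the asker's code): the inner RDD is collected once on the driver and broadcast, and the loop that a nested RDD.map() would have expressed becomes an ordinary Java loop over the collected list.

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class CollectThenIterate {
    // For each element of the outer RDD, count how many inner values are smaller.
    public static JavaRDD<Integer> countSmaller(JavaSparkContext sc,
                                                JavaRDD<Double> outer,
                                                JavaRDD<Double> inner) {
        // Materialise the inner RDD on the driver, then broadcast it so every
        // executor receives one read-only copy instead of a copy per task.
        Broadcast<List<Double>> innerValues = sc.broadcast(inner.collect());

        return outer.map(x -> {
            int count = 0;
            // This plain loop replaces what the nested inner.map(...) tried to do.
            for (double v : innerValues.value()) {
                if (v < x) {
                    count++;
                }
            }
            return count;
        });
    }
}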

Note: the RDD class is serializable; please see the screenshot below.

[Screenshot of the RDD class declaration, showing that it implements Serializable.]