How can I cluster SIFT descriptors with Apache Spark kmeans (via pickle or not)

Question

Using OpenCV 3.1 I've calculated the SIFT descriptors for a batch of images. Each descriptor has a shape (x, 128) and I've used the pickle based .tofile function to write each descriptor to disk. In a sample of the images, x is between 2000 and 3000.

I'm hoping to make use of Apache Spark's kmeans clustering via pyspark, but my question has two parts:

Is pickling the best way to transfer the descriptor data?
How do I get from the bunch of pickle files to a cluster-ready dataset, and what pitfalls should I be aware of (Spark, pickling, SIFT)?

My interest is in what the sequence would look like for Python 2 code, assuming there is some common storage between the descriptor generation code and the clustering environment.

score 1 · Accepted Answer · answered Aug 31 '16 at 23:46


Is pickling the best way to transfer the descriptor data?

"Best" is very subjective here; it depends on your setup. You could try pickle or protobuf.
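One thing worth checking first: NumPy's ndarray.tofile writes raw bytes rather than a pickle, so the shape and dtype are not stored in the file and have to be supplied again on load. A minimal sketch of the difference, assuming float32 descriptors (what OpenCV's SIFT returns) and using random data and made-up file names as stand-ins:

```python
import os
import tempfile

import numpy as np

# Stand-in for one image's SIFT output: (n_keypoints, 128), float32.
descriptors = np.random.RandomState(0).rand(2500, 128).astype(np.float32)

tmp_dir = tempfile.mkdtemp()

# ndarray.tofile stores only the raw values, so the reader must
# already know the dtype and the row length (128).
raw_path = os.path.join(tmp_dir, "img0.raw")
descriptors.tofile(raw_path)
from_raw = np.fromfile(raw_path, dtype=np.float32).reshape(-1, 128)

# np.save writes a .npy file that records shape and dtype itself.
npy_path = os.path.join(tmp_dir, "img0.npy")
np.save(npy_path, descriptors)
from_npy = np.load(npy_path)
```

If the reader and writer agree on dtype and row length, either format works; .npy just removes the need to remember them.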

How do I get from the bunch of pickle files to a cluster ready dataset?

Deserialize your data.
Create an RDD that will hold the vectors (i.e. every element of the RDD will be a feature, a 128-dimensional vector).
Cache the RDD, since kMeans will use it again and again.
Train the kMeans model to get your clusters.

For example, the LOPQ guys do:

C0 = KMeans.train(first, V, initializationMode='random', maxIterations=10, seed=seed)

where first is the RDD mentioned above, V is the number of clusters, and C0 is the trained k-means model (check it at line 67 on GitHub).

Unpersist your RDD.
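Put together, the steps could look roughly like the sketch below. It assumes the descriptor files are raw float32 bytes (as ndarray.tofile would write them) sitting on storage that Spark can read; bytes_to_rows and cluster_descriptors are hypothetical helper names, not part of Spark or LOPQ, and the seed and iteration count are arbitrary.

```python
import numpy as np

DIM = 128  # length of one SIFT descriptor


def bytes_to_rows(content, dtype=np.float32):
    """Turn the raw bytes of one descriptor file into a list of 128-d rows."""
    flat = np.frombuffer(content, dtype=dtype)
    return list(flat.reshape(-1, DIM))


def cluster_descriptors(sc, path_glob, k, seed=42):
    """Hypothetical helper: deserialize, cache, train, unpersist."""
    # Imported here so bytes_to_rows stays usable without Spark installed.
    from pyspark.mllib.clustering import KMeans

    # binaryFiles yields one (file_path, file_bytes) pair per file;
    # flatMap turns each file into its individual descriptor rows.
    features = sc.binaryFiles(path_glob).flatMap(
        lambda kv: bytes_to_rows(kv[1]))

    # k-means makes several passes over the data, so keep it in memory.
    features.cache()
    model = KMeans.train(features, k, maxIterations=10,
                         initializationMode="random", seed=seed)
    features.unpersist()
    return model
```

With an existing SparkContext sc, a call such as cluster_descriptors(sc, "/shared/descriptors/*.raw", k=1000) (the path and k are placeholders) returns a model whose clusterCenters attribute holds the centroids.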

answered Aug 31 '16 at 23:46

gsamaras


Can I have spark start to persist the rdd while it's ingesting from multiple CSV files on cloud storage or is there a way to see how large the rdd would be and hence the amount of ram I need spark to have access to? – mobcdi Sep 01 '16 at 00:03
@Michael Spark evaluates statements lazily. As a result, it only does actual work when an *action* occurs, not a *transformation*, so the answer to that is no. BTW nice question, you got my upvote with pride! :) BTW, if you know about [kmeans](http://stackoverflow.com/questions/39260820/is-sparks-kmeans-broken), I could really use some help there... – gsamaras Sep 01 '16 at 00:09
