Using OpenCV 3.1, I've calculated the SIFT descriptors for a batch of images. Each descriptor has shape (x, 128), and I've used the pickle based .tofile function to write each descriptor to disk. In a sample of the images, x is between 2000 and 3000.

I'm hoping to make use of Apache Spark's kmeans clustering via pyspark, but my question has two parts.

  1. Is pickling the best way to transfer the descriptor data?
  2. How do I get from the bunch of pickle files to a cluster ready dataset, and what pitfalls should I be aware of (Spark, pickling, SIFT)?

My interest is in what the sequence would look like for Python 2 code, assuming there is some common storage between the descriptor generation code and the clustering environment.

mobcdi

1 Answer

Is pickling the best way to transfer the descriptor data?

"Best" depends on the specifics of your setup here. You could try pickle or protobuf.
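
As a rough illustration of the pickle option, here is a minimal round trip for one descriptor file. The helper names `save_descriptor` and `load_descriptor` and the file name are made up for this sketch, and a plain list of 128-value rows stands in for the (x, 128) array so it runs without OpenCV. A NumPy array can be passed to `pickle.dump` in the same way. Note that NumPy's `ndarray.tofile` writes raw bytes without shape or dtype, so files written with it are not pickles and would have to be read back with `numpy.fromfile` and reshaped to (-1, 128).

```python
import os
import pickle
import tempfile


def save_descriptor(descriptor, path):
    """Pickle one image's descriptor (rows of 128 values) to `path`."""
    with open(path, "wb") as f:
        # Protocol 2 can be read by both Python 2 and Python 3.
        pickle.dump(descriptor, f, protocol=2)


def load_descriptor(path):
    """Read back a descriptor written by save_descriptor."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for one SIFT descriptor: 3 keypoints, 128 values each.
descriptor = [[float(i + j) for j in range(128)] for i in range(3)]

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "image_0001.pkl")
save_descriptor(descriptor, path)
restored = load_descriptor(path)
```

Only unpickle files you produced yourself, since loading a pickle can execute arbitrary code.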

How do I get from the bunch of pickle files to a cluster ready dataset?

  1. Deserialize your data.
  2. Create an RDD that will hold the vectors (i.e. every element of the RDD will be a feature, a 128-dimensional vector).
  3. Cache the RDD, since kMeans will use it again and again.
  4. Train the kMeans model to get your clusters.

For example, the LOPQ guys do:

C0 = KMeans.train(first, V, initializationMode='random', maxIterations=10, seed=seed)

where first is the RDD I mentioned, V is the number of clusters, and C0 is the trained model holding the computed cluster centers (check it at line 67 in GitHub).

  5. Unpersist your RDD.
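
Put together, the steps in the list could look roughly like the sketch below. It assumes the descriptors really are pickle files on storage that every executor can read, that `sc` is an existing SparkContext, and that `paths` is a list of those files. The names `load_rows` and `cluster_descriptors` are hypothetical. This is one possible arrangement, not the LOPQ code.

```python
import pickle


def load_rows(path):
    """Unpickle one descriptor file and return its rows as lists of floats."""
    with open(path, "rb") as f:
        descriptor = pickle.load(f)
    return [[float(v) for v in row] for row in descriptor]


def cluster_descriptors(sc, paths, k, seed=42):
    """Cluster all descriptor rows found in `paths` into k clusters.

    `sc` is assumed to be a live SparkContext; returns the cluster centers.
    """
    # Imported here so load_rows can be used without a Spark install.
    from pyspark.mllib.clustering import KMeans

    # Deserialize on the executors; each RDD element is one 128-d vector.
    vectors = sc.parallelize(paths).flatMap(load_rows)
    # Cache, because k-means makes repeated passes over the data.
    vectors.cache()
    try:
        model = KMeans.train(vectors, k, maxIterations=10,
                             initializationMode="random", seed=seed)
    finally:
        # Release the cached partitions once training has finished.
        vectors.unpersist()
    return model.clusterCenters
```

Loading inside `flatMap` keeps the raw files off the driver, which matters when every image contributes 2000 to 3000 rows.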
gsamaras
  • Can I have spark start to persist the rdd while it's ingesting from multiple CSV files on cloud storage or is there a way to see how large the rdd would be and hence the amount of ram I need spark to have access to? – mobcdi Sep 01 '16 at 00:03
  • @Michael Spark evaluates statements lazily. As a result, it will do actual work only when an *action* occurs, not a *transformation*, so the answer to that is no. BTW nice question, you got my upvote with pride! :) BTW, if you know about [kmeans](http://stackoverflow.com/questions/39260820/is-sparks-kmeans-broken), I would really need some help here... – gsamaras Sep 01 '16 at 00:09