Using OpenCV 3.1, I've calculated the SIFT descriptors for a batch of images.
Each image's descriptor array has shape (x, 128),
and I've used the pickle-based .tofile
function to write each array to disk. In a sample of the images, x is between 2000 and 3000.
I'm hoping to use Apache Spark's k-means clustering via PySpark, but my question has two parts:
- Is pickling the best way to transfer the descriptor data?
- How do I get from a set of pickle files to a cluster-ready dataset, and what pitfalls should I be aware of (Spark, pickling, SIFT)?
My interest is in what the sequence would look like in Python 2 code, assuming there is some common storage between the descriptor generation code and the clustering environment.
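
To make the write/read step concrete, here is a minimal sketch of the round trip, not the actual pipeline. It assumes the files hold raw float32 values as produced by NumPy's ndarray.tofile, which stores no shape or dtype, so both have to be supplied when reading. If the files were in fact written with pickle (for example ndarray.dump), the read step would differ. The helper names, the file name, and the random array standing in for real SIFT descriptors are all hypothetical.

```python
import os
import tempfile

import numpy as np

DESCRIPTOR_DIM = 128  # length of one SIFT descriptor


def write_descriptors(descriptors, path):
    """Write an (x, 128) array to disk as raw float32 bytes (no header)."""
    np.asarray(descriptors, dtype=np.float32).tofile(path)


def read_descriptors(path):
    """Read raw float32 bytes back and restore the (x, 128) shape."""
    flat = np.fromfile(path, dtype=np.float32)
    return flat.reshape(-1, DESCRIPTOR_DIM)


# Hypothetical stand-in for one image's descriptors (x = 2500).
rng = np.random.RandomState(0)
fake = rng.rand(2500, DESCRIPTOR_DIM).astype(np.float32)

# Hypothetical location on the shared storage.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "image_0001.sift")

write_descriptors(fake, path)
restored = read_descriptors(path)

# Each row is one 128-dimensional point; a list of such rows is the
# kind of input that would then be handed to the clustering step.
rows = [row.tolist() for row in restored]
```

The Spark side is deliberately left out here. The sketch only shows that each file can be turned back into rows of 128 values, which is the form the clustering step would need.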