I'm building an application that finds all images similar to a user's input image, using Hadoop.
I'm implementing it in two ways:
Way 1:
My image collection is converted to a SequenceFile to be used as input for the map function. Then, in the map function, I use the OpenCV library to compare those images with the user's input image, which involves these steps:
- Extract keypoints
- Compute descriptors
- Calculate the distance between each pair of descriptor sets to find the similarity

In the reduce function, I just copy the images that are similar to the output folder. A simplified sketch of this map function is shown below.
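(This is only a rough sketch, not my exact code: the ORB detector, the match thresholds, and the helper names are illustrative, and the query image's descriptors are assumed to be loaded once in setup().)

```java
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.MatOfByte;
import org.opencv.core.MatOfDMatch;
import org.opencv.core.MatOfKeyPoint;
import org.opencv.features2d.DescriptorMatcher;
import org.opencv.features2d.ORB;
import org.opencv.imgcodecs.Imgcodecs;

public class SimilarityMapper extends Mapper<Text, BytesWritable, Text, Text> {

    private ORB orb;
    private DescriptorMatcher matcher;
    private Mat queryDescriptors;   // descriptors of the user's input image

    @Override
    protected void setup(Context context) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        orb = ORB.create();
        matcher = DescriptorMatcher.create(DescriptorMatcher.BRUTEFORCE_HAMMING);
        queryDescriptors = loadQueryDescriptors(context);
    }

    @Override
    protected void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // Decode the JPEG bytes stored in the SequenceFile.
        Mat image = Imgcodecs.imdecode(new MatOfByte(value.copyBytes()),
                                       Imgcodecs.IMREAD_GRAYSCALE);

        // Extract keypoints and compute descriptors for this collection image.
        MatOfKeyPoint keypoints = new MatOfKeyPoint();
        Mat descriptors = new Mat();
        orb.detectAndCompute(image, new Mat(), keypoints, descriptors);

        // Match against the query descriptors and count "good" matches.
        MatOfDMatch matches = new MatOfDMatch();
        matcher.match(queryDescriptors, descriptors, matches);
        long good = matches.toList().stream()
                           .filter(m -> m.distance < 50)   // illustrative threshold
                           .count();

        if (good > 30) {                                    // illustrative threshold
            context.write(new Text("similar"), key);        // the reducer copies these images
        }
    }

    private Mat loadQueryDescriptors(Context context) {
        // Placeholder: in the real job the query image's descriptors are
        // loaded from the job configuration / distributed cache.
        return new Mat();
    }
}
```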
Way 2:
Similar to way 1, except:
I first use HBase to store the image features (keypoints, descriptors). Because OpenCV does not provide a direct way to convert the keypoint and descriptor data types to byte[] (and to insert data into HBase, everything has to be converted to byte[]), I have to use the trick described in this: OpenCV Mat object serialization in java. A sketch of that conversion is below.
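The trick basically packs the Mat's dimensions and type next to its raw data. Something like this minimal sketch (assuming single-channel CV_8U descriptors such as ORB's; the class and method names are just mine):

```java
import java.nio.ByteBuffer;

import org.opencv.core.Mat;

public final class MatBytes {

    // Serialize: a small header (rows, cols, type) followed by the raw data.
    public static byte[] toBytes(Mat descriptors) {
        int rows = descriptors.rows(), cols = descriptors.cols();
        byte[] data = new byte[rows * cols];
        descriptors.get(0, 0, data);                 // copy the Mat contents into a Java array

        ByteBuffer buf = ByteBuffer.allocate(12 + data.length);
        buf.putInt(rows).putInt(cols).putInt(descriptors.type());
        buf.put(data);
        return buf.array();
    }

    // Deserialize: rebuild the Mat from the header and the raw data.
    public static Mat fromBytes(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int rows = buf.getInt(), cols = buf.getInt(), type = buf.getInt();

        byte[] data = new byte[rows * cols];
        buf.get(data);

        Mat descriptors = new Mat(rows, cols, type);
        descriptors.put(0, 0, data);                 // copy the array back into the Mat
        return descriptors;
    }
}
```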
Then, in the map function, I just query the image features from HBase and compare them with the user's input image features. A sketch of that lookup is below.
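So the way-2 mapper only does an HBase lookup plus a deserialization, roughly like this (a sketch assuming the HBase 1.x client API; the table name "image_features", column family "f" and qualifier "desc" are placeholders, not my real schema):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.opencv.core.Mat;

public class FeatureStore {

    private final Connection connection;

    public FeatureStore(Configuration hadoopConf) throws IOException {
        // One connection per task (opened in setup()), not one per map() call.
        this.connection = ConnectionFactory.createConnection(HBaseConfiguration.create(hadoopConf));
    }

    public Mat loadDescriptors(String imageName) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("image_features"))) {
            Get get = new Get(Bytes.toBytes(imageName));
            Result result = table.get(get);
            byte[] raw = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("desc"));

            // The step I suspect is the bottleneck: byte[] -> OpenCV Mat
            // on every row, using the serialization trick mentioned above.
            return MatBytes.fromBytes(raw);
        }
    }
}
```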
Intuitively, saving all the image features to a database first and then just querying them for comparison with the user's input image should be faster than extracting those features from scratch inside each map function.
But when I actually implemented and tested both ways on my virtual machine (standalone mode), way 2 ran slower than way 1, and its running time is not acceptable. My guess is that way 2 is slow because, in the map function, it takes a lot of time to convert the byte[] values from HBase back into OpenCV keypoint and descriptor objects before doing the comparison, and that degrades the performance of the whole map function.
My collection contains just 240 images in JPG format.
So my question is: besides the reason I gave above, are there any other reasons why way 2 would run slower than way 1, such as:
- Running in standalone mode is not recommended for HBase?
- The input size is not big enough to justify using HBase?
Thanks.