What Data Structure is faster for big amount of data?

Question

I am working on a project on Spring Boot and have to process a lot of information stored in Solr. I have to compare all my stored images with the entered by the user and establish a similitude. I used LinkedList of images at the beginning, now working with Arrays and LinkedList, but is also very slow and sometimes not working. I am talking about 11 000 000 images that I have to process. Here is my code:

 public LinkedList<Imagen> comparar(Imagen[] lista, Imagen imagen) throws NullPointerException {
    LinkedList<Imagen> resultado = new LinkedList<>();
    for (int i = 0; i < lista.length; i++) {
        if (lista[i].getFacesDetectedQuantity() == imagen.getFacesDetectedQuantity()) {
            lista[i].setSimilitud(3);
        }
        if (herramientas.rangoHue(imagen.getPredominantColor_hue()).equals(herramientas.rangoHue(lista[i].getPredominantColor_hue()))) {
            lista[i].setSimilitud(3);
        }
        if (lista[i].isTransparency() == imagen.isTransparency()) {
            lista[i].setSimilitud(4);
        }
        if (analizar.compareFeature(herramientas.image64ToImage(lista[i].getLarge_thumbnail()), herramientas.image64ToImage(imagen.getLarge_thumbnail())) > 400) {
            lista[i].setSimilitud(3);
        }
        if (analizar.compare_histogram(herramientas.image64ToImage(lista[i].getLarge_thumbnail()), herramientas.image64ToImage(imagen.getLarge_thumbnail())) > 90) {
            lista[i].setSimilitud(3);
        }
        if (lista[i].getSimilitud() > 7) {
            resultado.add(lista[i]);
        }
    }
    return ordenarLista(resultado);
}


public LinkedList<Imagen> ordenarLista(LinkedList<Imagen> lista) {
    LinkedList<Imagen> resultado = new LinkedList<>();
    for (int y = 0; y < lista.size(); y++) {
        Imagen imagen = lista.get(0);
        int posicion = 0;
        for (int x = 0; x < lista.size(); x++) {
            if (lista.get(x).getSimilitud() > imagen.getSimilitud()) {
                imagen = lista.get(x);
                posicion = x;
            }
        }
        resultado.add(imagen);
        lista.remove(posicion);
    }
    return resultado;
}

Any idea of what data structure could I use to make the process faster. I also was thinking on make every comparative if inside a thread but also not idea how to do that. A lot of googling and nothing found. Sorry for my English and Thanks!!!

I solved the problem of sorting with ordenarLista() method just ignoring it and add this code on my comparar() method before returning the list.

Collections.sort(resultado, new Comparator<Imagen>() {

            @Override
            public int compare(Imagen image1, Imagen image2) {
                return image2.getSimilitud() - image1.getSimilitud();
            }
        });

Still working on my algorithm!

Its not a choice of data structure here, It is how you are going to process them. Your approach of Threads is right. But to avoid hazzle in developing concurrency issues, try using RxJava library or Use MapReduce in case you have that flexibility to use HDFS. — Karthik, Mar 21 '18 at 05:01
But I am having trubles even to get them form Solr, a lot of time and memory required. Not sure how to handle with that amount. I am getting crazy with this. I will try to find out about what you suggest. Thanks — Andres Molina, Mar 21 '18 at 05:10
Do you really need to load all the images from Solr? I think that most of the image features that you compare could be calculated while storing the image in solr and then you could just search the images based on that and receive only the ones matching. — Mateusz Mrozewski, Mar 21 '18 at 05:25

davidxxx · Answer 1 · 2018-03-21T05:30:47.587

In a general way, before trying to optimize any part at random, use monitoring tool as JVisualVM to detect exactly the costly invocations. You have to place efforts at the correct place.

Besides, tracing the time elapsed for the first big processing (before ordenarLista()) and the second one (ordenarLista()) should be helpful too.

Actually, I note some things :

1) Very probably an issue : comparar() does many duplication processings that can be expensive in terms of CPU.

Look at these two invocations:

if (analizar.compareFeature(herramientas.image64ToImage(lista[i].getLarge_thumbnail()), herramientas.image64ToImage(imagen.getLarge_thumbnail())) > 400) {
    lista[i].setSimilitud(3);
}
if (analizar.compare_histogram(herramientas.image64ToImage(lista[i].getLarge_thumbnail()), herramientas.image64ToImage(imagen.getLarge_thumbnail())) > 90) {
    lista[i].setSimilitud(3);
}

You invoke for example 4 times herramientas.image64ToImage() at each iteration.

This should be executed once before the loop :

herramientas.image64ToImage(imagen.getLarge_thumbnail())

But you execute it millions of times in the loop. Just store the result in a variable before the loop and use it in. The same thing for :

herramientas.rangoHue(imagen.getPredominantColor_hue()

All computations that depend only of the Imagen imagen parameter should be computed before the loop and never in to spare millions of them.

2) ordenarLista() seems having an issue : you hardcoded the first index here :

Imagen imagen = lista.get(0);

3) ordenarLista() iterates potentially many times :

lista.size() + lista.size() 
+
lista.size()-1 + lista.size() 
+
lista.size()-2 + lista.size() 
+
...
+ 1 * lista.size()

Imagine with 1.000.000 of elements in :

1.000.000 + 1.000.000 
+
999.999  + 1.000.000 
+
999.998  + 1.000.000 
+
...
+ 
1 + 1.000.000

It makes many millions...

@Heikel Molina You are welcome. You should probably rethink the actual logic of `ordenarLista()`. Are all these iterations (with many duplicate comparisons) make sense ? I don't know you functional processing. So I cannot judge this part. — davidxxx, Mar 21 '18 at 05:36
I just want to get a list with the biggest "similitud" at the begining. I will rethink the logic. Thanks again! — Andres Molina, Mar 21 '18 at 05:40
Already solved the `ordenarLista()` problem. Not sure if it is the best way but it is what I googled. I have been in University learning Java for 5 years but never faced a real problem. Now I am really "learning". — Andres Molina, Mar 22 '18 at 05:55
Excellent news. I hesitated to propose the `Comparator` approach initially as I was not sure of your requirement. If you solve all your issues, don't hesitate to write an answer. It could help others. — davidxxx, Mar 22 '18 at 08:50

user207421 · Answer 2 · 2018-03-22T04:10:58.753

0

If you're using get(int) you should certainly be using ArrayList, not LinkedList.

However it's not just your data structures, it's your terrible algorithms.

For example, in your ordenarLista() method, lista.get(0) should be lista.get(y), and posicion = 0 should be posicion = y, and the inner loop should start from y+1. not zero.

Or else you don't need the outer loop at all.

edited Mar 22 '18 at 04:10

answered Mar 21 '18 at 05:07

user207421

305,947
44
307
483

score 0 · Answer 3 · answered Mar 21 '18 at 05:30

Don't really understand what you did. But I think you are searching things linearly, just some if else cannot make things better. Using the BTree algorithm to sort and search may be a good idea, every database use that algorithm. As you can see, a database usually do a great job at querying records.

BTree java example.

Just in case you don't understand what BTree is: Wikipedia

But never use a real database for storing image. Reason

score 0 · Answer 4 · answered Mar 21 '18 at 05:50

0

Seems like you may need a help from this: java.util.concurrent.Future.

You can try to split the for loop in public LinkedList<Imagen> comparar(Imagen[] lista, Imagen imagen) using this java.util.concurrent.Future and see if it reduced the time it process.

If the speed is reduced you can then again add the java.util.concurrent.Future in the for loop of public LinkedList<Imagen> ordenarLista(LinkedList<Imagen> lista)

answered Mar 21 '18 at 05:50

pvma

403
8
14

Hey thanks! I already used `Future` and it worked fine. The process is faster now like 70%. – Andres Molina Mar 23 '18 at 05:43

TilmannZ · Answer 5 · 2018-03-25T20:57:10.717

I suspect using a 'List' may not be a good choice for what you are doing (this answer contains quite some guess, as I'm not quite sure of the intentions of your program, I still hope it is useful).

If your program tries to detect similar images, there are already a number of algorithms and library to compare image similarity, for example here.

If you don't want to change you approach too much, or if it's not about image similarity, a multi-dimensional index may be something you should look at.

For example, it looks like you calculate certain values for each image (hue, a number of histogram values, number of faces). You can pre-calculate all these values once for each image and then put them in a large vector/array:

double[] calculateVector(Image image) {
    //put all image characteristics into a single array
    double[] vector = new double[]{hue, #of faces, hist value 1, histo value 2, ...};
    return vector;
}

This may give you one vector/array per image with, say, 20 'double' value. Then you use a multi-dimensional index, such as KD-Tree or R*Tree (there are some sample implementations in my own project).

KDTree allImages = new KDTree(20);
for (Image image : all images) {
    double[] vector = calculateVector(image);
    kdtree.put(vector, image);
}

Now if you have a new image and want to find the 5 most similar images you calculate the vector/array for the new image and the perform a kNN-query (k-nearest neighbor query) on the index.

double[] newImageVector = calculateVector(newImage); 
List result = kdtree.queryKNN(newImageVector, 5);  //method names may vary between implementation

This gives you a list of the 5 most similar images. This is usually very fast, the complexity is about O(log n), and you should be able to execute it several 1000 times per second. If you want to know more about multi-dim indexing, search the web for 'kNN query'.

What Data Structure is faster for big amount of data?

5 Answers5