I want to remove duplicate values from my data. I know this is a frequently asked question on Stack Overflow, but my problem is a little different because I am handling a very large amount of data, so execution time is my most important consideration.
As in the snippet below, I wrote a simple piece of code for removing duplicate values.
import java.util.HashMap;

// Suppose the dataset is so huge that
// multi-node resources may be necessary.
String[] data = new String[10_000_000];
HashMap<String, String> uniqueItems = new HashMap<>();
for (int i = 0; i < data.length; i++) {
    if (uniqueItems.containsKey(data[i])) {
        // Replace the existing entry (put() alone would overwrite it anyway).
        uniqueItems.remove(data[i]);
        uniqueItems.put(data[i], "inserted");
    } else {
        uniqueItems.put(data[i], "inserted");
    }
}
However, I am not satisfied with it, because I suspect that other data structures or algorithms could remove duplicates more efficiently than my code does. So I am looking for better ways to remove duplicate values quickly when the data is large. I would appreciate it if you could tell me the fastest way to remove duplicate values.
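For reference, the simplest alternative I could think of is a HashSet, which stores only the keys instead of dummy values (a minimal sketch, assuming the whole array still fits in a single JVM's heap):

import java.util.HashSet;
import java.util.Set;

String[] data = new String[10_000_000];
// add() does nothing when the value is already present,
// so the set ends up holding exactly the distinct values.
Set<String> uniqueItems = new HashSet<>(20_000_000); // pre-sized to limit rehashing
for (String s : data) {
    uniqueItems.add(s);
}

A stream-based version like Arrays.stream(data).parallel().distinct().toArray(String[]::new) also occurred to me, but I do not know how it compares in practice on data of this size.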
I am also wondering whether the proportion of duplicate values affects performance. For example, if duplicates make up 50% of the original data, would the best choice of algorithm and data structure change? If so, I would like an approach that performs well in the general case.
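To experiment with that myself, I generate test inputs with a configurable duplicate ratio roughly like this (a rough sketch; the pool size and the fixed seed are arbitrary choices I made for repeatable runs):

import java.util.Random;

// Build `size` strings where about `ratio` of them are drawn from a
// small pool (producing duplicates) and the rest are guaranteed unique.
static String[] generate(int size, double ratio, int poolSize) {
    Random rnd = new Random(42); // fixed seed so runs are repeatable
    String[] data = new String[size];
    for (int i = 0; i < size; i++) {
        if (rnd.nextDouble() < ratio) {
            data[i] = "dup-" + rnd.nextInt(poolSize);
        } else {
            data[i] = "unique-" + i;
        }
    }
    return data;
}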