
I have many huge unsorted lists containing (user, amountSpent) tuples. Each list corresponds to one day. Now I want to merge all the lists into one list that contains the accumulated value for each user. I have two approaches:

  • Approach 1: Sort the individual lists and then merge them iteratively, as in merge sort.

  • Approach 2: Build a HashMap with the user as the key, then iterate through each list and update the value for a key if it exists, or add a new key with its value.

Suppose there are m lists and each list may have a different length (k1, k2, ..., km).

Questions:

Which is the more efficient solution?
Which solution can be run in multiple threads?
Or is there a better solution?

Example:

Day 1: (user1,100),(user2,200)
Day 2: (user2,10),(user1,100),(user3,10)
List after merging: (user1,200),(user2,210),(user3,10)
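
For reference, a minimal sketch of Approach 2 in Java. The `Spend` class here is hypothetical, just to represent the (user, amountSpent) pairs; `Map.merge` (Java 8+) does the "add the key if absent, otherwise sum" step in a single call:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MergeByUser {

    /** Hypothetical value type for one (user, amountSpent) entry. */
    static class Spend {
        final String user;
        final long amount;
        Spend(String user, long amount) { this.user = user; this.amount = amount; }
    }

    /** Approach 2: accumulate every day's entries into one map keyed by user. */
    static Map<String, Long> accumulate(List<List<Spend>> days) {
        Map<String, Long> totals = new HashMap<>();
        for (List<Spend> day : days) {
            for (Spend s : day) {
                // Adds the key if absent, otherwise applies Long::sum to the old and new values.
                totals.merge(s.user, s.amount, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<List<Spend>> days = Arrays.asList(
                Arrays.asList(new Spend("user1", 100), new Spend("user2", 200)),
                Arrays.asList(new Spend("user2", 10), new Spend("user1", 100), new Spend("user3", 10)));
        // Prints the accumulated totals from the example above,
        // e.g. {user1=200, user2=210, user3=10} (HashMap iteration order is not guaranteed).
        System.out.println(accumulate(days));
    }
}
```

This is a single pass over all entries, so each (user, amount) pair is touched exactly once.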

Knight71
    Looks like an ideal problem for the [map reduce programming model](https://en.wikipedia.org/wiki/MapReduce) – Peter de Rivaz Dec 31 '15 at 09:48
  • The HashMap approach is better since it's O(N) – gen-y-s Dec 31 '15 at 15:21
  • Just noticed one thing dearly missing from the task description: Are the keys/users unique in each given list? For updates of values kept by key in a `java.util.Map`, I urge to consider [merge(key, value, remappingFunction)](https://docs.oracle.com/javase/8/docs/api/java/util/Map.html#merge-K-V-java.util.function.BiFunction-). – greybeard Jan 01 '16 at 03:23
  • Approach 1 doesn't sum anything. – H H Jan 01 '16 at 12:35
  • In approach one, while merge sorting, you need to add the values if the user is the same. – Knight71 Jan 02 '16 at 16:25

3 Answers


The HashMap approach is better since it's O(N). Both solutions could be run in multiple threads, but they need different modifications to support concurrency.
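
A hedged sketch of what one such modification could look like for the HashMap approach, assuming one task per day-list and a shared `ConcurrentHashMap` (its per-key `merge` is atomic, so no external locking is needed; all names here are illustrative):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentAccumulate {

    /** Each inner list is one day's (user, amount) entries; users may repeat within a day. */
    static Map<String, Long> accumulate(List<List<SimpleEntry<String, Long>>> days, int threads)
            throws InterruptedException {
        ConcurrentHashMap<String, Long> totals = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (List<SimpleEntry<String, Long>> day : days) {
            // One task per day; merge() on ConcurrentHashMap updates each key atomically.
            pool.submit(() -> {
                for (SimpleEntry<String, Long> e : day) {
                    totals.merge(e.getKey(), e.getValue(), Long::sum);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return totals;
    }
}
```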

gen-y-s
  • Using approach one, concurrency can be easily achieved by having a pool of threads that merge two lists at a time and put the result back in a queue. But I am not sure how you would do it with a HashMap? – Knight71 Dec 31 '15 at 16:27
  • You can have threads for the second approach by having each thread take the next list in the queue, process it, and update the hash table accordingly (the hash table access should be synchronized at the bucket level). – gen-y-s Dec 31 '15 at 21:21

1 Overall approach and complexity

I suppose:

  • the number of clients is large but finite = N
  • the number of days keeps growing = M
  • and perhaps each day we have every client (or almost).

Minimal complexity to do the job:

  • process each data point, so M.N operations. As you don't want to keep the individual elements, only the sum, you just need to do: partial sum + new value. So everything takes M.N x a constant time (I suppose you don't have billions of billions of billions of $).

  • you have to aggregate the data over N clients (and for each data point, you have to find, sum, store, ... the right client). For me, the minimum is to sort the clients at least once (or use any method that indexes them), so the time to do that is O(N log N) with the best algorithms and best implementations (there are also faster ways, but they need a very large amount of space).

So, you need at least O(N log N) + O(M.N).

2 Possible solutions

Your approach 1 wastes time, because you sort every list (over the same data): you would need M.O(N log N) + O(M.N). You only need one sort (to be able to sum afterwards).

Your approach 2 is the shortest way.

3 How to parallelize?

You have (at least) two ways to split your data: by day or by client. Because you want to sum per client, use the second.

Your process is easily scalable.

Then you can use a simple hash function (the first or last character of the client, or something very simple) => each thread (or process, or machine) receives all the data and keeps only the data for its own clients.

You can split every job like that (process, sum, retrieve, ...).

It will take almost the same overall time:

with k processes, you will have k.O(N/k log N/k) + k.Ox(M.N) + k.O(M.N/k)

The time you win by splitting into N/k, you give back in the selection step (Ox, which I suppose is very fast).

Then you can distribute your job across many machines, which will be independent.
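
One way this could look in code, as a sketch only (class and method names are illustrative, not from the answer): route each entry to one of k partitions by a hash of the user, let each worker sum its own partition with a private HashMap, then combine the disjoint partial results:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionByUser {

    /** Split the data by a hash of the user into k partitions, sum each partition independently, then combine. */
    static Map<String, Long> accumulate(List<List<SimpleEntry<String, Long>>> days, int k) throws Exception {
        // Step 1: route every (user, amount) entry to the partition that owns that user.
        List<List<SimpleEntry<String, Long>>> partitions = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            partitions.add(new ArrayList<>());
        }
        for (List<SimpleEntry<String, Long>> day : days) {
            for (SimpleEntry<String, Long> e : day) {
                partitions.get(Math.floorMod(e.getKey().hashCode(), k)).add(e);
            }
        }

        // Step 2: each worker sums only its own users, so there is no shared state and no locking.
        ExecutorService pool = Executors.newFixedThreadPool(k);
        List<Future<Map<String, Long>>> partialSums = new ArrayList<>();
        for (List<SimpleEntry<String, Long>> partition : partitions) {
            Callable<Map<String, Long>> worker = () -> {
                Map<String, Long> local = new HashMap<>();
                for (SimpleEntry<String, Long> e : partition) {
                    local.merge(e.getKey(), e.getValue(), Long::sum);
                }
                return local;
            };
            partialSums.add(pool.submit(worker));
        }

        // Step 3: the partial maps have disjoint key sets, so combining them is a plain putAll.
        Map<String, Long> totals = new HashMap<>();
        for (Future<Map<String, Long>> f : partialSums) {
            totals.putAll(f.get());
        }
        pool.shutdown();
        return totals;
    }
}
```

The same routing function can just as well send each partition to a different process or machine instead of a thread.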

Hope it helps.

  • Fantastic answer! What if the client list is already sorted per day? In that case, do you still prefer the second one? – Knight71 Jan 01 '16 at 12:45
  • If you have almost every client in the first list, then you save the O(N log N), because it is already done. If you have batches of new users, it is like merging already-sorted batches, which is linear. Normally, you should be able to do it with TreeMaps. – guillaume girod-vitouchkina Jan 01 '16 at 13:00
  • ([you don't have billions of billions of billions of $](http://www.barrypopik.com/index.php/new_york_city/entry/a_billion_here_a_billion_there_pretty_soon_youre_talking_real_money/) - `long`s go a _long_ way, even using Millicents - err - millicents.) In the context of parallel-processing (tag) and `distribute your job on many machines`, _client_ may need definition, esp. with the question not mentioning it, but `user`. – greybeard Jan 01 '16 at 13:42

The complexity of the sort-and-merge solution is O(mn log n) + O(n log m), where m is the number of lists, assuming each list is of size n.

To compute the complexity of the hash-based solution, let us assume there are k users. Inserting k elements into a HashMap (Java) or map (C++) takes O(k log k). Changing the values for the remaining mn-k entries takes (mn-k)O(1) in the best case and (mn-k)O(log k) in the worst case. The overall complexity is O(mn log k). So hashing seems the better of the two, especially when k is much smaller than mn.
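
As a point of reference (not code from this answer), a sketch of the merge step of the first solution, under the assumption that each list has already been sorted by user: an m-way merge driven by a `PriorityQueue`, summing the amounts of equal users as they leave the heap.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMergeSum {

    /** A heap node: the cursor's current entry plus the iterator it came from. */
    static class Node {
        SimpleEntry<String, Long> entry;
        final Iterator<SimpleEntry<String, Long>> rest;
        Node(SimpleEntry<String, Long> entry, Iterator<SimpleEntry<String, Long>> rest) {
            this.entry = entry;
            this.rest = rest;
        }
    }

    /** Merge m lists that are already sorted by user, summing the amounts of equal users. */
    static List<SimpleEntry<String, Long>> mergeSorted(List<List<SimpleEntry<String, Long>>> sortedLists) {
        Comparator<Node> byUser = Comparator.comparing(n -> n.entry.getKey());
        PriorityQueue<Node> heap = new PriorityQueue<>(byUser);
        for (List<SimpleEntry<String, Long>> list : sortedLists) {
            Iterator<SimpleEntry<String, Long>> it = list.iterator();
            if (it.hasNext()) heap.add(new Node(it.next(), it));
        }

        List<SimpleEntry<String, Long>> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Node smallest = heap.poll();
            // If the smallest user equals the last one written out, sum into that entry.
            if (!merged.isEmpty() && merged.get(merged.size() - 1).getKey().equals(smallest.entry.getKey())) {
                SimpleEntry<String, Long> last = merged.get(merged.size() - 1);
                last.setValue(last.getValue() + smallest.entry.getValue());
            } else {
                merged.add(new SimpleEntry<>(smallest.entry.getKey(), smallest.entry.getValue()));
            }
            // Advance this list's cursor and push it back onto the heap if anything remains.
            if (smallest.rest.hasNext()) {
                smallest.entry = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }
}
```

Each item is pushed onto and popped off the heap exactly once, at O(log m) per heap operation.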

Erobrere
  • Wrong... wrong... wrong... inserting and updating an item in a hash table is O(1). In C++ it's unordered_map, in Java it's the HashMap. Inserting all elements into the hash would be O(N), where N is the total number of items in all lists (this is not exactly mn since the lists are of different sizes). – gen-y-s Dec 31 '15 at 15:17
  • Also, merging m lists with N items in total, would take O(N log m) time. – gen-y-s Dec 31 '15 at 15:20
  • @greybeard Yes, each user will have a value hence one pair per user. Logarithmic doesn't mean O(n log n) always (e.g., binary search has O(log N)). Please check again and let me know. – Erobrere Jan 01 '16 at 01:20
  • @gen-y-s Please look up how HashMap and map are implemented and you'll find that they are actually trees. Therefore, logarithmic insertion time. Even if the lists are not of the same sizes, why would you insert all elements into the HashMap? That's exactly why I mentioned that some k will be inserted, 0 <= k <= mN. In the first method, if you have to merge a total of `mN` elements, how can you do it in O(N log m) time? I'm curious to learn that method. – Erobrere Jan 01 '16 at 01:21
  • @Erobrere HashMap is a hash table, and TreeMap is a binary search tree. Before you tell anyone to look up anything, you better look it up yourself. And btw, Map is only an interface, the implementations are HashMap and TreeMap. – gen-y-s Jan 01 '16 at 12:01
  • @Erobrere As for merging, I used N as the total number of items in all the lists together (it's not the size of one list). Merging m sorted lists with a total of N items is O(N log m); look it up: https://en.m.wikipedia.org/wiki/Merge_algorithm#K-way_merging – gen-y-s Jan 01 '16 at 12:09
  • First, thanks to both of you; I've identified two mistakes: (a) k-way merging yields the complexity of O(mn log n) + O(n log m) in the first solution (sorry that I have to split comments since they are long). – Erobrere Jan 01 '16 at 20:18
  • (b) I got the O(N) worst-case complexity of get in `HashMap` from JDK 7. As http://stackoverflow.com/questions/4553624/hashmap-get-put-complexity points out, JDK 8 will have O(log N), similar to C++ (see http://en.cppreference.com/w/cpp/container/map and https://en.wikipedia.org/wiki/Red%E2%80%93black_tree). The mistake is the interchanged (mn-k) and k; the actual worst-case complexity should be O(mn log k) in the second method. The best and average cases are linear time, O(mn). Please tell me if I'm okay now and do point me to better sources if I'm wrong. I'm happy to learn too! – Erobrere Jan 01 '16 at 20:41