1

I'll try and describe my problem as best as I can, but please ask if there are things that make no sense.

  • I have a finite number of lists
  • Each list contains a finite number of contacts
  • Each contact is represented as a HashMap
  • Each list is linked to a provider
  • The same contact may be present in multiple providers (and hence multiple lists).
  • I need a 'master' list that contains all the unique entries in the other lists

I'm looking for an efficient way to merge these lists into a master list without duplicates. For example if the same contact appears in multiple lists (multiple HashMaps corresponding to the same physical person) I want to merge all the HashMaps into a single one, and put the merged HashMap into the master list. A simple 'putall' here won't do since I need to re-key the contents to efficiently access them (eg. provider one gives me a list of email addresses keyed as 'emails' and provider 2 gives me the same info keyed as 'emailList').

Merging the individual HashMaps is the easier of two problems since I know these keys and can easily merge them.

The problem that has me scratching my head is efficient scanning of the lists ... is there no other way than linearly going through each list in a nested loop, grabbing the next HashMap, checking if it already exists in the mater list and merging/creating a new one ... ?

copolii
  • 14,208
  • 10
  • 51
  • 80
  • Can you list out the actual code definitions of these structures? What do you mean that a contact is represented by a HashMap? What's the key and what's the value? – WhiteFang34 Apr 27 '11 at 22:14
  • Might help to look at this question http://stackoverflow.com/questions/2405073/remove-duplicates-from-a-sorted-arraylist-while-keeping-some-elements-from-the-du – CoolBeans Apr 27 '11 at 22:19
  • a contact (at this stage) is a parsed JSON response, so it's a bunch of key-value pairs i.e. {"id"="123456", "name"="bob the contact", "email"="bob@bobmail.bob" } – copolii Apr 27 '11 at 23:32

4 Answers4

1

First observation - using a HashMap to represent your contacts smells of "object denial".

You need to design and implement a Contact class to represent a contact. Without this class, your task is a whole bunch harder than it needs to be.

The class needs getters for all of the contact key fields, and it needs to implement equals, hashcode and Comparable based on the key fields. Getters (and optionally setters) are also needed for non-key fields.

With that, the merging process becomes (pseudo-code):

// If you haven't already done so
convert the master list of HashMaps to a list of Contact objects and sort it.
create an empty "new master" list

for each list of contact HashMaps:
    convert the list of HashMaps to a merge list of Contact objects
    sort the merge list
    iterate the sorted master and merge lists in parallel:
        if a master Contact matches a merge Contact:
            merge the two Contacts and add to the new master list
            advance both iterators
        if a master Contact has no corresponding merge Contact:
            copy the master Contact to the new master list
            advance the master iterator.
        if a merge Contact has no corresponding master Contact:
            add the merge Contact to the new master list.
            advance the merge iterator

The performance characteristics of the various phases should be:

  • Conversion of N HashMaps to Contact objects - O(N).
  • Creation of list of N Contacts - O(N)
  • Sort list of N Contacts - O(NlogN)
  • Merging of 2 sorted lists - O(M + N).

The overall performance should be better than O(NlogN) where N is the total number of master and merge Customer objects.

Community
  • 1
  • 1
Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • The object comes from various databases and JSON (i.e. facebook) and not all providers provide complete enough HashMaps to define a contact – copolii Apr 27 '11 at 23:23
  • @copolii - that simply means that your Contact class needs to be able to represent an incomplete contact. You *are* in object denial! – Stephen C Apr 28 '11 at 07:40
  • lol ... it's not my code, I've started half way through the project and yes, I've been looking for reasons why there is no 'Contact' object – copolii Apr 28 '11 at 17:28
  • so ... I talked to the devs and turns out this is a sever case of Object denial. It was written initially by a Perl scripter (??) who "didn't like" objects and these guys were just too lazy to fix it. I am re-engineering the code starting today. Thanks for your help, I think this reply takes the cake as the 'solution'. – copolii Apr 28 '11 at 19:19
0

Make a Map<String,Contact> using a class like the one below. Although, I'm still not sure what you mean by Provider. Perhaps you could provide some more detail on that.

class Contact {

    enum ContactMethod {
        email,
        phone,
        address
    }

    String name;
    Map<ContactMethod,Set<String>> contactInfo;

    Contact(String name) {
        this.name = name;
        this.contactInfo = new HashMap<ContactMethod,Set<String>>();
    }

    void consume(Map<ContactMethod,String> info) {
        for(ContactMethod method : info.keySet()) {
            Set<String> modes = contactInfo.get(method);
            if(modes == null) {
                modes = new HashSet<String>();
                contactInfo.put(method,modes);
            }
            modes.add(info.get(method));
        }
    }
}
corsiKa
  • 81,495
  • 25
  • 153
  • 204
  • Rather than iterating over just the keys and later having to look up the corresponding value again, replace your use of `Map#keySet()` with `Map#entrySet()`. You can then replace the expression `info.get(method)` with `entry.getValue()`. – seh Apr 27 '11 at 23:07
  • a provider is an interface that gets the contacts froma source (i.e. your facebook friends, or outlook contacts, etc.) – copolii Apr 27 '11 at 23:33
  • @seh I am aware of it. I find it to be bulky and unfriendly to both look at and type. Knowing I have a HashMap underneath whenever I use it, the penalty for the additional look-up is negligible. You are correct that in the general case you should do that, since you can't generally guarantee the complexity of the lookup. – corsiKa Apr 28 '11 at 00:35
  • You make a good point about the syntactic overhead. Every time I realize that I should use that approach, I fumble while trying to type `Map.Entry<...>` in the `for()` declaration. – seh Apr 28 '11 at 18:22
0

For your internal master list, can you use a class that you can define a meaningful equals() on to encapsulate the HashMap, instead of just storing straight HashMaps? If you did that, you could switch to using a Collection implementation that has constant-time lookups (e.g., HashSet) for the Master List. That will eliminate the nested iteration and you'll just have to check each Contact from a Provider once. It's trial and error to determine if your number of contacts is large enough that it's an improvement.

Affe
  • 47,174
  • 11
  • 83
  • 83
  • to clarify: the master list is actually a HashMap with key being the contact ID and the value being the Contact HashMap – copolii Apr 27 '11 at 23:21
0

If your lists are sorted, consider this:

Create a "merging" iterator that consumes 2 iterators from your lists.
If the 2 heads are the same, toss one. Otherwise present the smaller of the two.
If one head is from an exhausted (empty) iterator, simply present the other.

Now you have an iterator that produces a unique sorted sequence from 2 iterators.

You can pile these up as many as needed to get a unique iterator for all your lists.

Mark Bolusmjak
  • 23,606
  • 10
  • 74
  • 129
  • the lists are not sorted (they're actually HashMaps that use the contact's ID as key for fast lookup) – copolii Apr 27 '11 at 23:34