2

I have a list like the following in python (the real one is huge and I cannot do this only by looking at it):

original1=[['email', 'tel', 'fecha', 'descripcion', 'categ'],
          ['a@gmail.com', '1', '2014-08-06 00:00:06', 'MySpace a', 'animales'],
          ['b@gmail.com', '1', '2014-08-01 00:00:06', 'My Space a', 'ropa'],
          ['a@gmail.com', '2', '2014-08-06 00:00:06', 'My Space b', 'electronica'],
          ['b@gmail.com', '3', '2014-08-10 00:00:06', 'Myace c', 'animales'],
          ['c@gmail.com', '4', '2014-08-10 00:00:06', 'Myace c', 'animales']]

I split it between data and names to work with data:

datos=original1[-(len(original1)-1):len(original1)]

I need to do a dictionary that has all the duplicates together, considering email and tel, but I need to apply transitivity: since line 0= line 2 if we consider email, but also line 1 if we consider tel, and line 1= line 3 if we consider email again, I need to get that all candidates in this case are 0,1,2 and 3, while 4 is alone.

I created the following code:

from collections import defaultdict
email_to_indices = defaultdict(list) 
phone_to_indices = defaultdict(list)

for idx, row in enumerate(datos): 
    email = row[0].lower() 
    phone = row[1]
    email_to_indices[email].append(idx) 
    phone_to_indices[phone].append(idx)

So now I need to apply transitivity rules, to get together 0 to 3 and alone 4.

If you print

print 'email', email_to_indices
print 'phone', phone_to_indices

You get:

email defaultdict(, {'a@gmail.com': [0, 2],'b@gmail.com': [1, 3], 'c@gmail.com': [4]})

phone defaultdict(, {'1': [0, 1], '3': [3], '2': [2], '4': [4]})

Don't know how to get the union of those considering the transitive property. I need to get something like:

first_group: [0, 1, 2 , 3]
second_group: [4]

Thanks!

GabyLP
  • 3,649
  • 7
  • 45
  • 66
  • 1
    What should be the key of the output dict? Should it have a key for each unique email and phone, each of which references the same list of data, or should there be some sort of amalgamated key constructed of all emails and numbers that overlap? What's your expected output? – Silas Ray Aug 27 '14 at 18:16
  • It seems to me the natural course of action would be something data structure like `[ [line0, line1, line2, line3], [line4] ]` – Adam Smith Aug 27 '14 at 18:28
  • Adam, the thing is that this is an example, the real table is huge. that's why I'm writing code. – GabyLP Aug 27 '14 at 18:29
  • GabyP I'm simply proposing a data structure to hold it in, in response to @SilasRay's comment, above :) – Adam Smith Aug 27 '14 at 19:02

2 Answers2

2

Here you have a graph, or Bipartite graph to be more precise. The nodes are of two types: emails and phones. Two nodes are connected if there exists a record with that email and phone. Or we even can say that the record itself is the edge which connects two nodes.

The task is to find Connected components of this graph. By following the links you can find algorithms which can do it in linear time.

Of course some quick and dirty solutions can also be invented, and even may be deemed appropriate if your dataset is small enough.

You can find some Python implementations here: Python connected components

UPDATE: Here is an example of how you can construct the graph:

graph = {};
EMAIL = "email";
PHONE = "phone";

for rec in datos:
    graph.setdefault((EMAIL, rec[0]), set()).add((PHONE, rec[1]));
    graph.setdefault((PHONE, rec[1]), set()).add((EMAIL, rec[0]));

print "\n".join("%s: %s" % (str(node), str(linkedNodes)) for (node, linkedNodes) in graph.iteritems());

So every node has a type (EMAIL or PHONE, they can actually be just integers, for example 0 and 1, I made them strings only for nice printing) and a value. Graph is a dictionary with nodes as keys and sets of connected nodes as values.

Community
  • 1
  • 1
Anton Savin
  • 40,838
  • 8
  • 54
  • 90
  • Hi Anton, that would be a good idea, but is there a module in python to do so? and how would that be? – GabyLP Aug 27 '14 at 18:41
  • @GabyP I think no, but [quick search](https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=python%20connected%20components) gives a lot of information, including related questions on SO. – Anton Savin Aug 27 '14 at 18:44
0

This is another approach:

When you are building the email_to_indices dictionary, you can store the phone number of that row as the values and then have the phone_to_indices have the index of the row. That way we are creating a email_to_indices to phone_to_indices to index of the row map.

With that modification and basic set operations I am able to get what you want exactly:

from collections import defaultdict

email_to_indices = defaultdict(list)
phone_to_indices = defaultdict(list)
combined = defaultdict(set)

original=[['email', 'tel', 'fecha', 'descripcion', 'categ'],
          ['a@gmail.com', '1', '2014-08-06 00:00:06', 'MySpace a', 'animales'],
          ['b@gmail.com', '1', '2014-08-01 00:00:06', 'My Space a', 'ropa'],
          ['a@gmail.com', '2', '2014-08-06 00:00:06', 'My Space b', 'electronica'],
          ['b@gmail.com', '3', '2014-08-10 00:00:06', 'Myace c', 'animales'],
          ['c@gmail.com', '4', '2014-08-10 00:00:06', 'Myace c', 'animales']]


for idx, row in enumerate(original[1:], start=1):
    email = row[0].lower()
    phone = row[1]
    email_to_indices[email].append(phone) # Here is what I changed
    phone_to_indices[phone].append(idx)

random_key = 0
for idx, row in enumerate(original[1:], start=1):
    grouped_rows = []
    if row[0].lower() in email_to_indices:
        for phone_no in email_to_indices[row[0].lower()]:
            grouped_rows.extend(phone_to_indices[phone_no])

    if len(combined[random_key]) > 0 and len(set(grouped_rows).intersection(combined[random_key])) > 0:
        combined[random_key].update(set(grouped_rows))
    elif len(combined[random_key]) > 0:
        random_key += 1
        combined[random_key].update(set(grouped_rows))
    else:
        combined[random_key].update(set(grouped_rows))

print combined

This gives:

defaultdict(<type 'set'>, {0: set([1, 2, 3, 4]), 1: set([5])})
shaktimaan
  • 11,962
  • 2
  • 29
  • 33