
I have a data set like the following:

input file:

id  addr
301 1
301 2
301 3
301 4
302 6
302 7
302 8
302 9
302 1
303 14
303 15
303 2
304 16
304 17
304 1

and I need Python code to print all possible pair combinations of addr values that share a common id. The main test file contains millions of id and corresponding addr records, so the code should be able to read the columns from a text file. The output should be as follows (only shown for 301 and 302; the rest continues the pattern):

1   2
1   3
1   4   
2   3
2   4
3   4   
6   7
6   8
6   9
7   8
7   9   
8   9
1   6
1   7
1   8
1   9
2   6
2   7
2   8
2   9
3   6  
3   7
3   8
3   9
4   6
4   7
4   8
4   9
1   15
2   15
3   15
......
1   16
2   16
......
15  16   

So far I have done the following, but I have no idea how to code the pair combination part. I am new to Python, so I would appreciate it if someone could help me with the code and a little bit of explanation.

# coding: utf-8

# sample tested in python 3.6

import sys

if len(sys.argv) < 2:
    sys.stderr.write("Usage: {0} filename\n".format(sys.argv[0]))
    sys.exit(1)

fn = sys.argv[1]
sys.stderr.write("reading " + fn + "...\n")

# addr values collected for the id currently being read
s = []
txid_prev = None
with open(fn, "r") as fin:
    next(fin, None)  # skip the "id addr" header line
    for line in fin:
        f = line.split()
        if len(f) < 2:
            continue  # skip blank or malformed lines
        txid, addr = f[0], f[1]
        if txid == txid_prev:
            s.append(addr)
        else:
            # TODO: connect all pairs in s
            # TODO: print all pairs as edges
            s = [addr]
        txid_prev = txid
if s:
    # TODO: connect and print all edges
    pass
Rubz
  • numpy has a function loadtxt so you don't have to take care of that. https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html For a numpy solution check https://stackoverflow.com/questions/1208118/using-numpy-to-build-an-array-of-all-combinations-of-two-arrays or https://stackoverflow.com/questions/27286537/numpy-efficient-way-to-generate-combinations-from-given-ranges – Joe Aug 04 '17 at 10:00
  • Or just use `pandas` for handling your `DataFrame`s – damisan Aug 04 '17 at 10:07
  • You can generate those combinations using [`itertools.combinations`](https://docs.python.org/3/library/itertools.html#itertools.combinations) – PM 2Ring Aug 04 '17 at 10:12
  • Your code had a few lines that weren't valid Python code. They look like pseudo-code, so I turned them into comments. – PM 2Ring Aug 04 '17 at 10:14
  • There seem to be a few lines too much in your output, everything from (2,6) onward. – tobias_k Aug 04 '17 at 11:33
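
For reference, a minimal sketch of what the `itertools.combinations` call suggested in the comments produces for the four addr values of id 301 from the question. The variable names are only illustrative.

```python
from itertools import combinations

# addr values of id 301, taken from the sample input in the question
addrs = [1, 2, 3, 4]

# every unordered pair, in input order; no (x, x) self-pairs are produced
pairs = list(combinations(addrs, 2))

for a, b in pairs:
    print(a, b)
```

This yields the six pairs listed first in the desired output.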

3 Answers


How about something like this:

import pandas as pd
import io
import itertools

file="""id addr
301 1
301 2
301 3
301 4
302 6
302 7
302 8
302 9
302 1"""

df= pd.read_csv(io.StringIO(file), sep=" ")

for key,value in df.set_index("addr").groupby("id").groups.items():
    print(key)
    for item in list(itertools.combinations(value.values, 2)):
        print("{} {}".format(*item))

Prints:

301
1 2
1 3
1 4
2 3
2 4
3 4
302
6 7
6 8
6 9
6 1
7 8
7 9
7 1
8 9
8 1
9 1

Alternatively, we can put the values in a dictionary:

a = {} 

for id_,addr in df.values.tolist():
    a.setdefault(str(id_),[]).append(addr)

output = {key:list(itertools.combinations(value, 2)) for key,value in a.items()}


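def return_combos(dict_, keys):
    # collect the addr lists of the requested ids
    values = []
    for i in keys:
        values.append(dict_[i])
    # flatten, drop duplicates, and sort for a deterministic order
    values = sorted(set(i for item in values for i in item))
    return {','.join(keys):list(itertools.combinations(values, 2))}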


output2 = return_combos(a, ["301","302"])

output prints:

{'301': [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)],
 '302': [(6, 7),
  (6, 8),
  (6, 9),
  (6, 1),
  (7, 8),
  (7, 9),
  (7, 1),
  (8, 9),
  (8, 1),
  (9, 1)]} 

Meanwhile, output2 prints:

{'301,302': [(1, 2),
  (1, 3),
  (1, 4),
  (1, 6),
  (1, 7),
  (1, 8),
  (1, 9),
  (2, 3),
  (2, 4),
  (2, 6),
  (2, 7),
  (2, 8),
  (2, 9),
  (3, 4),
  (3, 6),
  (3, 7),
  (3, 8),
  (3, 9),
  (4, 6),
  (4, 7),
  (4, 8),
  (4, 9),
  (6, 7),
  (6, 8),
  (6, 9),
  (7, 8),
  (7, 9),
  (8, 9)]}
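
As a quick sanity check on the two results above, the pair counts can be recomputed directly. This is only a sketch using the sample data of this answer; `per_id`, `merged` and `across` are illustrative names.

```python
from itertools import combinations

# addr lists per id, copied from the sample data used in this answer
a = {"301": [1, 2, 3, 4], "302": [6, 7, 8, 9, 1]}

# per-id pairs (what `output` holds): C(4,2) + C(5,2) = 6 + 10
per_id = sum(len(list(combinations(v, 2))) for v in a.values())

# pairs over the merged, de-duplicated addresses (what `output2` holds): C(8,2)
merged = sorted(set(a["301"]) | set(a["302"]))
across = len(list(combinations(merged, 2)))

print(per_id, across)
```

The first number counts pairs within each id separately; the second counts pairs over the eight distinct addresses of both ids together.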

Update: is this the desired output?

import pandas as pd
import io
import itertools
from collections import OrderedDict

file="""id addr
301 1
301 2
301 3
301 4
302 6
302 7
302 8
302 9
302 1
303 14
303 12"""

df= pd.read_csv(io.StringIO(file), sep=" ")

b = OrderedDict()

for id_,addr in df.values.tolist():
    b.setdefault(str(id_),[]).append((id_,addr))

keys = list(b.keys())
pairs = [(keys[i], keys[i+1]) for i in range(len(keys)-1)]

output = {}
for pair in pairs:
    output[pair] = [[(i[0][0],i[1][0]),i[0][1],i[1][1]] for i in list(itertools.combinations(b[pair[0]]+b[pair[1]], 2))]

output    

{('301', '302'): [[(301, 301), 1, 2],
  [(301, 301), 1, 3],
  [(301, 301), 1, 4],
  [(301, 302), 1, 6],
  [(301, 302), 1, 7],
  [(301, 302), 1, 8],
  [(301, 302), 1, 9],
  [(301, 302), 1, 1],
  [(301, 301), 2, 3],
  [(301, 301), 2, 4],
  [(301, 302), 2, 6],
  [(301, 302), 2, 7],
  [(301, 302), 2, 8],
  [(301, 302), 2, 9],
  [(301, 302), 2, 1],
  [(301, 301), 3, 4],
  [(301, 302), 3, 6],
  [(301, 302), 3, 7],
  [(301, 302), 3, 8],
  [(301, 302), 3, 9],
  [(301, 302), 3, 1],
  [(301, 302), 4, 6],
  [(301, 302), 4, 7],
  [(301, 302), 4, 8],
  [(301, 302), 4, 9],
  [(301, 302), 4, 1],
  [(302, 302), 6, 7],
  [(302, 302), 6, 8],
  [(302, 302), 6, 9],
  [(302, 302), 6, 1],
  [(302, 302), 7, 8],
  [(302, 302), 7, 9],
  [(302, 302), 7, 1],
  [(302, 302), 8, 9],
  [(302, 302), 8, 1],
  [(302, 302), 9, 1]],
 ('302', '303'): [[(302, 302), 6, 7],
  [(302, 302), 6, 8],
  [(302, 302), 6, 9],
  [(302, 302), 6, 1],
  [(302, 303), 6, 14],
  [(302, 303), 6, 12],
  [(302, 302), 7, 8],
  [(302, 302), 7, 9],
  [(302, 302), 7, 1],
  [(302, 303), 7, 14],
  [(302, 303), 7, 12],
  [(302, 302), 8, 9],
  [(302, 302), 8, 1],
  [(302, 303), 8, 14],
  [(302, 303), 8, 12],
  [(302, 302), 9, 1],
  [(302, 303), 9, 14],
  [(302, 303), 9, 12],
  [(302, 303), 1, 14],
  [(302, 303), 1, 12],
  [(303, 303), 14, 12]]}
Anton vBR
  • Many thanks for your quick reply. I'll check this and will get back to you. – Rubz Aug 04 '17 at 10:50
  • some output combinations are missing. for id 301 and 302 we have 8 numbers (1,2,3,4,6,7,8,9) and in pairs, there should be 8C2=28 combinations. Please see my desired output list. thanks. – Rubz Aug 04 '17 at 17:01
  • @rubz I think you are counting it wrong: id 301 we have 4 numbers = 6 combinations, and id 302, 5 numbers = 10 combinations. Total 16 combinations. Or are you thinking of something else? – Anton vBR Aug 04 '17 at 17:21
  • compare output (16 combos) and output 2 (28 combos) - one is the combos with per id (301,302) and the other the combos with id 301+id 302 removing duplicates – Anton vBR Aug 04 '17 at 17:54
  • Output 2 is what I want . And if they are printed as column wise with respect to ID that would be super helpful. Thanks. – Rubz Aug 05 '17 at 01:42
  • Please kindly update your code according to output2 and I forgot to mention in the question that there are millions of id and corresponding addr value records in the main test file. So, the code should be able to read columns from a text file. Thanks for your continuous support. – Rubz Aug 05 '17 at 11:55
  • @Rubz I think I am seeing what you ask for. But please update your question with the desired output in its entirety. Is it a dataframe? Does it have an index, and how many columns? – Anton vBR Aug 05 '17 at 13:11
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/151150/discussion-between-rubz-and-anton-vbr). – Rubz Aug 05 '17 at 16:11
  • I have just updated my desired output. And all the paired combinations output in just two columns. it will be like an edge graph. So, I can just plot it. I wanted to put your attention again on my original input file contains millions of id and corresponding addr values. – Rubz Aug 05 '17 at 17:26
  • So the code needs to have read functions of a text file to read line by line to match ID with previous line ID. when there is a mismatch it will collect all the addr values for that ID and store paired combinations(Please see my code). So, the output file, I just desire to have the addr paired combination values (as I don't need duplication and self-loop) in two columns can be exported to a text file. Thanks. – Rubz Aug 05 '17 at 17:26
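
The line-by-line approach described in the comments above could be sketched with `itertools.groupby`, which groups consecutive lines that share an id, so only one id's addr values are held in memory at a time. This is a sketch under assumptions: the file is whitespace-separated, starts with an `id addr` header, has integer addr values, and keeps all rows of an id contiguous. `edges_per_id` and the file names are hypothetical. It produces the per-id pairs (as in `output`), without self-loops or repeated pairs within an id.

```python
import io
from itertools import combinations, groupby


def edges_per_id(lines):
    """Yield (addr_a, addr_b) pairs for each run of consecutive rows sharing an id."""
    rows = (line.split() for line in lines)
    # keep only well-formed data rows; drops blank lines and the "id addr" header
    rows = (r for r in rows if len(r) == 2 and r[0] != "id")
    for _id, group in groupby(rows, key=lambda r: r[0]):
        # a sorted set removes repeated addrs, so no self-loops or duplicate pairs
        addrs = sorted({int(r[1]) for r in group})
        yield from combinations(addrs, 2)


# small in-memory stand-in for the real file
sample = io.StringIO("id addr\n301 1\n301 2\n301 3\n\n302 6\n302 1\n")
edges = list(edges_per_id(sample))
for a, b in edges:
    print(a, b)

# with a real file (hypothetical names), stream straight to the output:
# with open("input_file.txt") as fin, open("edges.txt", "w") as fout:
#     for a, b in edges_per_id(fin):
#         fout.write("{}\t{}\n".format(a, b))
```

Because the function is a generator, the pairs can be written to the output file as they are produced instead of being collected in a list first.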

How about this (if you don't care about duplicates)? [Live Demo]

import io
import pandas as pd

data =  io.StringIO("""id addr\n301 1\n301 2\n301 3\n301 4\n302 6\n302 7\n302 8\n302 9\n302 1\n303 14\n303 15\n303 2\n304 16\n304 17\n304 1""")

df = pd.read_table(data, sep=' ', index_col=0)
print(df.join(df, rsuffix="_2"))

Output:

     addr  addr_2
id               
301     1       1
301     1       2
301     1       3
301     1       4
301     2       1
301     2       2
301     2       3
301     2       4
301     3       1
...
damisan
  • Many thanks for your quick reply. But I asked for the pairwise combinations of the four numbers, if you consider 301. So the total number of pair combinations should be 4C2 = 6. Please recheck my desired output. Your output structure looks nice! It would look great if the id numbers could be shown in the first column. Thanks – Rubz Aug 04 '17 at 10:56

This problem can be split up into two parts. First, construct a dictionary that maps identifiers to lists of addresses. Second, generate the length-2 combinations of each of these lists.

from collections import defaultdict
from itertools import combinations

# build mapping from file, reading it line by line
iden_mapping = defaultdict(list)
with open('input_file.txt') as f:
    next(f, None)  # skip the header line
    for row in f:
        fields = row.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        iden, addr = fields
        iden_mapping[iden].append(addr)

# generate combinations from address lists
for iden in sorted(iden_mapping):
    for c in combinations(iden_mapping[iden], 2):
        print(c)
Jared Goguen
  • Many thanks for your quick solution. But it can't split the row values, resulting in a traceback: line 12, in <module>: iden, addr = row.split() ValueError: not enough values to unpack (expected 2, got 0) – Rubz Aug 05 '17 at 11:12
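
The error in the comment above is what `str.split()` leads to on a blank line (for example, a trailing empty line in the file): it returns an empty list, so unpacking into two names fails. A minimal sketch of the cause and one possible guard follows; `parse_row` is a hypothetical helper, not part of the answer.

```python
def parse_row(row):
    """Return (iden, addr) for a data row, or None for a blank or malformed one."""
    fields = row.split()
    if len(fields) != 2:
        return None
    return fields[0], fields[1]


# a blank line splits into an empty list, which cannot be unpacked into two names
blank = "\n".split()

# the guard lets the loop skip such rows instead of raising ValueError
parsed = [parse_row(r) for r in ["301 1\n", "\n", "302 6\n"]]
print(blank, parsed)
```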