Eliminate duplicates in dictionary Python

Question

I have a csv file separated by tabs:

I need only to focus in the two first columns and find, for example, if the pair A-B appears in the document again as B-A and print A-B if the B-A appears. The same for the rest of pairs.

For the example proposed the output is: · A-B & C-D

    dic ={}
    import sys
    import os
    import pandas as pd
    import numpy as np
    import csv

    colnames = ['col1', 'col2', 'col3', 'col4', 'col5']

    data = pd.read_csv('koko.csv', names=colnames, delimiter='\t')

    col1 = data.col1.tolist()
    col2 = data.col2.tolist()

    dataset = list(zip(col1,col2))
    for a,b in dataset:
        if (a,b) and (b,a) in dataset:
        dic [a] = b
print (dic)

output = {'A': 'B', 'B': 'A', 'D': 'C', 'C':'D'}

How can I avoid duplicated (or swapped) results in the dictionary?

If one of the below solutions helped, please consider accepting it (green tick on the left), so other users know. Or, of course, feel free to ask for clarification. — jpp, Mar 17 '18 at 03:38

score 0 · Answer 1 · answered Mar 05 '18 at 20:44

0

Does this work?:

import pandas as pd
import numpy as np

col_1 = ['A', 'B', 'C', 'B', 'D']
col_2 = ['B', 'C', 'D', 'A', 'C']

df = pd.DataFrame(np.column_stack([col_1,col_2]), columns = ['Col1', 'Col2'])

df['combined'] = list(zip(df['Col1'], df['Col2']))

final_set = set(tuple(sorted(t)) for t in df['combined'])

final_set looks like this:

 {('C', 'D'), ('A', 'B'), ('B', 'C')}

The output contains more than A-B and C-D because of the second row that has B-C

answered Mar 05 '18 at 20:44

Connor John

433
2
8

But B-C is not a possible answer because there is no C-B – primemg Mar 05 '18 at 22:33
What is the expected behavior for the second row then? it has B for Col1 and C for Col2. I must be misinterpreting the problem – Connor John Mar 06 '18 at 00:05

score 0 · Answer 2 · answered Mar 05 '18 at 21:03

0

The below should work,

example df used:

df = pd.DataFrame({'Col1' : ['A','C','D','B','D','A'], 'Col2' : ['B','D','C','A','C','B']})

This is the function I used:

 temp = df[['Col1','Col2']].apply(lambda row: sorted(row), axis = 1)
 print(temp[['Col1','Col2']].drop_duplicates())

useful links:

checking if a string is in alphabetical order in python

Difference between map, applymap and apply methods in Pandas

answered Mar 05 '18 at 21:03

RK1

2,384
1
19
36

1

Ah, sorry wasn't clear that the output should be limited to A-B and C-D, as the example given only contains A-B and C-D entries... and states "The same for the rest of pairs." – RK1 Mar 06 '18 at 08:38

score 0 · Answer 3 · answered Mar 06 '18 at 00:15

Here is one way.

df = pd.DataFrame({'Col1' : ['A','C','D','B','D','A','E'],
                   'Col2' : ['B','D','C','A','C','B','F']})

df = df.drop_duplicates()\
       .apply(sorted, axis=1)\
       .loc[df.duplicated(subset=['Col1', 'Col2'], keep=False)]\
       .drop_duplicates()

#   Col1 Col2
# 0    A    B
# 1    C    D

Explanation

The steps are:

Remove duplicate rows.
Sort dataframe by row.
Remove unique rows by keeping only duplicates.
Remove duplicate rows again.

Eliminate duplicates in dictionary Python

3 Answers3