0

I have a csv file separated by tabs:

enter image description here

I need only to focus in the two first columns and find, for example, if the pair A-B appears in the document again as B-A and print A-B if the B-A appears. The same for the rest of pairs.

For the example proposed the output is: · A-B & C-D

    dic ={}
    import sys
    import os
    import pandas as pd
    import numpy as np
    import csv

    colnames = ['col1', 'col2', 'col3', 'col4', 'col5']

    data = pd.read_csv('koko.csv', names=colnames, delimiter='\t')

    col1 = data.col1.tolist()
    col2 = data.col2.tolist()

    dataset = list(zip(col1,col2))
    for a,b in dataset:
        if (a,b) and (b,a) in dataset:
        dic [a] = b
print (dic)

output = {'A': 'B', 'B': 'A', 'D': 'C', 'C':'D'}

How can I avoid duplicated (or swapped) results in the dictionary?

primemg
  • 3
  • 3
  • If one of the below solutions helped, please consider accepting it (green tick on the left), so other users know. Or, of course, feel free to ask for clarification. – jpp Mar 17 '18 at 03:38

3 Answers3

0

Does this work?:

import pandas as pd
import numpy as np

col_1 = ['A', 'B', 'C', 'B', 'D']
col_2 = ['B', 'C', 'D', 'A', 'C']

df = pd.DataFrame(np.column_stack([col_1,col_2]), columns = ['Col1', 'Col2'])

df['combined'] = list(zip(df['Col1'], df['Col2']))

final_set = set(tuple(sorted(t)) for t in df['combined'])

final_set looks like this:

 {('C', 'D'), ('A', 'B'), ('B', 'C')}

The output contains more than A-B and C-D because of the second row that has B-C

Connor John
  • 433
  • 2
  • 8
0

The below should work,

example df used:

df = pd.DataFrame({'Col1' : ['A','C','D','B','D','A'], 'Col2' : ['B','D','C','A','C','B']})

This is the function I used:

 temp = df[['Col1','Col2']].apply(lambda row: sorted(row), axis = 1)
 print(temp[['Col1','Col2']].drop_duplicates())

useful links:

checking if a string is in alphabetical order in python

Difference between map, applymap and apply methods in Pandas

RK1
  • 2,384
  • 1
  • 19
  • 36
  • 1
    Ah, sorry wasn't clear that the output should be limited to A-B and C-D, as the example given only contains A-B and C-D entries... and states "The same for the rest of pairs." – RK1 Mar 06 '18 at 08:38
0

Here is one way.

df = pd.DataFrame({'Col1' : ['A','C','D','B','D','A','E'],
                   'Col2' : ['B','D','C','A','C','B','F']})

df = df.drop_duplicates()\
       .apply(sorted, axis=1)\
       .loc[df.duplicated(subset=['Col1', 'Col2'], keep=False)]\
       .drop_duplicates()

#   Col1 Col2
# 0    A    B
# 1    C    D

Explanation

The steps are:

  1. Remove duplicate rows.
  2. Sort dataframe by row.
  3. Remove unique rows by keeping only duplicates.
  4. Remove duplicate rows again.
jpp
  • 159,742
  • 34
  • 281
  • 339