
I have a dataframe in which each column contains ID numbers; it looks like this:

LC3B.low    LC3B.hi  P62.low    P62.hi
PT 65       PT 172   PT 86      PT 135
PT 86       PT 65    PT 38      PT 56
PT 251      PT 251   PT 217     PT 261

I want to list IDs that are seen in two or more columns. So for the values supplied, pandas would indicate that:

  • LC3B.low and LC3B.hi share "PT 65" and "PT 251"
  • LC3B.low and P62.low share "PT 86"

I'm new to Pandas and used to Perl. In Perl, I'd solve this problem by building an array and a hash for each column, then checking each element of each array against each of the other hashes, printing a header for each comparison and a line for each match, so my output would look like this:

LC3B.low vs LC3B.hi
PT 65
PT 251
LC3B.low vs P62.low
PT 86
LC3B.low vs P62.hi
LC3B.hi vs P62.low
LC3B.hi vs P62.hi
P62.low vs P62.hi
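
Roughly, a direct Python translation of that approach (just a sketch, assuming the table above is loaded into a dataframe called df) would be:

from itertools import combinations

# brute-force pairwise comparison, mirroring the Perl arrays-and-hashes idea
for col_a, col_b in combinations(df.columns, 2):
    print(col_a, 'vs', col_b)
    for val in set(df[col_a]) & set(df[col_b]):
        print(val)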

But that produces messy output, and it just feels like it's not the most efficient way to solve the problem. Surely Pandas has a built-in way of doing this kind of thing?

UPDATE: I have been trying to learn to use SQL commands to accomplish this task, but pandasql does not recognize my column names. So basically:

print pysqldf("SELECT * FROM df;")

prints the table, but

print pysqldf("SELECT ATG12.low FROM df;")

prints "None"

I am definitely not committed to using SQL in Pandas to solve this problem. I will appreciate any and all suggestions or advice on how to list all the ID values that show up in more than one column.


2 Answers


This is fairly ugly but it outputs a dataframe that will give you everything you might be looking for.

Each row in results_df gives the frequency with which the row index appears in the given column of the original dataframe.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [3, 4, 5],
                   'C': [1, 4, 8],
                   'D': [3, 7, 2]})

# every distinct value found anywhere in the dataframe
unique_vals = pd.Series(df.values.ravel()).unique()

# for each unique value, count its occurrences in every column
data_dict = {}
for i in unique_vals:
    row = []
    for v in df.columns:
        row.append((df[v] == i).sum())
    data_dict[i] = row

# one row per unique value, one column per original column
results_df = pd.DataFrame(data_dict).T
results_df.columns = df.columns

results_df

   A  B  C  D
1  1  0  1  0
2  1  0  0  1
3  1  1  0  1
4  0  1  1  0
5  0  1  0  0
7  0  0  0  1
8  0  0  1  0

(Thanks to this answer for the unique_vals line.)
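
For example, to pull out just the IDs that appear in two or more columns, you could then filter on how many columns each row of results_df touches (a small follow-up sketch):

# count, per value, the number of columns with at least one occurrence,
# and keep only the values seen in more than one column
shared = results_df[(results_df > 0).sum(axis=1) > 1]
print(shared)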


Here's a solution that should run a little faster on a large dataset, since it avoids an explicit Python for loop:

import pandas as pd

dfData = pd.DataFrame({'LC3B.low': ['PT 65', 'PT 86', 'PT 251'],
                       'LC3B.hi': ['PT 172', 'PT 65', 'PT 251'],
                       'P62.low': ['PT 86', 'PT 38', 'PT 217'],
                       'P62.hi': ['PT 135', 'PT 56', 'PT 261']})

# reshape to long format: one row per (original row, column, value)
x = dfData.stack().reset_index()
x.columns = ['A', 'Col', 'Val']
y = x.drop(['A'], axis=1)  # the original row index is not needed

# count how many stacked rows each value appears in
# (the number of columns containing it, assuming an ID appears at most once per column)
valCount = y.groupby(['Val']).count()
valCount.columns = ['ColumnCount']

# attach that count to every (column, value) pair
mergedData = pd.merge(y, valCount, left_on='Val', right_index=True)

# keep only the values seen in more than one column
output_data = mergedData[mergedData['ColumnCount'] > 1].drop(['ColumnCount'], axis=1)

print(output_data)

    Col     Val
1  LC3B.low   PT 65
4   LC3B.hi   PT 65
3   P62.low   PT 86
5  LC3B.low   PT 86
8   LC3B.hi  PT 251
9  LC3B.low  PT 251
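
If you'd rather see, for each shared ID, which columns it appears in, something along these lines works on output_data:

# group the long-format result by value to list the columns per shared ID
print(output_data.groupby('Val')['Col'].apply(list))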