
I have a dataframe with binary data, and I know there are dependencies across columns. I want to remove the dependent columns and retain only the independent ones. An example input is as follows:

Test ,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
test1,0,0,0,0,0,0,0,1,1,1,1,1,0,1,1,1
test2,1,1,1,0,1,1,1,1,1,1,1,1,1,0,0,1
test3,1,1,1,0,1,1,1,1,1,1,1,1,1,0,0,1
test4,1,1,1,1,0,0,1,1,1,1,1,1,1,0,0,1
test5,1,1,1,1,0,0,1,1,1,1,1,1,1,0,0,1

Here, (A,B,C,G,M), (D), (E,F), (H,I,J,K,L,P), and (N,O) are groups of columns that have identical values, i.e. dependent columns. In the end I want to keep only the following columns:

Test ,A,D,E,H,N
test1,0,0,0,1,1
test2,1,0,1,1,0
test3,1,0,1,1,0
test4,1,1,0,1,0
test5,1,1,0,1,0

I am trying to use PCA in Python but am not able to achieve this. Can someone guide me on how to do it?

EDIT: Here is the example code I am using

import pandas as pd 
import numpy as np 
from sklearn.decomposition import PCA

df = pd.read_csv("TestInput.csv")
print(df)
pca = PCA()

# Drop the first column (the row names)
numDf = df.iloc[:,1:]
print(pca.fit(numDf))
T=pca.transform(numDf)

print("Number of unique columns are:", T.shape[1])
print(np.cumsum(pca.explained_variance_ratio_))

Thanks.

Rachit Agrawal
  • You need not use PCA for this, because the values clearly indicate that the columns are very similar. Why not compare the values in the columns and drop the duplicate ones? – Anand C U Nov 01 '17 at 05:08
  • @AnandCU In this example the number of columns is limited, but in my original problem I have 100000 columns and 100000 rows. So doing a similarity test on such a big dataframe will take time. – Rachit Agrawal Nov 01 '17 at 05:14
  • Have you tried PCA? Where are you stuck? Also check this answer out https://stackoverflow.com/a/14985695/5026636 . Give it a try. – Anand C U Nov 01 '17 at 05:20
  • @AnandCU Added the code I am using. – Rachit Agrawal Nov 01 '17 at 05:50

1 Answer


Converting this comment into an answer: find and drop duplicate columns with drop_duplicates.

df = df.set_index('Test')
df.T.drop_duplicates(keep='first').T

       A  D  E  H  N
Test                
test1  0  0  0  1  1
test2  1  0  1  1  0
test3  1  0  1  1  0
test4  1  1  0  1  0
test5  1  1  0  1  0
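A note on scale: since the question mentions ~100000 columns, transposing the whole frame just to call drop_duplicates can be expensive. A sketch of an alternative (my own suggestion, not from the answer above) is to hash each column's raw bytes and keep only the first column of each identical group, which avoids materializing the transpose:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {
        "Test": ["test1", "test2", "test3", "test4", "test5"],
        "A": [0, 1, 1, 1, 1], "B": [0, 1, 1, 1, 1], "C": [0, 1, 1, 1, 1],
        "D": [0, 0, 0, 1, 1], "E": [0, 1, 1, 0, 0], "F": [0, 1, 1, 0, 0],
        "G": [0, 1, 1, 1, 1], "H": [1, 1, 1, 1, 1], "I": [1, 1, 1, 1, 1],
        "J": [1, 1, 1, 1, 1], "K": [1, 1, 1, 1, 1], "L": [1, 1, 1, 1, 1],
        "M": [0, 1, 1, 1, 1], "N": [1, 0, 0, 0, 0], "O": [1, 0, 0, 0, 0],
        "P": [1, 1, 1, 1, 1],
    }
).set_index("Test")

# Keep the first column of each group of byte-identical columns.
# tobytes() on same-dtype columns is an exact comparison, not a lossy hash.
seen = set()
keep = []
for col in df.columns:
    key = df[col].to_numpy().tobytes()
    if key not in seen:
        seen.add(key)
        keep.append(col)

result = df[keep]
print(result)   # columns A, D, E, H, N remain
```

This makes one O(n_rows) pass per column and never copies the full matrix, so memory stays proportional to one column at a time plus the set of keys.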
cs95