I have a dataframe with binary data, and I know there are dependencies across columns. I want to remove the dependent columns and keep only the independent ones. An example input is as follows:
Test,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
test1,0,0,0,0,0,0,0,1,1,1,1,1,0,1,1,1
test2,1,1,1,0,1,1,1,1,1,1,1,1,1,0,0,1
test3,1,1,1,0,1,1,1,1,1,1,1,1,1,0,0,1
test4,1,1,1,1,0,0,1,1,1,1,1,1,1,0,0,1
test5,1,1,1,1,0,0,1,1,1,1,1,1,1,0,0,1
Here the columns fall into the groups (A,B,C,G,M), (D), (E,F), (H,I,J,K,L,P) and (N,O). The columns within each group have identical values, so I consider them dependent. I want to keep one column from each group, which gives the following:
Test,A,D,E,H,N
test1,0,0,0,1,1
test2,1,0,1,1,0
test3,1,0,1,1,0
test4,1,1,0,1,0
test5,1,1,0,1,0
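To make the grouping concrete, here is a minimal sketch of what I mean by "dependent" (identical) columns, using plain pandas rather than PCA. It assumes the example input above is pasted inline instead of read from a file; the names `csv_text`, `groups`, `is_repeat` and `reduced` are only for illustration.

```python
import io
import pandas as pd

# Example input from above, inlined so the snippet is self-contained.
csv_text = """Test,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P
test1,0,0,0,0,0,0,0,1,1,1,1,1,0,1,1,1
test2,1,1,1,0,1,1,1,1,1,1,1,1,1,0,0,1
test3,1,1,1,0,1,1,1,1,1,1,1,1,1,0,0,1
test4,1,1,1,1,0,0,1,1,1,1,1,1,1,0,0,1
test5,1,1,1,1,0,0,1,1,1,1,1,1,1,0,0,1
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Group column names by their values: identical columns share one key.
groups = {}
for col in df.columns:
    groups.setdefault(tuple(df[col]), []).append(col)

# Transpose so columns become rows, flag every repeat of an earlier row,
# and keep only the first column of each group.
is_repeat = df.T.duplicated(keep="first")
reduced = df.loc[:, ~is_repeat]

print(list(groups.values()))
print(reduced)
```

On the example data this keeps A, D, E, H and N, which matches the output I am after.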
I am trying to use PCA in Python but have not been able to achieve this. Can someone guide me on how to do it?
EDIT: Here is the example code I am using:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
df = pd.read_csv("TestInput.csv")
print(df)
pca = PCA()
# Drop the first column (the row names) and keep only the numeric data
numDf = df.iloc[:,1:]
print(pca.fit(numDf))
T=pca.transform(numDf)
print("Number of unique columns are:", T.shape[1])
print(np.cumsum(pca.explained_variance_ratio_))
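One thing that may explain the mismatch: PCA works with linear relationships, which is not the same as columns being identical. The sketch below is only a sanity check under that assumption. It types the example matrix in by hand (rows test1 to test5, columns A to P) and compares the matrix rank with the number of distinct columns; the names `X`, `rank` and `n_distinct` are illustrative.

```python
import numpy as np

# Rows test1..test5, columns A..P, copied from the example input.
X = np.array([
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
])

# Number of linearly independent columns.
rank = np.linalg.matrix_rank(X)

# Number of distinct columns (what the desired output is based on).
n_distinct = np.unique(X, axis=1).shape[1]

print("rank:", rank)
print("distinct columns:", n_distinct)
```

For this data the rank is 3 while there are 5 distinct columns, because A = D + E and N = H - A. So a method based on linear dependence would treat A and N as redundant even though they are not duplicates of any other column.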
Thanks.