I have this sample data set:
import pandas as pd

df = pd.DataFrame({'CL1': ['A B C', 'C A N']},
                  columns=['CL1', 'CL2', 'CL3', 'CL4'])
     CL1  CL2  CL3  CL4
0  A B C  NaN  NaN  NaN
1  C A N  NaN  NaN  NaN
My goal: find the most frequently repeated word combinations in the data frame, using the following steps.
- Separate the words of each value with a comma (,) and store the result in column CL2:
   CL1      CL2      CL3  CL4
0  'A B C'  'A,B,C'  NaN  NaN
1  'C A N'  'C,A,N'  NaN  NaN
- Split the values in column CL2 into separate items and store them in column CL3:
   CL1      CL2      CL3          CL4
0  'A B C'  'A,B,C'  'A','B','C'  NaN
1  'C A N'  'C,A,N'  'C','A','N'  NaN
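For illustration, the first two steps could be sketched roughly as follows. This is only a sketch under the assumption that the words in CL1 are always separated by single spaces; it uses the pandas string methods `str.replace` and `str.split`, and CL3 ends up holding a Python list per row rather than the quoted text shown in the table.

```python
import pandas as pd

df = pd.DataFrame({'CL1': ['A B C', 'C A N']},
                  columns=['CL1', 'CL2', 'CL3', 'CL4'])

# Step 1: replace the spaces in CL1 with commas and store the result in CL2.
# Assumption: words are separated by exactly one space.
df['CL2'] = df['CL1'].str.replace(' ', ',', regex=False)

# Step 2: split the comma-separated string into a list of words in CL3.
df['CL3'] = df['CL2'].str.split(',')

print(df[['CL1', 'CL2', 'CL3']])
```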
- Build every combination of the items in each row (what I would call the union, in set-theory terms) and store it in column CL4:
   CL1      CL2      CL3          CL4
0  'A B C'  'A,B,C'  'A','B','C'  [ [A],[B],[C],[A,B],[A,C],[B,C],[A,B,C] ]
1  'C A N'  'C,A,N'  'C','A','N'  [ [C],[A],[N],[A,C],[C,N],[A,N],[C,A,N] ]
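To make the CL4 step concrete, here is a minimal sketch of what I mean, using `itertools.combinations` from the standard library. The helper name `all_combinations` is made up for this example, and sorting the words first is my assumption, so that [C,A] and [A,C] count as the same combination.

```python
from itertools import combinations

def all_combinations(words):
    """Return every non-empty combination of the given words, shortest first."""
    # Assumption: word order does not matter, so sort before combining.
    ordered = sorted(words)
    return [list(combo)
            for size in range(1, len(ordered) + 1)
            for combo in combinations(ordered, size)]

print(all_combinations(['A', 'B', 'C']))
# [['A'], ['B'], ['C'], ['A', 'B'], ['A', 'C'], ['B', 'C'], ['A', 'B', 'C']]
```

If CL3 holds a list of words per row, something like `df['CL3'].apply(all_combinations)` should fill CL4.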
- Count how often each value of column CL4 occurs, listing the values in a new column CL5 of a new data frame and the number of occurrences in a column Count:
   CL5    Count
0  [A]    2
1  [B]    1
2  [C]    2
3  [N]    1
4  [A,B]  1
etc.
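Putting the steps together, the result I am after could be produced roughly like this. It is a sketch, not a finished solution: it assumes space-separated words with no repeats within a row, stores each combination as a comma-joined string (lists are not hashable, so they cannot be counted directly), and the helper name `all_combinations` is made up for the example.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'CL1': ['A B C', 'C A N']})

def all_combinations(text):
    """Return every non-empty word combination of text as comma-joined strings."""
    # Assumption: word order does not matter, so sort before combining.
    words = sorted(text.split())
    return [','.join(combo)
            for size in range(1, len(words) + 1)
            for combo in combinations(words, size)]

counts = (
    df['CL1'].apply(all_combinations)  # one list of combinations per row
    .explode()                         # one combination per row
    .value_counts()                    # occurrences, most frequent first
    .rename_axis('CL5')
    .reset_index(name='Count')
)
print(counts)
```

The most repeated combinations are then the first rows of `counts`.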