-2

I have a this data set as sample:

df = pd.DataFrame({'CL1':['A B C','C A N']},
                columns=['CL1','CL2','CL3','CL4']) 
  
     CL1  CL2  CL3  CL4  
0  A B C  NaN  NaN  NaN  
1  C A N  NaN  NaN  NaN  
 
         

My Goal:Finding of most repetition of words combination in data frame with following steps.

    1. Make a separation of each value with (,) as separator and add in column CL2:
     CL1     CL2     CL3  CL4 
0  'A B C'  'A,B,C'  NaN  NaN 
1  'C A N'  'C,A,N'  NaN  NaN 

    1. Separation of value in columns CL2 in column CL3:
     CL1     CL2     CL3          CL4 
0  'A B C'  'A,B,C'  'A','B','C'  NaN 
1  'C A N'  'C,A,N'  'C','A','N'  NaN 

     
    1. Union (set theory from statistic) of column CL4
     CL1     CL2     CL3          CL4 
0  'A B C'  'A,B,C'  'A','B','C'  [ [A],[B],[C],[A,B],[A,C],[B,C],[A,B,C] ] 
1  'C A N'  'C,A,N'  'C','A','N'  [ [C],[A],[N],[A,C],[C,N],[A,N],[C,A,N] ] 
       
    1. Finding of the repetition of each value of column CL4 in new column CL5 in new data frame and add to Count:
     CL5      Count   
0    [A]       2
1    [B]       1
2    [C]       2
3    [D]       1
4    [N]       1
5    [A,B]     1
etc..

Jsmoka
  • 59
  • 10
  • Seems you know what you want. It is a multi-step process as you think of it. What did you DO to get there? [mre]? Where are you stuck? – Patrick Artner Feb 05 '21 at 08:56

1 Answers1

2

You can use split by values by spacem then call custom function for all combinations and for counts use Series.explode with Series.value_counts:

df = pd.DataFrame({'CL1':['A B C','C A N','D E F','F X G']},
                         columns=['CL1','CL2','CL3','CL4']) 


#https://stackoverflow.com/a/5898031/2901002
from itertools import chain, combinations
def all_subsets(ss):
    return chain(*map(lambda x: combinations(ss, x), range(1, len(ss)+1)))    

df = (df['CL1'].apply(lambda x: list(all_subsets(x.split())))
               .explode()
               .value_counts()
               .rename_axis('CL5')
               .reset_index(name='count'))
print (df.head(10))
         CL5  count
0       (C,)      2
1       (F,)      2
2       (A,)      2
3     (E, F)      1
4     (F, G)      1
5     (A, B)      1
6     (C, A)      1
7     (A, C)      1
8  (F, X, G)      1
9       (D,)      1

df['CL5'] = df['CL5'].apply(list)
print (df.head(10))
         CL5  count
0        [C]      2
1        [F]      2
2        [A]      2
3     [E, F]      1
4     [F, G]      1
5     [A, B]      1
6     [C, A]      1
7     [A, C]      1
8  [F, X, G]      1
9        [D]      1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • But my problem is Step3 and 3. With `str.split()` I cant seprate the stings from each other – Jsmoka Feb 05 '21 at 09:03
  • @Jsmoka - Not understand, in your solution is repalced space by `,` and then split by `,`, in my solution are split by space. It is not better? Or there is some reason for replace ? – jezrael Feb 05 '21 at 09:05
  • The reason is that I became the DataSet from sql so: e.g. ('John,Do') and sometime ('John Do') – Jsmoka Feb 05 '21 at 09:14
  • 1
    yes, I say `'John Do'.replace(' ', ',').split(',')` is same like `'John Do'.split()` – jezrael Feb 05 '21 at 09:15
  • @Jsmoka - If need dont split by space is possible change data sample for explain it, what strings cannot be splitted? – jezrael Feb 05 '21 at 09:16
  • i receive the error: 'Series' object is not callable :(((( – Jsmoka Feb 05 '21 at 09:31
  • @Jsmoka - What part of code return error? – jezrael Feb 05 '21 at 09:33
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/228297/discussion-between-jsmoka-and-jezrael). – Jsmoka Feb 05 '21 at 09:34