
I have a data frame that is 11 rows by 17604 columns. The number of rows can vary as I change my clustering.

        B42D2033/26 G02B27/2214 G02F1/133753    G02F1/133707    G02F1/1341  G02F1/1339  G02F1/133371    G02B6/005   C08G73/12   G02F1/1303  ... G06F17/30035    G06F21/629  B65B3/26    E04D13/00   G06F17/30952    G07C9/00912 F02C9/28    G06F17/28   G06F17/30964    G06F21/82
    Cluster
    C1  0.000000    1.000000    0.000000    0.000000    0.000000    1.000000    0.000000    0.000000    0.000000    0.000000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
    C10 0.000000    3.250000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
    C11 0.020619    1.149485    0.262887    0.829897    0.551546    1.030928    0.082474    1.175258    0.005155    0.216495    ... 0.005155    0.010309    0.005155    0.005155    0.005155    0.005155    0.005155    0.005155    0.005155    0.005155
    C2  0.000000    1.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
    C3  0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
    C4  0.055556    13.500000   8.333333    24.555556   13.166667   26.666667   3.277778    4.222222    0.000000    2.388889    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
    C5  0.000000    0.750000    0.000000    0.000000    0.000000    0.500000    0.000000    0.250000    0.000000    0.000000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
    C6  0.032258    3.451613    0.000000    0.000000    0.000000    0.387097    0.000000    0.064516    0.000000    0.000000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
    C7  0.000000    0.000000    0.250000    0.000000    0.000000    0.250000    0.000000    0.000000    0.000000    1.500000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
    C8  0.000000    0.076923    0.153846    0.346154    0.000000    0.884615    0.461538    0.192308    0.038462    0.076923    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
    C9  0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    ... 0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000

I would like to produce a dictionary or Series for each cluster based on the values in the columns. For example, collecting all columns where the value != 0 might look, in dictionary form, like:

    {'C1': ['G02B27/2214', 'G02F1/1339']}

How can I produce a series for each cluster row where the value is equal to 'some value' or a range of values?

I did look at Select rows from a DataFrame based on values in a column in pandas, but that solution filters on a single column rather than across all columns in a row.

EDIT: I realized that I can transpose the df and do something like:

    df_clusters.T[df_clusters.T['C1'] > 0]

This returns a DataFrame with every row where 'C1' is greater than 0. I suppose I could then drop the other cluster columns, but I don't think this is the best solution.
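For what it's worth, the transpose idea extends to a range of values without dropping the other cluster columns, by selecting the cluster's column in the same `.loc` call. A minimal sketch on a made-up two-column stand-in for the real 11 x 17604 frame (labels borrowed from the table above):

```python
import pandas as pd

# Tiny stand-in for the real frame; only two of the 17604 columns
df_clusters = pd.DataFrame(
    {'G02B27/2214': [1.0, 3.25], 'G02F1/1339': [1.0, 0.0]},
    index=pd.Index(['C1', 'C10'], name='Cluster'))

s = df_clusters.T                          # rows are now the patent codes

# One cluster, any non-zero value
c1 = s.loc[s['C1'] > 0, 'C1']
print(c1.index.tolist())                   # ['G02B27/2214', 'G02F1/1339']

# The same pattern handles a range of values via Series.between
in_range = s.loc[s['C10'].between(3.0, 4.0), 'C10']
print(in_range.index.tolist())             # ['G02B27/2214']
```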

Britt

2 Answers


The idea is to use `np.where` to get the row and column positions of the values matching the condition, build a new DataFrame from the corresponding labels, collect the column labels into a list per cluster, and convert to a dict:

    import numpy as np
    import pandas as pd

    i, c = np.where(df > 0)
    d = (pd.DataFrame({'a': df.index[i], 'b': df.columns[c]})
           .groupby('a')['b'].apply(list).to_dict())
    print(d)
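As a sanity check, here is a self-contained runnable version on a two-row toy frame (the labels are just placeholders for the real columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'G02B27/2214': [1.0, 0.0], 'G02F1/1339': [1.0, 0.5]},
                  index=pd.Index(['C1', 'C2'], name='Cluster'))

# i: row positions, c: column positions of all values > 0
i, c = np.where(df > 0)
d = (pd.DataFrame({'a': df.index[i], 'b': df.columns[c]})
       .groupby('a')['b'].apply(list).to_dict())
print(d)   # {'C1': ['G02B27/2214', 'G02F1/1339'], 'C2': ['G02F1/1339']}
```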

Another solution is to use DataFrame.stack or DataFrame.melt to reshape, filter with boolean indexing or DataFrame.query, and finally build the lists for the dict:

    s = df.stack()
    d = s[s > 0].reset_index().groupby('Cluster')['level_1'].apply(list).to_dict()

    d = (df.reset_index()
           .melt('Cluster', value_name='v1', var_name='v2')
           .query('v1 > 0')
           .groupby('Cluster')['v2']
           .apply(list)
           .to_dict())
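Both reshape variants produce the same dictionary. A small self-contained check on toy data (hypothetical labels standing in for the real columns):

```python
import pandas as pd

df = pd.DataFrame({'G02B27/2214': [1.0, 0.0], 'G02F1/1339': [1.0, 0.5]},
                  index=pd.Index(['C1', 'C2'], name='Cluster'))

# stack: long Series with a (Cluster, column) MultiIndex; the unnamed
# column level becomes 'level_1' after reset_index
s = df.stack()
d1 = s[s > 0].reset_index().groupby('Cluster')['level_1'].apply(list).to_dict()

# melt: explicit value/var names keep the filtered column numeric in query
d2 = (df.reset_index()
        .melt('Cluster', value_name='v1', var_name='v2')
        .query('v1 > 0')
        .groupby('Cluster')['v2']
        .apply(list)
        .to_dict())

print(d1 == d2)   # True
```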
jezrael
  • The first two solutions work great, but the last returns a typeerror (TypeError: '>' not supported between instances of 'str' and 'int') for the line `.query('value > 0')` – Britt Apr 02 '19 at 13:49
  • @Britt - tested in pandas 0.24.2, but is possible explicitely set new columns in `melt` function, edited answer. – jezrael Apr 02 '19 at 13:51
  • 1
    updated code for solution 3 works now. Thank you @Jezrael – Britt Apr 02 '19 at 13:52

Try:

    df.apply(lambda x: df.columns[x > 0].tolist(), axis=1).to_dict()
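This relies on every column being numeric, so that `x > 0` yields a boolean mask exactly as long as `df.columns`; a length mismatch (as in the IndexError reported below) suggests the frame the lambda compares against has extra columns. A minimal sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'G02B27/2214': [1.0, 0.0], 'G02F1/1339': [1.0, 0.5]},
                  index=pd.Index(['C1', 'C2'], name='Cluster'))

# For each row, keep the labels of columns whose value is positive
d = df.apply(lambda x: df.columns[x > 0].tolist(), axis=1).to_dict()
print(d)   # {'C1': ['G02B27/2214', 'G02F1/1339'], 'C2': ['G02F1/1339']}
```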
sentence
  • I am using pandas version 0.24.2 and this line of code results in an error: `IndexError: ('boolean index did not match indexed array along dimension 0; dimension is 17607 but corresponding boolean dimension is 17604', 'occurred at index C1')` – Britt Apr 02 '19 at 13:57
  • Unfortunately I do not have enough information about your dataframe. Are all values float? – sentence Apr 02 '19 at 14:20