I have a data frame that is 11 rows by 17604 columns. The number of rows can vary as I change my clustering.
B42D2033/26 G02B27/2214 G02F1/133753 G02F1/133707 G02F1/1341 G02F1/1339 G02F1/133371 G02B6/005 C08G73/12 G02F1/1303 ... G06F17/30035 G06F21/629 B65B3/26 E04D13/00 G06F17/30952 G07C9/00912 F02C9/28 G06F17/28 G06F17/30964 G06F21/82
Cluster
C1 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C10 0.000000 3.250000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C11 0.020619 1.149485 0.262887 0.829897 0.551546 1.030928 0.082474 1.175258 0.005155 0.216495 ... 0.005155 0.010309 0.005155 0.005155 0.005155 0.005155 0.005155 0.005155 0.005155 0.005155
C2 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C4 0.055556 13.500000 8.333333 24.555556 13.166667 26.666667 3.277778 4.222222 0.000000 2.388889 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C5 0.000000 0.750000 0.000000 0.000000 0.000000 0.500000 0.000000 0.250000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C6 0.032258 3.451613 0.000000 0.000000 0.000000 0.387097 0.000000 0.064516 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C7 0.000000 0.000000 0.250000 0.000000 0.000000 0.250000 0.000000 0.000000 0.000000 1.500000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C8 0.000000 0.076923 0.153846 0.346154 0.000000 0.884615 0.461538 0.192308 0.038462 0.076923 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
C9 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
I would like to produce a dictionary or Series for each cluster based on the values in its columns. For example, all columns where the value != 0
might look like this in dictionary form:
{'C1': ['G02B27/2214', 'G02F1/1339']}
How can I produce a series for each cluster row where the value is equal to 'some value' or a range of values?
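To make the desired output concrete, here is a minimal sketch on a toy frame. It assumes the frame is named `df_clusters` and is indexed by `Cluster`; the three rows and four columns are copied from the printout above, and the `mask` and `nonzero_cols` names are only illustrative. It shows one possible way to get the result, not necessarily the most efficient one for 17604 columns.

```python
import pandas as pd

# Toy stand-in for the real frame: a few values copied from the printout above.
df_clusters = pd.DataFrame(
    {
        "B42D2033/26": [0.0, 0.0, 0.055556],
        "G02B27/2214": [1.0, 3.25, 13.5],
        "G02F1/133753": [0.0, 0.0, 8.333333],
        "G02F1/1339": [1.0, 0.0, 26.666667],
    },
    index=pd.Index(["C1", "C10", "C4"], name="Cluster"),
)

# Boolean mask over the whole frame. Any element-wise condition works here,
# e.g. df_clusters.eq(1.0) for 'some value'.
mask = df_clusters != 0

# One list of matching column labels per cluster row.
nonzero_cols = {
    cluster: df_clusters.columns[row.to_numpy()].tolist()
    for cluster, row in mask.iterrows()
}

# Same idea for a range of values (here 1 <= value <= 5, chosen arbitrarily).
in_range = (df_clusters >= 1) & (df_clusters <= 5)
range_cols = {
    cluster: df_clusters.columns[row.to_numpy()].tolist()
    for cluster, row in in_range.iterrows()
}

print(nonzero_cols["C1"])   # ['G02B27/2214', 'G02F1/1339']
print(range_cols["C10"])    # ['G02B27/2214']
```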
I did look at "Select rows from a DataFrame based on values in a column in pandas", but that solution isn't for all columns in a row.
EDIT:
I realized that I can transpose the df and do something like:
df_clusters.T[df_clusters.T['C1']>0]
Which returns a df with every row where 'C1' is greater than 0. I suppose I could drop the other cluster columns, but I don't think this is the best solution.
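For comparison, a sketch of what might avoid the transpose and the extra cluster columns: selecting one row with `.loc` already gives a Series indexed by the column labels, which can then be filtered on its own values. The toy values for `C1` and `C5` are copied from the printout above; `c1_positive` and `series_by_cluster` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy stand-in for the real frame: two rows and four columns from the printout.
df_clusters = pd.DataFrame(
    [[0.0, 1.0, 0.0, 1.0], [0.0, 0.75, 0.0, 0.5]],
    index=pd.Index(["C1", "C5"], name="Cluster"),
    columns=["B42D2033/26", "G02B27/2214", "G02F1/133753", "G02F1/1339"],
)

# A single cluster row is a Series whose index holds the column labels.
c1 = df_clusters.loc["C1"]
c1_positive = c1[c1 > 0]

# The same filter applied to every cluster, giving one Series per row.
series_by_cluster = {
    cluster: row[row > 0] for cluster, row in df_clusters.iterrows()
}

print(c1_positive.index.tolist())  # ['G02B27/2214', 'G02F1/1339']
```

Unlike the transposed selection, each resulting Series holds only that cluster's values, so there are no other cluster columns to drop afterwards.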