
I have the following dataframe, a topic-document probability matrix:

    0             1         2             3         4       ...             77            78            79            80            81
1  0.0  9.941665e-23  0.001141  6.837607e-04  0.010396      ...       0.000071  6.475626e-10  1.641026e-02  2.494897e-08  2.017094e-02
2  1.0  2.735043e-03  0.004329  1.915713e-20  0.000202      ...       0.005399  1.367521e-02  1.816478e-12  1.641023e-02  1.366020e-10

where column 0, with values 0.0 and 1.0, holds the index for topics 1 and 2 respectively. The dataframe has 81 columns and 2 rows. I want to sum each column and get another dataframe. For example, for column 1 the output would be 0.002735042735040934 + 1.7996105239810978e-15, and likewise for every other column. I used

col_list = list(df)
df = df[col_list].sum(axis=0)

but it is only printing

1      0.0027350427350409341.7996105239810978e-15
2          0.0054700854694576.284676740939513e-13

which is not the output I want. What is the correct way to do it? After sorting the values in each column in descending order, I want to output the topic rank for each document in the following format.

   id      topic-rank
    1          1, 0
    2          1, 0
    3          0, 1
    4          0, 1
        ...
    80         0, 1
    81         1, 0

What is the appropriate way to do that?

Samuel Mideksa

1 Answer


The problem is that the values are strings, so you first need to convert them to floats:

s = df.astype(float).sum()
print (s)
1     0.002735
2     0.005470
80    0.016410
81    0.020171
dtype: float64
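
To make the diagnosis concrete, here is a minimal sketch that reproduces the symptom with the two columns printed in the question. The frame below is a hypothetical reconstruction (the `dtype=object` argument is an assumption used to force string storage), not the asker's actual data.

```python
import pandas as pd

# Hypothetical reconstruction: the cells hold strings, not floats.
df = pd.DataFrame(
    {
        1: ["0.002735042735040934", "1.7996105239810978e-15"],
        2: ["0.005470085469457", "6.284676740939513e-13"],
    },
    index=[1, 2],
    dtype=object,
)

# Non-numeric dtypes are the hint that sum() will not add numbers.
print(df.dtypes)

# On string cells, sum() joins the strings end to end.
concatenated = df.sum()
print(concatenated)

# After conversion, sum() performs real addition per column.
numeric = df.astype(float).sum()
print(numeric)
```

If some cells might not parse as numbers, `pd.to_numeric(..., errors="coerce")` applied per column is a possible alternative to `astype(float)`.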

EDIT: Use DataFrame.div for division:

df = df.astype(float)

df1 = df.div(df.sum())
print (df1)
              1             2        80            81
1  1.000000e+00  1.000000e+00  0.998241  4.151430e-10
2  6.579826e-13  1.148917e-10  0.001759  1.000000e+00
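
For the ranking the question asks about, one possible reading is that each document (column) should list the topic labels ordered from highest to lowest normalized value. The sketch below assumes that reading, uses made-up numbers, and labels the topics 0 and 1 through a hypothetical index; the intended output format was not confirmed in this thread.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows are topics 0 and 1, columns are document ids.
df = pd.DataFrame(
    {1: [0.2, 0.6], 2: [0.9, 0.3], 3: [0.05, 0.15]},
    index=[0, 1],
)

# Normalize each column by its sum, as in the answer.
df1 = df.div(df.sum())

# Row positions per column, ordered from largest to smallest value.
order = np.argsort(-df1.to_numpy(), axis=0)

# Map row positions back to the topic labels stored in the index.
labels = df1.index.to_numpy()[order]

# One "topic, topic" string per document.
topic_rank = pd.Series(
    [", ".join(map(str, col)) for col in labels.T],
    index=df1.columns,
    name="topic-rank",
)
topic_rank.index.name = "id"
print(topic_rank)
```

Using `argsort` on the negated values keeps the topic labels, whereas sorting the values themselves discards which topic each value came from.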
jezrael
  • Also, I want to divide each value in the original dataframe by the summed value in the new dataframe. For column one, that gives `0.002735042735040934` / `0.002735` as the first row and `1.7996105239810978e-15` / `0.002735` as the second row, and so on for each column. Can you include this in your answer? – Samuel Mideksa Jan 29 '19 at 08:13
  • Also, I want to sort the values in the resulting dataframe, i.e. after the division, in descending order just to rank them. Can you also include this in your answer? – Samuel Mideksa Jan 29 '19 at 08:45
  • @SamuelMideksa - hmmm, can you specify sorting? Do you need first value (`1,1`) - `1.000` and last `(2,81)` - `6.579826e-13` ? Or need sort by some column? Or some row? – jezrael Jan 29 '19 at 08:48
  • I mean sorting values in each column for example for column 1 I want to sort values `1.000000e+00` and `6.579826e-13` in descending order and for rest of columns also(ranking the values). – Samuel Mideksa Jan 29 '19 at 08:52
  • Do you need `df2 = pd.DataFrame(-np.sort(-df1, axis=0), columns=df.columns, index=df.index)` ? – jezrael Jan 29 '19 at 08:58
  • I improved the question. Can you take a look at it? – Samuel Mideksa Jan 30 '19 at 09:16
  • @SamuelMideksa - Please create new question, also `3 0, 1` is correct? – jezrael Jan 30 '19 at 09:35
  • @SamuelMideksa - sorry, I don't understand your output. Can you explain it in the new question too? If the values are sorted, how does the rank work? – jezrael Jan 30 '19 at 09:38
  • Ok here is the new question https://stackoverflow.com/questions/54437769/how-to-rank-values-in-a-dataframe-with-indexes – Samuel Mideksa Jan 30 '19 at 10:04