
I have the following dataframe holding a topic-document probability matrix, with the first row being the names of the text files.

                       1                      2            ...                               80                      81
0                778.txt                856.txt           ...                          831.txt                 850.txt
1   0.002735042735042732  0.0054700854700846634           ...              0.01641025640567632  4.2490294446698094e-09
2  2.146512500161246e-28  8.006312700113502e-16           ...            4.580074538571013e-12     0.02017093592191074

where column 0, with values (0.0, 1.0), represents the index for topics 1 and 2 respectively. After sorting each column (descending):

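import numpy as np
import pandas as pd

def rank_topics_by_probability(self, df):
    # np.sort is ascending, so negate before and after to sort each column descending
    df = df.astype(float)
    df2 = pd.DataFrame(-np.sort(-df, axis=0), columns=df.columns, index=df.index)
    return df2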

I got the following output

     0             1         2             3         4       ...             77            78            79            80            81
1  1.0  2.735043e-03  0.004329  6.837607e-04  0.010396      ...       0.005399  1.367521e-02  1.641026e-02  1.641023e-02  2.017094e-02
2  0.0  9.941665e-23  0.001141  1.915713e-20  0.000202      ...       0.000071  6.475626e-10  1.816478e-12  2.494897e-08  1.366020e-10

I want to display a topic-document rank matrix for each document, such as:

     id      topic-rank
    778        1, 0
    856        1, 0
    835        0, 1
    786        0, 1
        ...
    831        0, 1
    850        1, 0

For the document with id 1, I assigned 1, 0 because the probability of topic 2 is greater than that of topic 1, and so on. What is the way to do that?

Sample data for the edited question (these are only the head() values of the dataframe):

      id                                               text
0  15623  Y:\n1. Ran preliminary experiments to set para...
1  15625  Scrum Minutes- Hersheys\nPresent: Eyob, Masres...
2  15627  Present: Eyob, Masresha,  Zelalem\nhersheys:\n...
3  15628  **********************************************...
4  15629  Scrum Minutes- Hersheys\nPresent: Eyob, Masres...
Samuel Mideksa

1 Answer

Use argsort on the negated values to get the positions in descending order, then build the result with the DataFrame constructor:

#create index by first column and transpose
df2 = df.set_index(0).T

arr = df2.columns.values[(-df2.values).argsort()]
df2 = pd.DataFrame({'id': df2.index, 
                    'score1': arr[:, 0].astype(int),
                    'score2': arr[:, 1].astype(int)})
print (df2)
   id  score1  score2
0   1       1       0
1   2       1       0
2   3       0       1
3   4       0       1
4  77       1       0
5  78       1       0
6  79       0       1
7  80       1       0
8  81       0       1
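
To see why negating the values gives a descending ranking, here is a minimal sketch on a small NumPy array. The numbers are hypothetical and only loosely based on the question's data; rows stand for documents and columns for topics 0 and 1.

```python
import numpy as np

# Hypothetical probabilities: rows are documents, columns are topics 0 and 1.
probs = np.array([
    [0.002735, 2.1e-28],   # topic 0 is more probable
    [4.2e-09, 0.020171],   # topic 1 is more probable
])

# argsort sorts ascending, so negate the values to get descending order.
# Each row then lists topic positions from most to least probable.
ranks = (-probs).argsort(axis=1)
print(ranks)
```

Indexing the column labels with these positions, as done above, turns them back into topic labels.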

EDIT:

df2 = df.set_index(0).T

arr = df2.columns.values[(-df2.values).argsort()]

score = (pd.Series(arr[:, 0].astype(int).astype(str)) + ', ' + 
         pd.Series(arr[:, 1].astype(int).astype(str)))
df2 = pd.DataFrame({'id': df2.index, 
                    'score': score})
print (df2)
   id score
0   1  1, 0
1   2  1, 0
2   3  0, 1
3   4  0, 1
4  77  1, 0
5  78  1, 0
6  79  0, 1
7  80  1, 0
8  81  0, 1

EDIT1:

df2 = df.T.set_index(0).astype(float)
print (df2)
                    1             2
0                                  
778.txt  2.735043e-03  2.146513e-28
856.txt  5.470085e-03  8.006313e-16
831.txt  1.641026e-02  4.580075e-12
850.txt  4.249029e-09  2.017094e-02


arr = (-df2.values).argsort()

score = (pd.Series(arr[:, 0].astype(str)) + ', ' + 
         pd.Series(arr[:, 1].astype(str)))
df2 = pd.DataFrame({'id': df2.index.str.replace('.txt', '', regex=False),
                    'score': score})
print (df2)
    id score
0  778  0, 1
1  856  0, 1
2  831  0, 1
3  850  1, 0
jezrael
  • Can I get the values for the score 1 and score 2 attributes in one column (topic-rank, for example), separated by commas? – Samuel Mideksa Jan 30 '19 at 11:02
  • @SamuelMideksa - No before. Btw, how is the input DataFrame created? – jezrael Jan 30 '19 at 14:16
  • The dataframe is created by this code `df = pd.read_csv(pplsa.PLSA_PARAMETERS_PATH + 'topic-by-doc-matirx.csv', sep=',', header=None)` by reading from a csv file. – Samuel Mideksa Jan 31 '19 at 07:10
  • @SamuelMideksa - So change it to `df = pd.read_csv(pplsa.PLSA_PARAMETERS_PATH + 'topic-by-doc-matirx.csv', sep=',')`. Then the first row of the csv becomes the columns of the DataFrame, so change `df2 = df.T.set_index(0).astype(float)` to `df2 = df.T` – jezrael Jan 31 '19 at 07:12
  • It works but I want to remove the '.txt' in the id column and display only the numeric values. How can I remove the '.txt' value? – Samuel Mideksa Jan 31 '19 at 07:51
  • @SamuelMideksa - so use `df.columns = df.columns.str.replace('.txt', '', regex=False)` – jezrael Jan 31 '19 at 07:52
  • https://stackoverflow.com/questions/54479068/how-to-join-two-dataframes-using-index-in-pandas – Samuel Mideksa Feb 01 '19 at 12:01
  • Sorry to bother you. On the line calculating score, the code manually adds `pd.Series(arr[:, 0].astype(str)) + ', ' + pd.Series(arr[:, 1].astype(str))` for topics 0 and 1 respectively. Is there any way to automate this, for example with a for loop that adds the series for n topics? – Samuel Mideksa Feb 14 '19 at 16:24
  • @Samuel Mideksa in my opinion the best is to create a new column for each rank, like `df2 = pd.DataFrame(arr, index=df2.index)`, but if you really need to join all values then use `df2 = pd.DataFrame({'id': df2.index.str.replace('.txt', '', regex=False), 'score': pd.DataFrame(arr).astype(str).apply(','.join, axis=1)})`. I am offline, on phone only, so untested. – jezrael Feb 14 '19 at 18:00
  • It works. I have new_df with `id text 0 17337 Hi <!channel> -- interesting business news is ... 1 17338 <@U04JNBU9W>: <@U04JL900N> is already working ... 2 17339 Good news. This Chinese server is @ 120.39.251... 3 17340 Its good news. The task will continue to be pa... 4 17341 good news, keep up the good work` and I want to assign `new_df.id` to `df2.index`, but doing that I got some lower rows with 'NaN' values. How can I avoid that? – Samuel Mideksa Feb 15 '19 at 07:45
  • @SamuelMideksa - Can you modify the question with sample data? The formatting in comments is very bad. – jezrael Feb 15 '19 at 07:46
  • @SamuelMideksa - But there is text (strings), so it is not possible to use `argsort`, because you get `TypeError: '<' not supported between instances of 'int' and 'str'`. You need only numeric columns – jezrael Feb 15 '19 at 07:53
  • My primary target is not to use the `text` column but the `id` column as an index to `df2 = pd.DataFrame({'id': new_df.id, 'score': pd.DataFrame(arr).astype(str).apply(','.join, axis=1)})` assuming that new_df is the above dataframe. – Samuel Mideksa Feb 15 '19 at 07:58
  • @SamuelMideksa - So you need to join both DataFrames together by `id`? – jezrael Feb 15 '19 at 07:59
  • @jezrael Exactly. That is what I am looking for – Samuel Mideksa Feb 15 '19 at 10:07
  • OK, so use `df = pd.concat([df1, df2], axis=1)` – jezrael Feb 15 '19 at 10:08
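
The two follow-up points from the comments (joining the ranks for any number of topics, and attaching ids that live in another DataFrame) can be sketched together as below. This is only an illustration under assumptions: the 3x3 probabilities and the index labels of `new_df` are made up, and pairing rows by position is just one way to avoid NaN rows, on the guess that they came from mismatched index labels.

```python
import pandas as pd

# Hypothetical probabilities: 3 documents (rows) by 3 topics (columns).
probs = pd.DataFrame(
    [[0.1, 0.7, 0.2],
     [0.5, 0.2, 0.3],
     [0.2, 0.3, 0.5]],
    index=['778.txt', '856.txt', '831.txt'],
)

# Topic positions per document, most probable first.
order = (-probs.to_numpy()).argsort(axis=1)

# Join every rank column at once, so no per-topic Series is added by hand.
score = pd.DataFrame(order).astype(str).apply(', '.join, axis=1)

# Hypothetical frame holding the ids, with a non-default index on purpose.
new_df = pd.DataFrame({'id': [15623, 15625, 15627]}, index=[10, 11, 12])

# Pass plain arrays so pandas pairs rows by position rather than by index
# label; mismatched labels are one possible source of NaN rows.
result = pd.DataFrame({'id': new_df['id'].to_numpy(),
                       'score': score.to_numpy()})
print(result)
```

If both frames are known to share the same index labels, `pd.concat([df1, df2], axis=1)` as suggested above does the same job; calling `reset_index(drop=True)` on each frame first is another way to force positional pairing.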