Problem: I want to convert a list of lists into a dataframe.

Setup: I have the following list:

data = [[(1,0.8),(2,0.2)],
       [(0,0.1),(1,0.3),(2,0.6)],
       [(0,0.05),(1,0.05),(2,0.3),(3,0.4),(4,0.2)]]

This is an LDA document-topic probability list from gensim, in which each inner list is a document and each tuple is one of up to five topic probabilities. (See an earlier question I posted on Stack Overflow here.) The first element of each tuple is the topic number; the second is the probability of that topic for the document.

Note that while some documents (like the 3rd list) have all five tuples (topic probabilities), gensim LDA does not output topics whose probability is below 0.01. Documents 1 and 2 therefore have fewer than five tuples.

Goal: Use for loops to create a Document-Topic Probability matrix such that:

ProbMatrix = [(0,0.8,0.2,0,0),
        (0.1,0.3,0.6,0,0),
        (0.05,0.05,0.3,0.4,0.2)]

As noted above, zeros need to be plugged in for "missing" tuples (topics). Once I have this list, I figure I can pass it to the pandas DataFrame constructor to produce my final output (df):

df = pd.DataFrame(ProbMatrix)

My (Failed) Attempt:

ProbMatrix = []
for i in data:      #each document i
    for j in i:     #each topic j
        if j[0] == 0:
            ProbMatrix[i,0].append(j[1])
        elif j[0]  == 1:
            ProbMatrix[i,1].append(j[1])
        elif j[0]  == 2:
            ProbMatrix[i,2].append(j[1])   
        elif j[0]  == 3:
            ProbMatrix[i,3].append(j[1])   
        elif j[0]  == 4:
            ProbMatrix[i,4].append(j[1])  

The problem is how I'm referencing ProbMatrix because I'm receiving the following error:

TypeError: list indices must be integers, not tuple

Thank you for your help!

Bonus (that is, it'd be even better if you can help):

One problem I've found with gensim LDA is that, as mentioned, it does not output probabilities less than 0.01, even if minimum_probability = None. For example, see this earlier post. The example above is illustrative in that the topic probabilities sum to 1 for each document. However, in reality the output looks more like this:

data = [[(1,0.79),(2,0.2)],  # topic 1 probability 0.79 from 0.8
       [(0,0.09),(1,0.3),(2,0.6)], # topic 0 probability 0.09 from 0.1
       [(0,0.05),(1,0.05),(2,0.3),(3,0.4),(4,0.2)]]

Instead of putting zero into the unknown topic probabilities, I would like to split the remaining probability evenly across the missing topics so that each document's topic probabilities sum to 1. For example, this would result in the following ProbMatrix (values rounded):

ProbMatrix = [(0.0033,0.79,0.2,0.0033,0.0033),
        (0.09,0.3,0.6,0.005,0.005),
        (0.05,0.05,0.3,0.4,0.2)]
Rhymenoceros

4 Answers

I'm not 100% sure what you are asking, but I think this is what you need to get the ProbMatrix list you showed. You can do it like this:

import pandas as pd

data = [[(1,0.8),(2,0.2)],
       [(0,0.1),(1,0.3),(2,0.6)],
       [(0,0.05),(1,0.05),(2,0.3),(3,0.4),(4,0.2)]]
probmatrix = []

for i in data:                 # each document
    tmp = [0,0,0,0,0]          # one slot per topic, defaulting to 0
    for j in i:                # each (topic, probability) tuple
        tmp[j[0]] = j[1]
    probmatrix.append(tmp)

df = pd.DataFrame(probmatrix)
print(df)

      0     1    2    3    4
0  0.00  0.80  0.2  0.0  0.0
1  0.10  0.30  0.6  0.0  0.0
2  0.05  0.05  0.3  0.4  0.2

Since you know there will only be five elements, you can initialize a tmp list with five zeros and overwrite just the entries that are non-zero.
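If you would rather not hard-code the five zeros, the same idea can be written with a lookup per document. This is only a sketch: it assumes the highest topic id that appears anywhere in data is the last topic, and the names num_topics, lookup and prob_matrix are mine.

```python
data = [[(1, 0.8), (2, 0.2)],
        [(0, 0.1), (1, 0.3), (2, 0.6)],
        [(0, 0.05), (1, 0.05), (2, 0.3), (3, 0.4), (4, 0.2)]]

# Assumption: the largest topic id seen in the data is the last topic.
# If the model has more topics than ever appear, set num_topics by hand.
num_topics = max(topic for doc in data for topic, _ in doc) + 1

prob_matrix = []
for doc in data:
    lookup = dict(doc)  # topic id -> probability for this document
    # .get() supplies 0.0 for any topic the document does not list
    prob_matrix.append([lookup.get(k, 0.0) for k in range(num_topics)])
```

prob_matrix can then be passed to pd.DataFrame exactly as above.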

SirParselot

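Not sure if this is what you want, but i is a document (a list), and you are using it to index ProbMatrix. You could make ProbMatrix = {} instead of ProbMatrix = [] and use it as a dictionary, though the keys would then have to be hashable (for example the document's index rather than the list itself).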
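One way the dictionary idea could look in practice is to turn each document into its own dict and let pandas line the topics up. This is a sketch rather than a definitive version; it assumes five topics, and num_topics is a name I made up.

```python
import pandas as pd

data = [[(1, 0.8), (2, 0.2)],
        [(0, 0.1), (1, 0.3), (2, 0.6)],
        [(0, 0.05), (1, 0.05), (2, 0.3), (3, 0.4), (4, 0.2)]]

num_topics = 5  # assumption: the model was trained with five topics

# Each document becomes {topic id: probability}; pandas uses the keys as columns.
df = pd.DataFrame([dict(doc) for doc in data])

# Put the topic columns in order 0..4, then replace the gaps (NaN) with 0.0.
df = df.reindex(columns=range(num_topics)).fillna(0.0)
```

The reindex step matters because the columns otherwise follow whatever order the topic ids were first seen in.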

PeCosta

You cannot index a list of lists with [i,j]; Python treats i,j as a single tuple index, which is what the error message is telling you. You need an actual list of lists first. Try:

ProbMatrix[i].append(j[1])  # add a number to the list at row i

Maybe I didn't understand why you need two indices. If you do, it should be:

ProbMatrix[i][j].append(j[1])
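Note that both snippets need i to be an integer row index, whereas in the question's loop i is the document itself (and j is a tuple). A minimal sketch of the nested-list version with integer indices, assuming five topics (num_topics and prob_matrix are illustrative names):

```python
data = [[(1, 0.8), (2, 0.2)],
        [(0, 0.1), (1, 0.3), (2, 0.6)],
        [(0, 0.05), (1, 0.05), (2, 0.3), (3, 0.4), (4, 0.2)]]

num_topics = 5  # assumption: five topics, as in the question

# Build the list of lists up front: one row of zeros per document.
prob_matrix = [[0.0] * num_topics for _ in data]

# enumerate() yields the integer row index i alongside each document.
for i, doc in enumerate(data):
    for topic, prob in doc:
        prob_matrix[i][topic] = prob  # index one level at a time
```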
Kirell

If you know the desired shape of your output, you can use np.zeros to create a zero-filled NumPy array and fill it in accordingly.

import numpy as np
import pandas as pd

probMatrix = np.zeros(shape=(3,5))  # size of (num docs, k topics)

for doc_num, probs in enumerate(data):
    for k_index, prob in probs:
        probMatrix[doc_num, k_index] = prob

probMatrix then holds:

array([[ 0.  ,  0.8 ,  0.2 ,  0.  ,  0.  ],
       [ 0.1 ,  0.3 ,  0.6 ,  0.  ,  0.  ],
       [ 0.05,  0.05,  0.3 ,  0.4 ,  0.2 ]])

This array can be loaded directly into a pandas dataframe with pd.DataFrame(probMatrix) if needed, or is pretty useful just as it is.
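The bonus part of the question (spreading the leftover probability evenly over the topics gensim did not report) can be handled on the same kind of array. This is a minimal sketch, assuming five topics and the "realistic" data from the question; the names reported, remainder and share are mine and have nothing to do with gensim's API.

```python
import numpy as np

data = [[(1, 0.79), (2, 0.2)],
        [(0, 0.09), (1, 0.3), (2, 0.6)],
        [(0, 0.05), (1, 0.05), (2, 0.3), (3, 0.4), (4, 0.2)]]

num_docs, num_topics = len(data), 5  # assumption: five topics

prob_matrix = np.zeros((num_docs, num_topics))
reported = np.zeros((num_docs, num_topics), dtype=bool)  # True where gensim gave a value

for doc_num, probs in enumerate(data):
    for k_index, prob in probs:
        prob_matrix[doc_num, k_index] = prob
        reported[doc_num, k_index] = True

# Probability mass left over per document, clipped so rounding never makes it negative.
remainder = np.clip(1.0 - prob_matrix.sum(axis=1), 0.0, None)
n_missing = (~reported).sum(axis=1)

# Even share per missing topic; rows with nothing missing keep a share of 0.
share = np.divide(remainder, n_missing,
                  out=np.zeros(num_docs), where=n_missing > 0)

# Keep reported values, fill the gaps with that row's share.
prob_matrix = np.where(reported, prob_matrix, share[:, None])
```

Each row now sums to 1 (up to floating-point error), with the first row's three missing topics getting roughly 0.0033 each and the second row's two getting 0.005 each.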

leroyJr