How to convert a list of tuples from a text file into columns

Question

I have a text file which contains a list of tuples. I want to convert this list into columns.

The file contains the following data:

[(0, u'0.025*"minimalism" + 0.018*"diwali" + 0.018*"sunday" + 0.018*"minimalistics" + 0.018*"plant" + 0.010*"thought" + 0.010*"take" + 0.010*"httpstcog21yvu1vyo" + 0.010*"time" + 0.010*"cause"'), 
 (1, u'0.029*"panshet" + 0.022*"im" + 0.015*"video" + 0.015*"project" + 0.015*"shade" + 0.015*"nature" + 0.015*"motionphotography\u2026" + 0.015*"motionjpeg" + 0.015*"trip" + 0.015*"lake"'),
 (2, u'0.013*"light" + 0.013*"take" + 0.013*"minimalist" + 0.013*"unm4sk" + 0.013*"first" + 0.013*"minimalism\u2026" + 0.013*"minimal" + 0.013*"possible" + 0.013*"quick" + 0.013*"story"')]

I want the output in the following format:

topic 0         topic 1     topic 2
minimalism      panshet     light
diwali          im          take
sunday          video       minimalist
minimalistics   project     unm4sk
plant           shade       first

EDIT 1

with open('LDA.txt') as f:
    lis = [x.split() for x in f]

cols=[x for x in zip(*lis)]
for x in cols:
    print(x)

I'm guessing you have pandas? What have you tried? This is actually pretty simple. — cs95, Dec 13 '17 at 17:12
i already search lot of things n tried it but not getting answer plz post ur answer — aneeket, Dec 13 '17 at 17:16
I want to see some kind of effort. Where is your code? What links have you found? What did you try? I have the answer, but I'd prefer helping you after seeing your attempt. — cs95, Dec 13 '17 at 17:17
I don't even see a "please" in your question. You are not entitled to anyone's help here. Please show your efforts. — cs95, Dec 13 '17 at 17:19
@cᴏʟᴅsᴘᴇᴇᴅ Well, he did beg `plz` on the comment, but that just makes things worse — Matias Cicero, Dec 13 '17 at 17:21
See my answer. If there are any issues running `ast.literal_eval`, that means your file data is malformed, and I can't help you (because that's the fault of whoever saved the file). — cs95, Dec 13 '17 at 17:26

score 2 · Accepted Answer · answered Dec 13 '17 at 17:25

Your first mistake is the way you load the data from your text file. (Dumping the printed form of a list isn't the best way to save data in the first place; if you're saving Python objects, pickle is the better choice.)

Anyway, the fix is simple. When reading your file, pass its contents to ast.literal_eval, which parses the text back into a list of tuples.

import ast

with open('LDA.txt') as f:
    data = ast.literal_eval(f.read())
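
To see what ast.literal_eval hands back, here is a minimal sketch that parses a shortened copy of the question's data from an in-memory string instead of LDA.txt. The names `raw` and `parsed` are illustrative only and do not appear in the original code.

```python
import ast

# Shortened copy of the file contents from the question, held in a string
# so the example runs without LDA.txt (an assumption for illustration).
raw = '''[(0, u'0.025*"minimalism" + 0.018*"diwali"'),
 (1, u'0.029*"panshet" + 0.022*"im"')]'''

# literal_eval only accepts Python literals (lists, tuples, strings, numbers),
# so unlike eval() it will not run arbitrary code found in the file.
parsed = ast.literal_eval(raw)

print(type(parsed))   # the outer container is a list
print(parsed[0][0])   # topic id of the first tuple
print(parsed[0][1])   # weighted-word string of the first tuple
```

If the file is malformed (for example, a missing bracket), literal_eval raises an exception instead of returning partial data.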

Now comes the part you've been waiting for. You can extract the words pretty easily with re.findall. For each tuple in your data, extract all the quoted words and store them in a dictionary keyed by topic. Afterwards, pass the dictionary to the pd.DataFrame constructor (this works because every topic lists the same number of words).

import re
import pandas as pd

# Map each topic label to the list of quoted words in its string
d = {}
for i, y in data:
    d['topic {}'.format(i)] = re.findall(r'"(.*?)"', y)

df = pd.DataFrame(d)

df 
              topic 0             topic 1      topic 2
0          minimalism             panshet        light
1              diwali                  im         take
2              sunday               video   minimalist
3       minimalistics             project       unm4sk
4               plant               shade        first
5             thought              nature  minimalism…
6                take  motionphotography…      minimal
7  httpstcog21yvu1vyo          motionjpeg     possible
8                time                trip        quick
9               cause                lake        story
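
The pattern used with re.findall keeps only the words and drops the weights. If the weights are needed as well, a second capture group can pick them up. This is a small sketch on one shortened topic string; the names `topic` and `pairs` are made up for the example.

```python
import re

# One topic string in the same format as the question's data (shortened).
topic = '0.025*"minimalism" + 0.018*"diwali" + 0.018*"sunday"'

# Capture the numeric weight and the quoted word of each term,
# then store them as (word, weight) pairs with the weight as a float.
pairs = [(word, float(weight))
         for weight, word in re.findall(r'([\d.]+)\*"(.*?)"', topic)]

print(pairs)
# [('minimalism', 0.025), ('diwali', 0.018), ('sunday', 0.018)]
```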

If you want other ways of tabulating data (without using a dataframe), see here (second answer).
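
As one possible way to tabulate without a dataframe, the following sketch uses only the standard library. It is not taken from the answer referred to above; the shortened `data` list and the names `columns`, `widths`, and `lines` are assumptions for illustration.

```python
import re
from itertools import zip_longest

# Shortened data in the same shape that ast.literal_eval returns.
data = [
    (0, '0.025*"minimalism" + 0.018*"diwali" + 0.018*"sunday"'),
    (1, '0.029*"panshet" + 0.022*"im" + 0.015*"video"'),
    (2, '0.013*"light" + 0.013*"take" + 0.013*"minimalist"'),
]

# One column per topic: a header followed by that topic's words.
columns = [['topic {}'.format(i)] + re.findall(r'"(.*?)"', s) for i, s in data]

# Width of each column is the length of its longest cell.
widths = [max(len(cell) for cell in col) for col in columns]

# zip_longest turns columns into rows and tolerates topics of unequal length.
rows = zip_longest(*columns, fillvalue='')
lines = ['  '.join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
         for row in rows]

print('\n'.join(lines))
```

Unlike the DataFrame constructor, this version does not require every topic to have the same number of words, because zip_longest pads the shorter columns with empty strings.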

score 0 · Answer 2 · answered Dec 14 '17 at 03:45

I think the output looks like the __str__ format of gensim LDA model output.

Instead of printing the topics out, saving the strings, and then post-processing them:

from gensim import corpora, models

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
model.print_topics(3)

[out]:

[(51, '0.083*"response" + 0.083*"time" + 0.083*"graph" + 0.083*"trees" + 0.083*"eps" + 0.083*"computer" + 0.083*"survey" + 0.083*"interface" + 0.083*"user" + 0.083*"human"'), (48, '0.083*"response" + 0.083*"time" + 0.083*"graph" + 0.083*"trees" + 0.083*"eps" + 0.083*"computer" + 0.083*"survey" + 0.083*"interface" + 0.083*"user" + 0.083*"human"'), (42, '0.083*"response" + 0.083*"time" + 0.083*"graph" + 0.083*"trees" + 0.083*"eps" + 0.083*"computer" + 0.083*"survey" + 0.083*"interface" + 0.083*"user" + 0.083*"human"')]

You should use `models.LdaModel.top_topics()`:

model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
top3_topics = model.top_topics(corpus)[:3]
for topic, topic_score in top3_topics:
    word_scores, words = zip(*topic)
    top10_words = words[:10]
    print(top10_words)

[out]:

('time', 'response', 'user', 'computer', 'human', 'interface', 'system', 'survey', 'eps', 'trees')
('survey', 'minors', 'graph', 'computer', 'human', 'interface', 'user', 'system', 'time', 'response')
('computer', 'human', 'interface', 'user', 'system', 'time', 'survey', 'response', 'eps', 'trees')
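
The `zip(*topic)` call in the loop is what separates the words from their scores. The sketch below shows that step in isolation, assuming (as the loop implies) that each entry is a list of (score, word) pairs plus a topic-level score. The numbers are made up and do not come from a real model.

```python
# Hypothetical stand-in for one entry returned by top_topics():
# a list of (score, word) pairs and a topic-level score (values invented).
topic = [(0.12, 'time'), (0.10, 'response'), (0.08, 'user')]
topic_score = -1.5

# zip(*pairs) transposes the pairs: one tuple of scores, one tuple of words.
word_scores, words = zip(*topic)

print(words[:2])
# ('time', 'response')
```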

And if you want to put them in a `pandas.DataFrame`:

>>> import pandas as pd
>>> 
>>> top10_words_per_topic = []
>>> for topic, topic_score in top3_topics:
...     word_scores, words = zip(*topic)
...     top10_words_per_topic.append(words[:10])
...
>>> df = pd.DataFrame(top10_words_per_topic).transpose()
>>> df.rename(columns={0:'Topic0', 1:'Topic1', 2:'Topic2'})
      Topic0     Topic1     Topic2
0       time     survey   computer
1   response     minors      human
2       user      graph  interface
3   computer   computer       user
4      human      human     system
5  interface  interface       time
6     system       user     survey
7     survey     system   response
8        eps       time        eps
9      trees   response      trees
