
I am running LDA over a simple collection of documents. My goal is to extract topics and then use the extracted topics as features to evaluate my model.

I decided to use a multinomial classifier as the evaluator (the code below uses MultinomialNB). I am not sure whether that is a good choice.

import itertools
from gensim.models import ldamodel
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
from sklearn.naive_bayes import MultinomialNB

tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = {'a'}

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."

# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]

    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]


# generate LDA model
#ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

id2word = corpora.Dictionary(texts)
# Creates the Bag of Word corpus.
mm = [id2word.doc2bow(text) for text in texts]

# Trains the LDA models.
lda = ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=4,
                               update_every=1, chunksize=10000, passes=1)


# Assigns the topics to the documents in corpus
a=[]
lda_corpus = lda[mm]
for i in range(len(doc_set)):
    a.append(lda_corpus[i])
    print(lda_corpus[i])
merged_list = list(itertools.chain(*lda_corpus))
print(a)
    #my_list.append(my_list[i])


sv=MultinomialNB()

yvalues = [0,1,2,3]

sv.fit(a,yvalues)
predictclass = sv.predict(a)

testLables=[0,1,2,3]
from sklearn import metrics, tree
#yacc=metrics.accuracy_score(testLables,predictclass)
#print (yacc)

When I run this code, it throws the error mentioned in the subject (ValueError: Found array with dim 3. Estimator expected <= 2.).

This is the output of the LDA model (the document-topic distribution) that I feed to the classifier:

[[(0, 0.95533888404477663), (1, 0.014775921798986477), (2, 0.015161897773308793), (3, 0.014723296382928375)], [(0, 0.019079556242721694), (1, 0.017932434792585779), (2, 0.94498655991579728), (3, 0.018001449048895311)], [(0, 0.017957955483631164), (1, 0.017900184473362918), (2, 0.018133572636989413), (3, 0.9460082874060165)], [(0, 0.96554611572184923), (1, 0.011407838337200715), (2, 0.011537900721487016), (3, 0.011508145219463113)], [(0, 0.023306931039431281), (1, 0.022823706054846005), (2, 0.93072240824085961), (3, 0.023146954664863096)]]

My labels here are 0, 1, 2, 3.

I found a response here

but when I write:

nsamples, nx, ny = a.shape
d2_train_dataset = a.reshape((nsamples,nx*ny))

it does not work in my case, because a is a plain Python list and has no shape attribute.

Full traceback:

Traceback (most recent call last):
  File "/home/saria/PycharmProjects/TfidfLDA/test3.py", line 87, in <module>
    sv.fit(a,yvalues)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/naive_bayes.py", line 562, in fit
    X, y = check_X_y(X, y, 'csr')
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/utils/validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/home/saria/tfwithpython3.6/lib/python3.5/site-packages/sklearn/utils/validation.py", line 405, in check_array
    % (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.
sariii
  • What piece of the code is causing the error? – vielkind Aug 08 '17 at 18:19
  • @vealkind thanks for the comment :) I updated the question with traceback, many thanks for your time – sariii Aug 08 '17 at 18:35
  • @saria- within `a` you have the data stored as a list of tuples for each document. Is the numbering 0, 1, 2, 3 in each document meaningful? I think you're running into an issue since `MultinomialNB` is expecting a matrix-like object as input and you're using a list of tuples as input for each document. – vielkind Aug 08 '17 at 18:48
  • @vealkind thanks. Yes, they are topic0, topic1, topic2, topic3, so they are supposed to be the labels. For example, the LDA model generated 4 topics (each topic is a distribution over 10 words), so these topics are meant to act as the features for the SVM, and they are also the labels for measuring accuracy – sariii Aug 08 '17 at 19:00
  • Also, a is a matrix whose rows are my documents and whose columns are the features – sariii Aug 08 '17 at 19:03
  • You can drop the topic labels from `a` as those are not required to fit the model. Then `a` can be reconstructed into a 2-dimensional matrix where, for example, the row for doc1 would be simply `[0.955, 0.014, 0.015, 0.014]`. and the total matrix would have a row for each doc and 4 columns representing the values for topic0, topic1, topic2, and topic3. With your data in that format you'll be able to fit `MultinomialNB` against your y-values. – vielkind Aug 08 '17 at 19:12
  • @vealkind many thanks for your answer :) May I ask you to provide a solution? I know how to remove the parentheses (), but how can I handle just the part of the numbers that is 0, 1, 2, 3? – sariii Aug 08 '17 at 19:16

1 Answer

Calling fit on MultinomialNB raises the error because the data in a has more than two dimensions. As constructed, a holds a list of (topic, probability) tuples for each document, and the model does not accept that as input.
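To see where the third dimension comes from, it can help to count nesting levels. This is a minimal sketch using two rows copied (and rounded) from the output in the question; nesting_depth is a hypothetical helper written for this illustration, not part of gensim or scikit-learn.

```python
def nesting_depth(obj):
    """Count how many levels of list/tuple nesting wrap the innermost values."""
    if isinstance(obj, (list, tuple)) and obj:
        return 1 + max(nesting_depth(item) for item in obj)
    return 0


# Two documents from the question's LDA output, rounded for readability.
a = [
    [(0, 0.955), (1, 0.015), (2, 0.015), (3, 0.015)],
    [(0, 0.019), (1, 0.018), (2, 0.945), (3, 0.018)],
]

# documents -> topics -> (topic id, probability): three levels
depth = nesting_depth(a)

# Keeping only the probability removes one level: documents -> probabilities
flat = [[prob for _, prob in doc] for doc in a]
flat_depth = nesting_depth(flat)

print(depth, flat_depth)
```

The first number is the dimension count the estimator complains about; the second is what it can accept.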

Since the first element of each tuple is just the topic label, you can drop it and rebuild your data as a 2-dimensional matrix. The code below does that:

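new_a = []  # one row of topic probabilities per document
new_y = []  # highest-scoring topic for each document
for doc in a:
    # the topic with the largest probability becomes the document's label
    best = sorted(doc, key=lambda pair: pair[1], reverse=True)
    new_y.append(best[0][0])
    # keep only the probability from each (topic, probability) tuple
    row = [pair[1] for pair in doc]
    new_a.append(row)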

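new_a will be the list of documents, where each document contains the scores for topics 0, 1, 2, and 3, and new_y will hold the highest-scoring topic for each document. You can then call sv.fit(new_a, yvalues) to fit your model, as long as yvalues contains exactly one label per document.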
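The loop above relies on every document listing every topic in order. A variant that does not depend on that could look like the sketch below. It assumes the number of topics is known up front; to_dense_rows and top_topic are hypothetical helper names, and the sample values are rounded illustrations rather than real model output.

```python
def to_dense_rows(doc_topics, num_topics):
    """Turn [(topic_id, prob), ...] per document into fixed-width rows.

    Topics missing from a document's list are filled with 0.0, so every
    row has exactly num_topics columns regardless of ordering or gaps.
    """
    rows = []
    for doc in doc_topics:
        row = [0.0] * num_topics
        for topic_id, prob in doc:
            row[topic_id] = prob
        rows.append(row)
    return rows


def top_topic(row):
    """Return the column index holding the largest probability."""
    return max(range(len(row)), key=row.__getitem__)


sample = [
    [(0, 0.955), (1, 0.015), (2, 0.015), (3, 0.015)],
    [(1, 0.798), (4, 0.2)],  # a document that lists only two topics
]

dense = to_dense_rows(sample, num_topics=5)
labels = [top_topic(row) for row in dense]
print(dense)
print(labels)
```

Here dense has one row per document and five columns, and labels plays the same role as new_y.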

vielkind
  • I do not know how to thank you. Really, thanks for your help; I was getting quite discouraged. May I also ask you to answer my questions until tomorrow, if I have any? I am going to change this code to read from a collection of files rather than the examples in the code. I may run into errors, though I hope it goes well. Again, many thanks, life saver :) – sariii Aug 09 '17 at 00:35
  • Not a problem at all! Let me know if you have any issues. One thing to look out for: the code above makes two assumptions based on your example data, and you'll want to be sure they hold for your larger dataset. First, I assumed that every document has a value for each topic (0, 1, 2, 3 etc.). Second, I assumed that within each document the topics are already sorted and in sequential order. If either of these is not true for the larger dataset, some minor adjustments will need to be made. – vielkind Aug 09 '17 at 12:37
  • Many thanks for following my problem. It is hard to explain, but one issue is the labeling. When we use LDA, we get topics, each with a distribution over terms. In the end, each document is labeled with a topic. So in our matrix the rows are documents and the columns should be features. So far everything is fine. The problem is that the SVM needs a label for each row, so when I want 4 labels and I have 30 rows, I have to give it 30 labels to run. It seems that when I create the matrix I have to change it so that one column holds the labels. Is that clear? – sariii Aug 09 '17 at 23:50
  • I mean I shouldn't lose the topic label. When constructing the new matrix, one column should be added to store that data. May I have your view? :) Thanks – sariii Aug 10 '17 at 03:02
  • Oh, I got it, this is multi-labeling. But I don't know whether the classification methods understand probabilities here instead of integers describing how many times a word occurs in a document, because I have only seen matrices of integers, not matrices of real-valued probabilities – sariii Aug 10 '17 at 16:51
  • I'd really need a detailed description of the dataset you're working with... specifically what the features are, how the features are generated, and what outcome(s) you are trying to predict. My initial understanding was that each label could be represented as a column in the matrix (i.e. a column for topic1, topic2, topic3 etc.) where each document would have a value for each topic, and that matrix would be the input into the model to help predict some other kind of flag. Please let me know if my understanding is incorrect. – vielkind Aug 10 '17 at 18:31
  • Again, thanks for your contribution; you are right, and so far we are on the same page. My question, related to your answer, is this: if I have 5 topics but 30 samples (rows), why does it want me to create 31 labels? For example, I have 30 docs, so 30 rows, and I have 10 topics. When I run the program I need a label list like yvalues = [0,1,1,1,1,2,2,2,2,3,3,4,6,4,5,5,6,7,6,3,4,4,5,6,7,8,9,8,7,6] for it to run successfully; otherwise it throws the error ValueError: Found input variables with inconsistent numbers of samples: [31, 10] – sariii Aug 10 '17 at 19:22
  • When you call `fit` on `MultinomialNB` you have to provide a label for each document. The model cannot be fit to your data without providing the correct label for each document. Do you have a set of documents where you know the labels to each document? If you do not have correct labels for a set of documents to train your model against then you should try some kind of clustering algorithm that will return clusters of similar documents based on only your input data with no labels. – vielkind Aug 10 '17 at 19:36
  • Happy we got here. The story is that the matrix we updated had the label, but it was not in a good format for NB. For example, part of the LDA output was: [[(0, 0.95533888404477663), (1, 0.014775921798986477), (2, 0.015161897773308793), (3, 0.014723296382928375)]. It means DOC1 belongs to topic 0 because that has the highest probability, but we removed this information. So I think that when the code builds the new matrix, it should add a column and keep this data for each document – sariii Aug 10 '17 at 19:45
  • Actually, I think we should create another list at the same time as we update the matrix, so that the order in the new list stays correct. This is one example for multilabel: http://scikit-learn.org/stable/modules/multiclass.html . For example, I may have something like [(1, 0.79821891501997788), (4, 0.19951589314184426)], so the new list should be [[1,4],[...]] – sariii Aug 10 '17 at 20:02
  • I updated the code to include a `new_y` variable that will create a list of the top label for each document that will be retrieved by sorting the list of tuples and getting the topic with the highest score. – vielkind Aug 10 '17 at 20:08
  • I know I am asking a lot, sorry for that. The threshold should be 0.005, so as I said, an input may have several labels, since they are all higher than 0.005. – sariii Aug 10 '17 at 20:13
  • It raises the error ValueError: bad input shape (30, 2), as new_y is [(0, 0.00075542918989857251), (0, 0.00075542918989857251), (0, 0.00075542918989857251), ... It should just be a sequence of the correct labels, those with probability higher than 0.005, so a matrix again – sariii Aug 10 '17 at 20:15
  • Sadly, I did not understand the part you added. For example, the output should be like [[0,1],[2],[1,2,3],[0]], where these are the indices with probability higher than 0.005. This is definitely my last question about this :) – sariii Aug 10 '17 at 20:36
  • Thanks, it is resolved for now. I hope I face no more issues :| Thanks again @vealkind – sariii Aug 11 '17 at 01:41
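For readers following the thresholding discussion in the comments, one way to get label lists such as [[0,1],[2],[1,2,3],[0]] is sketched below. It is not the code that resolved the thread, which is not shown here. The helper names labels_above and to_indicator are hypothetical, the probabilities are made up for illustration, and the 0.005 threshold is the one mentioned in the comments. The binary indicator matrix is included because multilabel tooling commonly expects that layout; whether a given estimator accepts it is an assumption to check against its documentation.

```python
def labels_above(doc_topics, threshold):
    """For each document, list the topic ids whose probability exceeds threshold."""
    return [[topic for topic, prob in doc if prob > threshold]
            for doc in doc_topics]


def to_indicator(label_lists, num_topics):
    """Convert label lists into rows of 0/1 flags, one column per topic."""
    return [[1 if topic in labels else 0 for topic in range(num_topics)]
            for labels in label_lists]


# Made-up document-topic distributions for four documents and four topics.
docs = [
    [(0, 0.6), (1, 0.397), (2, 0.002), (3, 0.001)],
    [(0, 0.001), (1, 0.002), (2, 0.996), (3, 0.001)],
    [(0, 0.001), (1, 0.3), (2, 0.3), (3, 0.399)],
    [(0, 0.997), (1, 0.001), (2, 0.001), (3, 0.001)],
]

multi_labels = labels_above(docs, threshold=0.005)
indicator = to_indicator(multi_labels, num_topics=4)
print(multi_labels)
print(indicator)
```

Both results have exactly one entry per document, which is what avoids the "inconsistent numbers of samples" error quoted earlier in the thread.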