Here is an example of my data set:
d = {'TEXT': ['History: A 59 year old female, was sent to R/O lung nodule. Findings: Lungs and airway: The study reveals a speculated nodule with pleural tagging at anterior basal segment of LLL, measured 1.9x1.4x2.0 cm in size. Pleural tagging is seen. Partial encasement of subsegmental bronchi is seen. CA lung is considered.','History: A 59 year old woman with history of lung cancer S/P left lower lobectomy with close to pleural margin and left adrenal nodule , was sent for evaluation before post operative RT. Findings: Comparison is made to the prior study on 03/02/2009. Chest: The study reveals evidence of left lower lobectomy with compensatory hyperinflation of the LUL.']}
df2 = pd.DataFrame(data=d)
I want to apply Latent Dirichlet Allocation (LDA) to generate context for each sentence. I have already trained my model separately and want to test it on this data.
To get to LDA, I tokenize the text into sentences, because I want to classify each sentence with a topic. After sentence tokenization, I apply TF-IDF and then pass the result to LDA. I get an error at the LDA step. Here is my code:
df2["sent_token"] = df2["TEXT"].apply(nltk.sent_tokenize)
vectoriser = TfidfVectorizer(tokenizer=identity_tokenizer,stop_words='english',lowercase=False)
df2['tfidf1'] = vectoriser.fit_transform(df2['sent_token'])
lda = LatentDirichletAllocation(n_components =5)
df2['tfidf_lda']= lda.fit_transform(df2['tfidf1'])
This is where I get the error "ValueError: setting an array element with a sequence." From reading about similar errors, I gather it may be caused by rows containing different numbers of sentences, which gives sequences of different lengths. But that heterogeneity is inherent in my data, and I am not sure what the actual problem is. Please help!
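For reference, here is a minimal sketch of one way the pipeline could be restructured so that each sentence gets its own topic. It rests on an assumption about the cause: `fit_transform` returns one 2-D matrix for the whole corpus, not one value per row, so storing it in a DataFrame column and passing that column to LDA probably produces an object array that cannot be converted to numbers. The sketch gives each sentence its own row and keeps the matrices outside the DataFrame. The shortened texts, the `split_sentences` helper (a stand-in for `nltk.sent_tokenize`), the `doc_id` column and `random_state=0` are all illustrative choices, not part of the original code.

```python
import re

import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Shortened stand-in texts; the real input would be the full df2["TEXT"].
df2 = pd.DataFrame({"TEXT": [
    "History: A 59 year old female, was sent to R/O lung nodule. "
    "Pleural tagging is seen. CA lung is considered.",
    "History: A 59 year old woman with history of lung cancer. "
    "Findings: Comparison is made to the prior study.",
]})


def split_sentences(text):
    """Hypothetical stand-in for nltk.sent_tokenize (no NLTK data needed)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


df2["sent_token"] = df2["TEXT"].apply(split_sentences)

# One row per sentence; doc_id records which report each sentence came from.
sentences = (
    df2[["sent_token"]]
    .explode("sent_token")
    .rename_axis("doc_id")
    .reset_index()
)

# Vectorise the sentence strings directly; the result is one sparse matrix
# of shape (n_sentences, n_terms), kept as a variable, not a column.
vectoriser = TfidfVectorizer(stop_words="english")
tfidf = vectoriser.fit_transform(sentences["sent_token"])

# LDA takes the whole matrix and returns shape (n_sentences, n_components).
lda = LatentDirichletAllocation(n_components=5, random_state=0)
topic_dist = lda.fit_transform(tfidf)

# Assign each sentence its most probable topic.
sentences["topic"] = topic_dist.argmax(axis=1)
print(sentences[["doc_id", "sent_token", "topic"]])
```

If the vectoriser and LDA model were already fitted elsewhere, calling `transform` instead of `fit_transform` on both would reuse the trained vocabulary and topics rather than refitting them on the test sentences.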