
I am stuck on one problem. I am trying to categorize sentences into topics using LDA. That part works, but LDA runs on the whole dataset and gives me topic terminology across the entire dataset. I want the topic terminology per group (division) in the dataset.

So my data looks like this:

Comment                                                                                  Division
Smooth execution of Regional Administration in my absence. Well done.                    Finance
Job well done in completing CPs and making the facility available well in time.          Finance
Good Job performed on the successful implementation of Cash IVR.                         Commercial

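I run the following code to get the topics:

library(udpipe)
## ud_model is assumed to be a udpipe model that is already loaded, e.g. via udpipe_load_model()
x <- udpipe(x = praise$Feedback.Comments, object = ud_model)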

x$topic_level_id <- unique_identifier(x, fields = c("doc_id", "paragraph_id", "sentence_id"))
## Get a data.frame with 1 row per id/lemma
dtf <- subset(x, upos %in% c("NOUN", "ADJ"))
dtf <- document_term_frequencies(dtf, document = "topic_level_id", term = "lemma")
head(dtf)


dtm <- document_term_matrix(x = dtf)
## Remove words which do not occur that much
dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 5)
head(dtm_colsums(dtm_clean))

#dtm_clean <- dtm_remove_terms(dtm_clean, terms = c("%age", "4G"))
## Or keep only the top 80 terms based on mean term frequency-inverse document frequency
dtm_clean <- dtm_remove_tfidf(dtm_clean, top = 80)

library(topicmodels)
m <- LDA(dtm_clean, k = 4, method = "Gibbs", 
         control = list(nstart = 5, burnin = 2000, best = TRUE, seed = 1:5))

scores <- predict(m, newdata = dtm, type = "topics", 
                  labels = c("labela", "labelb", "labelc", "xyz"))
str(scores)

predict(m, type = "terms", min_posterior = 0.05, min_terms = 3)


## Second run: rebuild the document-term matrix with a lower frequency threshold (minfreq = 3)
dtf <- subset(x, upos %in% c("NOUN", "ADJ"))

dtf <- document_term_frequencies(dtf, document = "topic_level_id", term = "lemma")
dtm <- document_term_matrix(x = dtf)
dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 3)
## Build topic model + get topic terminology
m <- LDA(dtm_clean, k = 4, method = "Gibbs", 
         control = list(nstart = 5, burnin = 2000, best = TRUE, seed = 1:5))
topicterminology <- predict(m, type = "terms", min_posterior = 0.025, min_terms = 5)
scores <- predict(m, newdata = dtm, type = "topics")

The results I get are as follows:

$topic_001
             term       prob
1            work 0.31616890
2        customer 0.13422588
3            time 0.08616547
4            role 0.05526948
5   collaborative 0.03810505
6         service 0.03810505
7           value 0.03810505
8         amazing 0.03123927
9  implementation 0.03123927
10           line 0.02780639

I want each of these terms broken down by division_name.

Results I want

             term       prob        Division
1            work 0.31616890        Finance
2        customer 0.13422588        Finance
3            time 0.08616547        Commercial
4            role 0.05526948        Commercial
5   collaborative 0.03810505        Commercial
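
One possible way to arrive at a table of this shape is sketched below in base R. It is only an illustration under stated assumptions: `tokens` is a hypothetical stand-in for the annotated data.frame `x` after a division column has been attached to it, `terminology` stands in for one element of `topicterminology` (the probabilities are copied from the output above), and the per-division counts are made-up toy values. The rule used here, labelling each term with the division in which it occurs most often, is just one interpretation of "by division" and is not something the udpipe or topicmodels packages do for you.

```r
## Toy stand-ins (hypothetical counts), shaped like the objects in the question:
## `tokens` mimics the annotated tokens with a division column attached,
## `terminology` mimics one element of `topicterminology`.
tokens <- data.frame(
  lemma    = c("work", "work", "customer", "time", "work", "time", "customer", "role"),
  division = c("Finance", "Finance", "Finance", "Commercial", "Commercial",
               "Commercial", "Finance", "Commercial"),
  stringsAsFactors = FALSE
)
terminology <- data.frame(
  term = c("work", "customer", "time", "role"),
  prob = c(0.31616890, 0.13422588, 0.08616547, 0.05526948),
  stringsAsFactors = FALSE
)

## Count how often each topic term occurs in each division
tokens <- tokens[tokens$lemma %in% terminology$term, ]
counts <- as.data.frame(table(term = tokens$lemma, Division = tokens$division),
                        stringsAsFactors = FALSE)

## Keep, per term, the division with the highest count
## (ties are broken by alphabetical order of the division name)
counts <- counts[order(counts$term, -counts$Freq, counts$Division), ]
top_division <- counts[!duplicated(counts$term), c("term", "Division")]

## Attach the division to the topic terminology and sort by probability
result <- merge(terminology, top_division, by = "term", sort = FALSE)
result <- result[order(-result$prob), ]
rownames(result) <- NULL
print(result)
```

With real data, the same counting step could be run once per topic by looping over the elements of `topicterminology`. If a term should be allowed to appear under several divisions, the `!duplicated()` filter can be dropped and `Freq` kept as an extra column.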

Simulation Dataset

structure(list(Feedback.Comments = c("Excellent kick start of p", 
"Nauman is very collaborative when it comes to team deliverable. He takes ownership and ensure to support whenever needed or asked for. ", 
"Thank you for being very collaborative and designing and planning the whole workshop that deemed success today for R", 
"Amazing knowledge sharing session conducted by you. Truly innovative.", 
"Thanks a lot for your collaboration during my training dates", 
"During Prepaid Consolidation Step 1, you have done excellent job in handling the Mediation stream resulting in a smooth delivery.  The highlights of this delivery was the collaboration which was executed excellently.", 
"He handles all the organization customers in a very collaborative manner.", 
"Noor ul Amin is very supportive and initiative hungry person, always take very quick/bold step when ever any issue happened. ", 
"Keeping check on timely rectification of observations by HSSE with good speed.", 
"Smooth execution of Regional Administration in my absence. Well done.", 
"Good Job performed on the successful implementation of Jazz Cash IVR. the 1st selfservice IVR for financial transactions in industry.", 
"Despite challenges on the resource side you have done exceptionally well in managing the UATs, Prepaid Consolidation and assigned tasks.\n\nWe need focus more on FCR & NPS related areas so we are able to meet our KPIs, looking forward for stats and feedback on time. It would be better if we dedicate one resources on this side and not deploy all resources on prepaid consolidation (it will not give us any benefit)", 
"Job well done in reorganizing all the investments to fixed portfolios.\n Keep it up.", 
"Well done in reorganizing PF process and resolving legacy issues.", 
"Job well done in completing CPs and making the facility available well in time.", 
"You always seems supportive on these requests.sometimes you also submitted input in late hours of the day. Keep ot up.", 
"Well done on completing Hiperos screening for almost 30 profiles. Please pass the feedback to Khurram and Babar.", 
"You always make your concerns clear at a judgment. It always good to have a critical view on things, helps avoiding mistakes. Keep it up.", 
"Both FLT in Lahore and Karachi were planned, managed and executed to the perfection under your lead. Wonderful collaboration with P&O and cross functional teams. Good job and good management. ", 
"Very good resource. Always up to the expectations.\nDid good job in back office evaluations"
), division_name = c("People & Organization", "People & Organization", 
"People & Organization", "People & Organization", "People & Organization", 
"Technology", "Finance", "Finance", "People & Organization", 
"People & Organization", "Commercial", "Commercial", "Finance", 
"Finance", "Finance", "Finance", "Finance", "Finance", "People & Organization", 
"Commercial")), row.names = c(1L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 17L, 21L, 23L, 24L, 25L, 29L, 31L, 32L, 35L, 37L), class = "data.frame")
Rana Usman
  • @tushaR As you can see in my data, it is organized by division. How else could this be trained? Could you give me a pointer or hint? – Rana Usman Nov 20 '19 at 11:07
  • Subset your Feedback.Comments on the division column. Pick the comments for only one unique division and then build the model. – tushaR Nov 20 '19 at 11:50
  • @tushaR I understand this, but it won't be scalable, will it? That is why I am looking for a better way. – Rana Usman Nov 20 '19 at 11:56
  • Are you saying that the model should be built on the complete data and the topic labeling should happen at the division level? – tushaR Nov 20 '19 at 12:21
  • @tushaR Yes, that would work in my humble opinion. – Rana Usman Nov 20 '19 at 12:22
  • What is your exact expected output? Because as I read it, it is 1. training the full model, then 2. using an apply function / grouping function to predict on the sub groups (divisions). – phiver Nov 20 '19 at 14:07
  • @phiver I think that would work and is probably what I want. I can't get my head around it. – Rana Usman Nov 20 '19 at 14:19
  • @phiver I added the expected output.. like this or something of a sort – Rana Usman Nov 20 '19 at 17:26
  • @RanaUsman, I tried your example, but the predict part is not working. Probably not enough data when using the `LDA` function. Can you try to create a reproducible example that works until the predict stage? – phiver Nov 23 '19 at 16:00
  • @phiver You can change `dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 8)` to `dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 3)`, then it should work. – Rana Usman Nov 25 '19 at 10:02
  • @phiver https://datascience.stackexchange.com/questions/63853/topic-modelling-by-category-in-r – Rana Usman Nov 27 '19 at 12:41
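
The two-step idea raised in the comments (fit one model on all comments, then group the per-sentence predictions by division) could look roughly like the base R sketch below. Everything in it is a hypothetical stand-in: `scores` imitates a table of per-sentence topic assignments with a `doc_id` and a `topic` column, and `lookup` imitates a mapping from the sentence identifier to `division_name` that would have to be built from the annotated data. The values are toy data, not results from the dataset in the question.

```r
## Hypothetical per-sentence topic assignments, standing in for `scores`
scores <- data.frame(
  doc_id = c("1", "2", "3", "4", "5", "6"),
  topic  = c(1, 1, 2, 3, 2, 2),
  stringsAsFactors = FALSE
)

## Hypothetical lookup from sentence identifier to division
lookup <- data.frame(
  doc_id        = c("1", "2", "3", "4", "5", "6"),
  division_name = c("Finance", "Finance", "Finance",
                    "Commercial", "Commercial", "Commercial"),
  stringsAsFactors = FALSE
)

## Step 2 of the idea: attach the division and summarise per group
scored <- merge(scores, lookup, by = "doc_id")

## Share of sentences assigned to each topic within a division (each row sums to 1)
topic_share <- prop.table(table(scored$division_name, scored$topic), margin = 1)
print(round(topic_share, 2))
```

Because the model is trained only once, this scales with the number of divisions: adding a division only adds a row to the summary rather than requiring another model fit.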

0 Answers