Good evening, Overflowers. I have an interesting dilemma that I need help with. (Please bear with me; this is hard to explain.)
I am doing some text mining. I have created and cleaned my corpus, built the document-term matrix, etc. using the tm library, and everything is working brilliantly up to this point. Here is what I would like to do:
- Take my most common words, or three-word "phrases", from my data.frame (these are the most frequent words and phrases in the data) and build a lookup list or "dictionary", for lack of a better term. It would take one of the phrases, look it up in a second dataset, and, if there is a match, return the value/description I have in that second dataset.
example code:
dtm <- TermDocumentMatrix(corpus)           # the corpus was created from my raw .csv file
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)    # term frequencies, highest first
d <- data.frame(word = names(v), freq = v)
head(d, 20)
wordf <- d[1:20, ]                          # top 20 terms
wordf
wordf looks like this from a structure perspective:

word     | freq
---------|------
password | 13788
Let's get into the details of dataset 2. It has three columns (a small example below):

rownum | word     | category
-------|----------|------------------------------
1      | password | Request-Access-Password_Reset
What I would like to do is this: take each word from the word column of wordf, search for it in the word column of dataset 2, and if there is a match, pull back the value listed in the category column, then write everything into a new data frame.
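For reference, here is a minimal sketch of that lookup using base R's `merge()`, assuming dataset 2 is a data.frame named `dataset2` with columns `word` and `category` (the toy values below are made up for illustration):

```r
# Toy versions of the two data frames described above (illustrative only)
wordf <- data.frame(word = c("password", "login"),
                    freq = c(13788, 9021),
                    stringsAsFactors = FALSE)

dataset2 <- data.frame(rownum = 1:2,
                       word = c("password", "vpn"),
                       category = c("Request-Access-Password_Reset", "Network-VPN"),
                       stringsAsFactors = FALSE)

# Inner join: keep only words that appear in both data frames,
# carrying over the matching category value from dataset2
matched <- merge(wordf, dataset2[, c("word", "category")], by = "word")
matched
#       word  freq                      category
# 1 password 13788 Request-Access-Password_Reset
```

If you instead want to keep every row of `wordf` and get `NA` where no category exists, `match()` works as a left lookup: `wordf$category <- dataset2$category[match(wordf$word, dataset2$word)]`.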
Eventually, I would like this to work automatically via machine learning and training, but for now it will be manual until I have enough data to actually train an algorithm. So, Overflowers, I hope I explained myself well enough; apologies if not. I know many of you dislike generic questions like this without more details, but I hope I am getting my point across. Please help, and +10 kudo points for anyone who can assist.