
Good evening Overflowers, I have an interesting dilemma that I need help with. (Please bear with me, this is super hard to explain)

I am doing some text mining. I have my corpus created and cleaned, a document-term matrix built, etc., using the tm library, and everything is working brilliantly up to where I am now. This is what I would like to do:

  1. Use my most common words or three-word "phrases" from my data.frame (these are the most common words and phrases in the data) to build a lookup list, or "dictionary" for lack of a better term. It would take one of the phrases, look it up in another dataset to see if there is a match, and if there is, return the value/description I have in the second dataset.

example code:

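library(tm)

# Term-document matrix; the corpus was created from my raw .csv file
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
# Total frequency of each term across all documents, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
# Keep the 20 most frequent terms
wordf <- head(d, 20)
wordf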

wordf looks like this from a structure perspective:

word | freq

password | 13788

Let's get into the details of dataset 2. Dataset 2 has 3 columns (a small example is below):

rownum | word | category

1 | password| Request-Access-Password_Reset

(sorry for the formatting, it's not working well for me)

What I would like to do is this: take each word from the word column of wordf, search for it in the word column of dataset 2, and if there is a match, pull back the value listed in the category column. Then write everything into a new data frame.
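The lookup described above can be sketched with base R's `merge`. This is only an illustration under assumptions: the data frame name `lookup`, the rows for "printer" and "login", their frequencies, and the "Request-Access-Login" category are made up so the example runs on its own; only the "password" row comes from the question.

```r
# Hypothetical stand-ins for wordf and dataset 2 (only "password" is from the question).
wordf <- data.frame(
  word = c("password", "printer", "login"),
  freq = c(13788, 420, 97),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  rownum = 1:2,
  word = c("password", "login"),
  category = c("Request-Access-Password_Reset", "Request-Access-Login"),
  stringsAsFactors = FALSE
)

# Left join on the word column: every row of wordf is kept,
# and words with no match in lookup get NA in category.
result <- merge(wordf, lookup[, c("word", "category")], by = "word", all.x = TRUE)

# merge() sorts by the key, so restore the most-frequent-first order.
result <- result[order(-result$freq), ]
result
```

Here `all.x = TRUE` is what keeps unmatched words in the result; without it, `merge` would drop them.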

Eventually, I would like this to work automatically via machine learning, training, etc., but for now it will be manual until I have enough data to actually train the algorithm. So Overflowers, I hope I was able to explain myself well enough; apologies if not. I know a lot of you hate generic questions like this without more details, but I hope I am getting my point across. Please help, and +10 kudo points for anyone who can assist.

user1762132
  • This sounds like a fairly standard `merge` in base R, or a `left_join` from `dplyr` if I'm understanding correctly. Would be able to give more guidance if you shared a [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) example (i.e., with `dput`) – Conor Neilson Aug 16 '18 at 03:17
  • @ConorNeilson, indeed, you are 100% correct... the merge works really well (I have done that already). Any ideas on how to put something like an "n/a" in the rows that have zero matches? And then some kind of code where you can add a "category" in dataset 2? I am trying to stay out of Excel as much as possible, as my team will mainly be using this script for their day-to-day work. How can we add new values to the category column of dataset 2 based on what we see in dataset 1? Thanks in advance, my friend. – user1762132 Aug 16 '18 at 03:43
  • The N/A will happen automatically if you perform a left join. This can be done using base R (merge, all.x=TRUE), dplyr (left_join) or data.table (y[x, nomatch = NA]). With data.table you can also easily control the result when multiple matches occur for a single row. – Wimpel Aug 16 '18 at 05:46
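One possible way to handle the follow-up about adding new categories to dataset 2 without leaving R is sketched below. It is an assumption-laden illustration, not a recommended workflow: the names `lookup` and `new_categories`, the "printer" row, its frequency, and the "Request-Hardware-Printer" category are all hypothetical.

```r
# Hypothetical stand-ins for wordf and dataset 2 (only "password" is from the question).
wordf <- data.frame(
  word = c("password", "printer"),
  freq = c(13788, 420),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  rownum = 1L,
  word = "password",
  category = "Request-Access-Password_Reset",
  stringsAsFactors = FALSE
)

# Left join, then list the words that have no category yet.
result <- merge(wordf, lookup[, c("word", "category")], by = "word", all.x = TRUE)
unmatched <- result$word[is.na(result$category)]

# Categories assigned by hand for the unmatched words (made-up value).
new_categories <- c(printer = "Request-Hardware-Printer")

# Append the new rows to the lookup table, continuing the rownum sequence.
new_rows <- data.frame(
  rownum = max(lookup$rownum) + seq_along(new_categories),
  word = names(new_categories),
  category = unname(new_categories),
  stringsAsFactors = FALSE
)
lookup <- rbind(lookup, new_rows)

# Re-running the join now fills in the previously missing category.
result2 <- merge(wordf, lookup[, c("word", "category")], by = "word", all.x = TRUE)
result2
```

Saving `lookup` back to disk afterwards (for example with `write.csv`) would let the team reuse the extended table on the next run.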
