Basically I have my bag of words:

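library(tm)

# build a corpus from the raw text, lower-case it, then build the document-term matrix
source <- VectorSource(text)
corpus <- Corpus(source)
cleanset <- tm_map(corpus, content_transformer(tolower))
dtm <- DocumentTermMatrix(cleanset)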

etc etc.

And I have a data frame consisting of just two columns, which I called up from a SQLite DB. Column 1 is a list of hundreds of words, and Column 2 is each word's corresponding Part of Speech code.

I am trying to match every token in my dtm to the identical term in column 1 of the data frame, so that each token can then be matched to its corresponding POS code. Essentially, the data frame is like a dictionary, and I want to match each token in my dtm to its definition.

I tried a bunch of grep functions to do this, but to no avail. Does anyone have thoughts on the best way to approach this?

Thanks!

ALW94
  • Welcome to SO! [A reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) will help you get an answer. Here, it sounds like all you need to do is subset the POS df by `dtm`, but check out the `tidytext` package, and if you don't mind converting `dtm` to a data.frame, `dplyr` joins. – alistaire May 24 '16 at 01:56
  • Thanks! My problem is that I'm not working with English (it's Latin and Greek), so the POS taggers built into various packages aren't much help. At this point I have my two dfs: `#Create dataframe from two columns of my sqlite table ##set up driver and call sqlitedb x <- 'SELECT DISTINCT token, code FROM Lexicon' df2 <-data.frame(dbGetQuery(connection, x)) #turn my dtm into a dataframe df1 <- data.frame(words = unlist(stri_extract_all_words(stri_trans_tolower(cleanset1))))` – ALW94 May 24 '16 at 03:35
  • I tried merging on the 'token' columns of both dataframes: `colnames(df1)[1] <- "token" colnames(df2)[1] <- "token" merge(df1, df2, by="token")`. Unfortunately, this did not seem to work. – ALW94 May 24 '16 at 03:46
  • What is the format of the final result that you are looking for? – Ken Benoit May 24 '16 at 13:27
  • Hi Ken, ideally a new data frame in which column 1 is every token from my dtm, and column 2 is its corresponding POS code – ALW94 May 24 '16 at 17:17
  • @KenBenoit My next thought was creating a dictionary that matches the tokens in column 1 of my lexicon to their corresponding POS codes in column 2, and then applying that dictionary to my DTM – ALW94 May 24 '16 at 18:41

1 Answer

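Try the `lookup` function in the qdap package. It returns the matching value for each term, and `NA` (via `missing=NA`) where a term has no match.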

library(qdap)

#create lookup table
words <- c("dog","cat","a", "the","run")
pos <- c("noun","noun","article","article","verb")
random <- c(3,1,2,5,4)
df <- data.frame(words, random, pos)

#create doc-term matrix
terms<- c("human", "help","dog","cat","frog", "hello","a","party","run","cheers")
freq <- c(1,2,0,2,3,0,1,4,1,0)
dtm <- data.frame(terms, freq)

#append matches
lookup(dtm$terms, data.frame(df$words,df$pos), missing=NA)
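
If you would rather not depend on qdap, the same exact-match lookup can be sketched in base R with `match()`. This is only a minimal sketch under assumptions: `lexicon` and `tokens` are made-up stand-ins for the SQLite table and for the terms of the document-term matrix (for a real tm matrix, `colnames(dtm)` should give you the terms).

```r
# Hypothetical lexicon standing in for the SQLite table (columns: token, code)
lexicon <- data.frame(
  token = c("dog", "cat", "a", "the", "run"),
  code  = c("noun", "noun", "article", "article", "verb"),
  stringsAsFactors = FALSE
)

# Hypothetical tokens standing in for the terms of the document-term matrix
tokens <- c("human", "help", "dog", "cat", "frog",
            "hello", "a", "party", "run", "cheers")

# match() gives, for each token, the row of its first exact match
# in the lexicon, or NA when the token is not in the lexicon
idx <- match(tokens, lexicon$token)

# One row per token; tokens missing from the lexicon get NA as their code
tagged <- data.frame(
  token = tokens,
  code  = lexicon$code[idx],
  stringsAsFactors = FALSE
)

print(tagged)
```

Unlike a plain `merge(df1, df2, by="token")`, which drops unmatched rows and reorders the result, this keeps every token in its original order, so the output has the two-column shape asked for in the comments.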
andrea