
I have a dataframe of 10137 rows of text (named phrases) and another dataframe of 62000 terms (named words). For each phrase in the first dataframe I would like to check whether each term of the second occurs in it, recording 0 or 1 for absent or present respectively.

This snippet of code performs that process:

# Create some fake data
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada", 
             "continuous improvement is an unrealistic goal", 
             "phrase with no match")

# Apply the 'grepl' function along the list of words, and convert the result to numeric
df <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases))}))
# Name the columns the words that were searched
names(df) <- words

However, the problem is that with my initial data as described in the first lines this takes a very long time. I am trying to find a more efficient way to make the process faster. My thought was to split the work into parts (based on the volume of my dataframes), for example:

df_500 <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases[1:500]))}))
df_1000 <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases[501:1000]))}))
df_1500 <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases[1001:1500]))}))

# ... and so on up to row 10137, matching the rows of the first dataframe, and afterwards merge the partial results into one dataframe.

How can I make this run in parallel? As it is now, the commands will execute one after the other and the total time will be the same. Is this the right way to approach it?
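Something like the following is what I have in mind with the base parallel package (an untested sketch; note that mclapply relies on forking and is not available on Windows, where parLapply with a cluster would be needed):

library(parallel)

# Split the phrases into chunks of 500 and search each chunk on its own core
chunks <- split(phrases, ceiling(seq_along(phrases) / 500))
results <- mclapply(chunks, function(chunk) {
  out <- data.frame(lapply(words, function(word) as.numeric(grepl(word, chunk))))
  names(out) <- words
  out
}, mc.cores = max(1, detectCores() - 1))

# Bind the partial results back into one dataframe
df <- do.call(rbind, results)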

user8831872

1 Answer


You can use the tm package to create a document-term matrix, together with a tokeniser from RWeka.

library(tm)
library(RWeka)

First, create the bigram tokeniser:

bigram_tokeniser <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
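For illustration, the tokeniser emits both the unigrams and the bigrams of a phrase (the exact order of the tokens may vary):

bigram_tokeniser("blah blah stock and revenue")
# returns unigrams such as "stock" and "revenue" together with
# bigrams such as "blah blah" and "and revenue"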

Then create a corpus from phrases:

corpus <- VCorpus(VectorSource(phrases)) 

Then build the document-term matrix. Because of the dictionary argument, only the terms in the vector words are considered; you can change that by changing the control list:

dtm <- DocumentTermMatrix(corpus, 
                          control = list(tokenize = bigram_tokeniser,
                                         dictionary = words))
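Without the dictionary restriction, every unigram and bigram found in the corpus would become a column of the matrix:

# Same call without the dictionary: all tokens are kept as terms
dtm_all <- DocumentTermMatrix(corpus, control = list(tokenize = bigram_tokeniser))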

You can then convert the document term matrix to a matrix and get the desired output:

as.matrix(dtm)

    Terms
Docs continuous improvement revenue stock
   1                      0       1     1
   2                      0       1     0
   3                      1       0     0
   4                      0       0     0
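Note that the document-term matrix holds term counts, so a term occurring twice in a phrase yields a 2. If you need strict 0/1 indicators as in the question, clamp the counts:

m <- as.matrix(dtm)
# counts greater than 1 become 1, matching the 0/1 encoding from the question
df <- as.data.frame((m > 0) + 0L)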
clemens
  • Thank you very much. I am testing this now and will come back when the process ends. It seems efficient to me. Just a clarification: corpus is my first dataframe, right? If yes, I have converted it. – user8831872 Nov 08 '17 at 08:01
  • Thank you, it seems to be working. I think I have to detect the length of the phrases I have in the words frame, because the tokeniser may need, for example, a length of 4. Thank you again, it is a quick process. – user8831872 Nov 08 '17 at 08:20
  • You should consider using a combination of the packages `text2vec` and `tokenizers` if you need an even faster solution (see the sketch after these comments). – Manuel Bickel Nov 08 '17 at 10:00
  • Thank you, I found how to count. Just a question, because I don't know much about text analysis at an advanced level. In my initial dataframe of words there are terms of 2 tokens in length, for example "my text". I didn't change anything in your code except that I used my real data, but the code runs without error; however, the result is one row with 62000 columns, which are my terms. Is there anything I did wrong? – user8831872 Nov 08 '17 at 11:29
  • Please edit your question and post the structure of your input data, otherwise it is difficult to provide support. For this purpose, apply the function `str()` to your input data. – Manuel Bickel Nov 08 '17 at 12:50
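For reference, a minimal sketch of the text2vec route mentioned in the comments (assuming its itoken/create_vocabulary/create_dtm API; note that text2vec joins the parts of a bigram with "_" in its vocabulary):

library(text2vec)

# Tokenise the phrases and build a unigram + bigram vocabulary
it <- itoken(phrases, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it, ngram = c(1L, 2L))

# Build a sparse document-term matrix ...
dtm2 <- create_dtm(it, vocab_vectorizer(vocab))

# ... and keep only the columns for the terms of interest
dtm2 <- dtm2[, intersect(colnames(dtm2), gsub(" ", "_", words)), drop = FALSE]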