
I have two data frames:

word_table <-
word_9  word_1  word_3  ...word_random
word_2  na      na      ...word_random
word_5  word_3  na      ...word_random

dictionary_words <- word_2 word_3 word_4 word_6 word_7 word_8 word_9 . . . word_n

What I am looking for is to match word_table against dictionary_words and replace each word with its position in the dictionary, like this:

result <-
7   na  2   ...
1   na  na  ...
na  2   na  ...

I have tried the pmatch, charmatch, and match functions. They return the correct result when dictionary_words is short, but when it is relatively long (more than 20000 words), I get results only for the first column, and the remaining columns all become na, like this:

result <-
7   na  na  ...
1   na  na  ...
na  na  na  ...

Is there any other way to do the character matching, for example with one of the apply functions?

sample

## Name the columns with `=`; using `<-` inside data.frame() creates stray
## global variables and mangled column names.
word_table <- data.frame(word_1 = c("conflict", "", "resolved", "", "", ""),
                         word_2 = c("", "one", "tricky", "one", "", "one"),
                         word_3 = c("thanks", "", "", "comments", "par", ""),
                         word_4 = c("thanks", "", "", "comments", "par", ""),
                         word_5 = c("", "one", "tricky", "one", "", "one"),
                         stringsAsFactors = FALSE)
colnames(word_table) <- c("word_1", "word_2", "word_3", "word_4", "word_5")
## Targeted Words
dictionary_words <- data.frame(cbind(c("abovementioned","abundant","conflict", "thanks", "tricky", "one", "two", "three","four", "resolved")))

## convert into matrix (if needed)
word_table <- as.matrix(word_table)
dictionary_words <- as.matrix(dictionary_words)

## pmatch for each of the element in the dataframe (dt)
# matched_table <- pmatch(dt, TargetWord)
# dim(matched_table) <- dim(dt)
# print(matched_table) 

result <- `dim<-`(pmatch(word_table, dictionary_words, duplicates.ok=TRUE), dim(word_table))
print(result) # works fine here, but when dictionary_words is large it returns results for only the first column of word_table
NewR
  • welcome! it's a good idea to post your question along with a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) – Vincent Bonhomme Apr 16 '16 at 03:59
  • Can you show your code? Have you tried `"dim<-"(match(as.matrix(word_table), dictionary_words[,1]), dim(word_table))` – akrun Apr 16 '16 at 04:01
  • thanks vincent, it is actually hard for me to show a reproducible example, because as I have mentioned, when I am working with relatively small dataframe, it is working perfectly. but when working with large dataframe, it is returning only first column result. please find a sample as I have edited. – NewR Apr 16 '16 at 04:10
  • You don't need `data.frame(cbind` , just `data.frame(V1 = c(...` is enough. Also it is better to use `stringsAsFactors=FALSE` to avoid the column to be converted to `factor` – akrun Apr 16 '16 at 04:16
  • Can you post the `str` of the original dataset. – akrun Apr 16 '16 at 04:29
  • Thanks akrun, I have tried that as well, probably I had read your answer in another post. the method is working fine when dataframe is small, `dictionary_words` especially smaller. for large `dictionary_words` only returning result for first column. :( – NewR Apr 16 '16 at 04:31
  • `> str(small_word_table)` `chr [1:6, 1:5] "conflict" "" "resolved" "" "" "" "" "one" ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:5] "word_1" "word_2" "word_3" "word_4" ...` `> str(large_word_table)` `chr [1:79, 1:50] "conflict" "" "thanks" "" "" "conflict" ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:50] "aaa_first" "aaa_2" "aaa_3" "aaa_4" ...` – NewR Apr 16 '16 at 04:38
  • `> str(TargetWord_small_word_dictionary)` `chr [1:10, 1] "abovementioned" "abundant" "conflict" "thanks" ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr "cbind.c..abovementioned....abundant....conflict....thanks....tricky..."` `> str(large_word_dictionary)` `chr [1:13901, 1] "abba" "ability" "abovementioned" "absolute" ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr "TargetWord"` – NewR Apr 16 '16 at 04:40
  • both are same, just size is different. I guess the character match has some limitations – NewR Apr 16 '16 at 04:45
  • https://gist.github.com/bipul-mohanto/9b6a960955419f8cb689cf2c32edcff1 please find the file here – NewR Apr 16 '16 at 06:38

1 Answer


Here is a reproducible example:

 word_table <- structure(list(
     V1 = structure(c(3L, 1L, 2L), .Label = c("word_2", "word_5", "word_9"), class = "factor"),
     V2 = structure(c(1L, NA, 2L), .Label = c("word_1", "word_3"), class = "factor"),
     V3 = structure(c(1L, NA, NA), .Label = "word_3", class = "factor"),
     V4 = structure(c(1L, 1L, 1L), .Label = "...word_random", class = "factor")),
     .Names = c("V1", "V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -3L))

 dictionary_words <- structure(list(
     V1 = structure(1:7, .Label = c("word_2", "word_3", "word_4", "word_6",
                                    "word_7", "word_8", "word_9"), class = "factor")),
     .Names = "V1", class = "data.frame", row.names = c(NA, -7L))

You can use sapply:

> sapply(word_table, function(x) match(x, dictionary_words[, 1]))
     V1 V2 V3 V4
[1,]  7 NA  2 NA
[2,]  1 NA NA NA
[3,] NA  2 NA NA

or apply if you prefer:

> apply(word_table, 2, function(x) match(x, dictionary_words[, 1]))
     V1 V2 V3 V4
[1,]  7 NA  2 NA
[2,]  1 NA NA NA
[3,] NA  2 NA NA
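
As a minimal sketch, assuming the table has already been converted to a character matrix as in the question, `match` can also be run over the whole matrix at once, with the dimensions restored afterwards. The names `m`, `dict`, and `idx` and the toy values are made up for illustration.

```r
# Hypothetical toy data; `m` and `dict` are illustrative names only.
m <- matrix(c("word_9", "word_2", "word_5",
              "word_1", NA,       "word_3"),
            nrow = 3)
dict <- c("word_2", "word_3", "word_4", "word_6", "word_7", "word_8", "word_9")

# match() returns a plain integer vector, so copy the matrix shape back onto it.
idx <- match(m, dict)
dim(idx) <- dim(m)
print(idx)
```

This performs exact matching only, so a cell that differs from the dictionary entry by even one character comes back as `NA`.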
Vincent Bonhomme
  • Thanks vincent once more, Perfectly working on the sample dataframe I mentioned above. but same thing happening when `word_table` is `79x50` and `dictionary_words` is `20000x1`, the result is just coming for first column, rest all are becoming `NA` – NewR Apr 16 '16 at 04:25
  • could you paste the result of `dput(word_table)` and `dput(dictionary_words)` somewhere, e.g. in a [gist](https://gist.github.com/)? – Vincent Bonhomme Apr 16 '16 at 04:54
  • hello vincent, sorry...i am new here, so little slow. hope u will find the files in github, https://gist.github.com/bipul-mohanto/9b6a960955419f8cb689cf2c32edcff1 – NewR Apr 16 '16 at 06:37
  • it worked fine, but many of the words in `word_table` are not present in `dictionary_words`; try the following: `unique(word_table[, 1]) %in% dictionary_words` – Vincent Bonhomme Apr 16 '16 at 06:55
  • also, most words have space(s) in them, `word_table <- gsub(" *", "", word_table)` should help. – Vincent Bonhomme Apr 16 '16 at 07:14
  • Vincent, you are a true life saver. Problem solved. I do not have enough point to give u up vote, but I really appreciate your help. God bless you – NewR Apr 16 '16 at 10:02
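
The diagnosis reached in the comments can be sketched as follows, under the assumption that the unmatched cells carry stray spaces. The toy values and the names `m`, `dict`, `unmatched`, and `cleaned` are hypothetical and not taken from the gist. The sketch uses `trimws`, which strips only leading and trailing whitespace, as a swapped-in alternative to the `gsub` call suggested above, which removes every space.

```r
# Hypothetical toy data with stray spaces; not taken from the gist.
m <- matrix(c("conflict", " thanks", "tricky ", "one"), nrow = 2)
dict <- c("conflict", "thanks", "tricky", "one")

# Which distinct words fail to match the dictionary exactly?
unmatched <- setdiff(unique(as.vector(m)), dict)
print(unmatched)

# Strip leading/trailing whitespace, then match again and restore the shape.
cleaned <- trimws(m)
idx <- match(cleaned, dict)
dim(idx) <- dim(m)
print(idx)
```

Listing the unmatched words first makes it easier to tell whether the `NA`s come from whitespace or from words that are genuinely absent from the dictionary.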