I have two data frames. The first (dt) contains only character columns, and the second (TargetWord) is a dictionary that also contains character values. I use pmatch to find which words in dt are present in TargetWord and to return their positions in TargetWord. This works fine when the data frames are small. The problem starts when the data frames are large: the word position is returned only for the first column, and all the other columns come back as NA.

## Data Table
word_1 <- c("conflict","", "resolved", "", "", "")
word_2 <- c("", "one", "tricky", "one", "", "one")
word_3 <- c("thanks","", "", "comments", "par","")
word_4 <- c("thanks","", "", "comments", "par","")
word_5 <- c("", "one", "tricky", "one", "", "one")
dt <- data.frame(word_1, word_2, word_3,word_4, word_5, stringsAsFactors = FALSE)

## Targeted Words
TargetWord <- data.frame(cbind(c("conflict", "thanks", "tricky", "one", "two", "three")))

## convert into matrix (needed)
dt <- as.matrix(dt)
TargetWord <- as.matrix(TargetWord)

result <- `dim<-`(pmatch(dt, TargetWord, duplicates.ok=TRUE), dim(dt))
print(result)

This returns:

     [,1] [,2] [,3] [,4] [,5]
[1,]    1   NA    2    2   NA
[2,]   NA    4   NA   NA    4
[3,]   NA    3   NA   NA    3
[4,]   NA    4   NA   NA    4
[5,]   NA   NA   NA   NA   NA
[6,]   NA    4   NA   NA    4
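For readers unfamiliar with the `dim<-` idiom used above: pmatch returns a flat vector, and the backtick call only restores the matrix shape. A minimal sketch on the same toy data with the reshape written out explicitly (the names `target`, `pos` and `result2` are illustrative, not from the original code):

```r
word_1 <- c("conflict", "", "resolved", "", "", "")
word_2 <- c("", "one", "tricky", "one", "", "one")
word_3 <- c("thanks", "", "", "comments", "par", "")
word_4 <- c("thanks", "", "", "comments", "par", "")
word_5 <- c("", "one", "tricky", "one", "", "one")
dt <- as.matrix(data.frame(word_1, word_2, word_3, word_4, word_5,
                           stringsAsFactors = FALSE))
target <- c("conflict", "thanks", "tricky", "one", "two", "three")

# pmatch() treats the matrix as one long character vector (column by column)
pos <- pmatch(dt, target, duplicates.ok = TRUE)

# Restore the original shape; same effect as `dim<-`(pos, dim(dt))
result2 <- matrix(pos, nrow = nrow(dt), ncol = ncol(dt))
print(result2)
```

Empty strings never match in pmatch, which is why the blank cells come back as NA.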

Now, after reading two .csv files as below, I get a result only for the first column, whereas I want it for all columns as in the result above. Here dt1 is a 79 x 50 data frame and word_dict is a 13901 x 1 data frame.

#################### on big data #####################################
dt1 <- read.csv("C:/Users/Wonderland/Downloads/string_feature.csv", stringsAsFactors = FALSE)
word_dict <- read.csv("C:/Users/Wonderland/Downloads/word_dict.csv", stringsAsFactors = FALSE)

dt1 <- as.matrix(dt1)
word_dict <- as.matrix(word_dict)

result <- `dim<-`(pmatch(dt1, word_dict, duplicates.ok=TRUE), dim(dt1))
print(result)
NewR
  • Are you getting an error? An unexpected result? If so, how does it differ from the expected result? How do you think anybody could help? – nicola Apr 15 '16 at 18:41
  • Thanks, nicola. I am not getting an error. By unexpected result I mean that when I run the same code on my actual data, the output is only for one column (the first). I want to find the word positions from the word dictionary across the whole data frame. – NewR Apr 15 '16 at 18:44
  • A good question should provide a reproducible example and a clear description of the problem. You are providing neither. What is `the outcome is only for one column` supposed to mean? Are the other columns all `NA`s? Or are there no other columns? Did you try subsetting your data to see if the issue persists? Why don't you share some data? Or, if they are too big, at least some info on them (for instance the output of `str` and similar on each object involved). – nicola Apr 15 '16 at 18:49
  • Sorry, I am new to Stack Overflow. As you can see in my result for the small data frame, the word position is returned for each column. For the big data frame it is returned only for the `first column`; the `rest of the columns return NA`. – NewR Apr 15 '16 at 18:53
  • If you do not mind, how can I upload such a big data frame here on Stack Overflow? – NewR Apr 15 '16 at 18:54
  • How do you know the result is wrong? What does `pmatch(dt1[,2], word_dict, duplicates.ok=TRUE)` produce? – nicola Apr 15 '16 at 18:57
  • It returns all `NA`. I know the result is wrong because some words from `word_dict` are present in `dt1`. – NewR Apr 15 '16 at 19:03
  • What I am actually looking for: search all rows and columns of `dt1`, and if a word is present in `word_dict`, replace the word with its position. – NewR Apr 15 '16 at 19:07
  • Are you sure? `R` is telling you the contrary, and it is usually right. Manually find some element of the second column of `dt1` (i.e. `dt1[,2]`) that appears in `word_dict`. – nicola Apr 15 '16 at 19:10
  • Thanks again for your time. I have been trying for a long time. The strange thing is that, apart from the first column, all the others become `NA`, which should not happen. Anyway, I will try my best. – NewR Apr 15 '16 at 19:14
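The manual check suggested in the comments can be scripted. The sketch below is only a diagnostic aid under loud assumptions: the toy `dt1` and `word_dict` are stand-ins for the real files, and the stray spaces are a hypothetical cause chosen purely for illustration; nothing in the thread confirms that whitespace is the actual problem.

```r
# Toy stand-ins for the real data; the padded strings are an assumption
dt1 <- as.matrix(data.frame(
  a = c("conflict", "one"),
  b = c(" thanks", "tricky "),
  stringsAsFactors = FALSE
))
word_dict <- c("conflict", "thanks", "tricky", "one")

# How many cells in each column match a dictionary entry exactly?
hits_raw <- colSums(matrix(dt1 %in% word_dict, nrow = nrow(dt1)))

# Same count after trimming leading/trailing whitespace
hits_trim <- colSums(matrix(trimws(dt1) %in% word_dict, nrow = nrow(dt1)))

print(hits_raw)
print(hits_trim)

# pmatch on the trimmed second column
pos_b <- pmatch(trimws(dt1[, 2]), word_dict, duplicates.ok = TRUE)
print(pos_b)
```

If the raw count for a column is zero while you believe the words are there, inspecting a few cells with `dput(dt1[1:5, 2])` shows exactly what the strings contain.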

2 Answers

0

Try with apply:

apply(dt, 2, function(x) pmatch(x, TargetWord, duplicates.ok = TRUE))

As you can see, the result is the same, and it will probably also work with a huge data frame:

     word_1 word_2 word_3 word_4 word_5
[1,]      1     NA      2      2     NA
[2,]     NA      4     NA     NA      4
[3,]     NA      3     NA     NA      3
[4,]     NA      4     NA     NA      4
[5,]     NA     NA     NA     NA     NA
[6,]     NA      4     NA     NA      4

I tried with:

word_1 <- rep(c("conflict","", "resolved", "", "", ""),1000)
word_2 <- rep(c("", "one", "tricky", "one", "", "one"),1000)
word_3 <- rep(c("thanks","", "", "comments", "par",""),1000)
word_4 <- rep(c("thanks","", "", "comments", "par",""),1000)
word_5 <- rep(c("", "one", "tricky", "one", "", "one"),1000)

with all the same code and it worked.
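For completeness, here is a self-contained sketch of that test. It assumes TargetWord is used as a plain character vector, and the names `by_col`, `whole` and `same` are illustrative. It compares the column-wise apply version with the original whole-matrix call on the 6000-row data.

```r
word_1 <- rep(c("conflict", "", "resolved", "", "", ""), 1000)
word_2 <- rep(c("", "one", "tricky", "one", "", "one"), 1000)
word_3 <- rep(c("thanks", "", "", "comments", "par", ""), 1000)
word_4 <- rep(c("thanks", "", "", "comments", "par", ""), 1000)
word_5 <- rep(c("", "one", "tricky", "one", "", "one"), 1000)
dt <- as.matrix(data.frame(word_1, word_2, word_3, word_4, word_5,
                           stringsAsFactors = FALSE))
TargetWord <- c("conflict", "thanks", "tricky", "one", "two", "three")

# Column-wise matching, as suggested in this answer
by_col <- apply(dt, 2, function(x) pmatch(x, TargetWord, duplicates.ok = TRUE))

# Whole-matrix matching, as in the question
whole <- `dim<-`(pmatch(dt, TargetWord, duplicates.ok = TRUE), dim(dt))

# apply() keeps the column names, so drop them before comparing
same <- identical(unname(by_col), whole)
print(same)
```

If both approaches agree on data of this size, the NA columns in the question are more likely to come from the contents of the real files than from the number of rows.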

0

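pmatch (called here with its default duplicates.ok = FALSE) currently works only for sizes up to 100. With 100 elements every position is matched; with 101 only the first element is matched and the rest are NA: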

pmatch(rep("a", 100), rep("a", 100))
#  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
# [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
# [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
# [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
# [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
# [91]  91  92  93  94  95  96  97  98  99 100

pmatch(rep("a", 101), rep("a", 101))
#  [1]  1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#[101] NA
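The calls above use the default duplicates.ok = FALSE, whereas the question passes duplicates.ok = TRUE. The sketch below checks both modes on your own R version. It also tries base R's match(), a swapped-in alternative that neither poster used and that is suitable only when exact matches are enough (it does no partial matching). The variable names are illustrative.

```r
x <- rep("a", 101)

# Default mode (each table entry may be used once); result may vary by R version
res_default <- pmatch(x, x)

# Table entries may be reused, as in the question's call
res_dups <- pmatch(x, x, duplicates.ok = TRUE)

# Exact matching only: every "a" maps to the first "a" in the table
res_match <- match(x, x)

# Count unmatched elements in each case
print(sum(is.na(res_default)))
print(sum(is.na(res_dups)))
print(sum(is.na(res_match)))
```

Note that match() and pmatch() with duplicates.ok = TRUE both return the first matching position for repeated values, not consecutive positions.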
GKi