I'm quite new to programming and I am struggling with the scale of a dataset in an otherwise simple problem. I am trying to merge two datasets, one consisting of all outgoing messages and their senders (indexed by "message ids"), and another of all incoming messages and their recipients (indexed by "receipt ids").

Messages are often sent to multiple recipients, which is why the two datasets differ in size. The data are otherwise exhaustive: each sent message in the outgoing dataset can be found as one or more entries in the incoming dataset, and every entry in the incoming dataset lists its corresponding "message id".

All I am trying to do is add an entry to the incoming dataset which indicates the sender of that incoming e-mail. Then I would be able to construct a matrix of e-mail traffic between any two individuals.

In principle it should be straightforward. The datasets of interest are 'recipientinfo' (which lists incoming messages) and 'messages' (which lists outgoing messages). The latter is ordered by 'mid' (message id).

merged <- data.table(recipientinfo, Sender = NA_character_) # append a placeholder column to the recipient dataset, to be filled in

m <- nrow(merged)

for (i in seq_len(m)) {
  t <- merged[i, mid] # find the message id (mid) for an incoming message
  merged[i, "Sender"] <- as.character(messages[t, "sender"]) # slot in the sender from the corresponding row of the outgoing messages
}

However, the dataset is quite large (there are about 2 million incoming messages, corresponding to ~250k unique outgoing messages). So obviously the loop is completely unfeasible.

Could someone please point me in the right direction for handling this kind of problem?

Sue Doh Nimh
  • Have you tried the `merge()` command? It would help to provide some sort of [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – MrFlick Nov 10 '15 at 23:18
  • take a look at dplyr and join (http://www.inside-r.org/node/230646). – MLavoie Nov 10 '15 at 23:18
  • I did try data table but it didn't actually improve things too much - however I will look into the other suggestions thank you! – Sue Doh Nimh Nov 10 '15 at 23:21
  • You could check out [multidplyr](https://github.com/hadley/multidplyr). Dplyr in parallel. Maybe helpful with your problem. – phiver Nov 11 '15 at 09:22
  • Thanks guys - solved! – Sue Doh Nimh Nov 11 '15 at 21:11
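
The thread does not record the final solution, but the `merge()` suggestion in the comments can be sketched roughly as follows. This is only an illustration in base R on made-up toy data: the table names `messages` and `recipientinfo` and the columns `mid` and `sender` come from the question, while the `rid` and `recipient` columns and all values are assumptions. Unlike the loop, it looks senders up by matching `mid` values rather than by row position, so it does not rely on `messages` being ordered by `mid`.

```r
# Toy stand-ins for the real tables. The `rid` and `recipient` column
# names and every value here are hypothetical, for illustration only.
messages <- data.frame(
  mid    = c(1, 2, 3),
  sender = c("alice", "bob", "alice"),
  stringsAsFactors = FALSE
)
recipientinfo <- data.frame(
  rid       = 1:5,
  mid       = c(1, 1, 2, 3, 3),
  recipient = c("bob", "carol", "alice", "bob", "carol"),
  stringsAsFactors = FALSE
)

# Option 1: vectorised lookup. match() gives, for each incoming row,
# the position of its mid in messages$mid, so no explicit loop is needed.
merged <- recipientinfo
merged$Sender <- messages$sender[match(merged$mid, messages$mid)]

# Option 2: the same information as a left join with base merge().
# merge() sorts the result by the key, so row order may change.
merged2 <- merge(recipientinfo, messages[, c("mid", "sender")],
                 by = "mid", all.x = TRUE)

# Sender-by-recipient matrix of message counts.
traffic <- table(Sender = merged$Sender, Recipient = merged$recipient)
```

Both options do a single pass over the incoming rows instead of one subsetting call per row. The same join can also be written with data.table or dplyr, as the other comments suggest.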