Find the number of occurrences of a set of two columns of a data frame in other data frames in R

Question

I have 103 data frames, each with 7 variables and more than 1000 rows. I want to count how often a pair of values from two columns of one data frame occurs in the other 102 data frames. In other words, how many times does c(V1,V3) (the two columns taken together, in the same row) appear in the other 102 data frames?

I've already written some code, but it is very slow!

I put all 103 data frames in a list and converted it to a single data frame. A for loop then reads each data frame one by one, and inside it a second for loop searches for each row of that data frame in the combined data.

The main part of the code is as follows:

    for(i in file){
         input<-read.table(i)

         for(j in 1:1000){
            df1<- data.table(input[j,c(1,3)]) 
            count<-merge(df1,dt, c("V1", "V3")) # dt is a data frame that includes all 103 data frames
            df1["count"]<-nrow(count)
       }
    }

In this way, I can count how many times a pair of V1 and V3 from one data frame appears in the other data frames. But obtaining the full results would take more than 50 days!

I wonder if anyone can help me with a faster way to obtain my desired results.

Example of the data frames (only 5 variables are shown here):

 V1    V2  V3   V4  V5 
 1     Q0  abc  34  3
 1     Q0  abd  31  9
 1     Q0  bac  32  3
 1     Q0  cba  56  0
 2     Q0  zxc  37  3
 2     Q0  fgc  30  3
 2     Q0  ghc  36  3

In fact, I want to find out how many times each value of V3 appears in the other data frames, but because V3 and V1 are dependent, I must consider V1 in my search as well. So I have to count how many times c(V1,V3) appears in the other data frames, for example (1, abc) together, or (1, abd).

dt has the same structure as the individual data frames, but it contains all the rows from all the data frames that I have!

Could you post a reproducible example with, for instance, two data frames? In particular, it would be helpful to see the structure of `dt`. — josliber, Jul 11 '14 at 06:19
Can you simulate some data and show us expected result? If you're new to R, here's how you can do it: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Roman Luštrik, Jul 11 '14 at 07:40
I'm afraid I cannot make it! But I can explain more. I have 103 data frames same as the above example. I want to count how many times V1 and V3 (together, V1 and V3 should be in one row) comes in these 103 data frames. — MASUMEH, Jul 11 '14 at 07:53
You can adapt the solution [here](http://stackoverflow.com/questions/24641437/logicals-of-shared-rows-of-two-data-frames/24641776#24641776) to work with a subset of columns. Instead of using `a` and `b`, you can use `a[,1:3]` and `b[,1:3]` — MrFlick, Jul 11 '14 at 13:24

Nikos · Answer 1 · 2014-12-18T11:40:31.303

I will attempt an answer, but quite frankly I am not sure I have understood your problem. You also don't give enough data for us to work with, so it is difficult to find a solution. However, here goes. I have commented out the lines that might be problematic and used some of my own. I will be glad to help further if this gets you closer.

    library(data.table)

    # One list slot per file; filled in the loop below
    V <- vector("list", length(file))
    cnt <- 1
    for (i in file) {
        #input<-read.table(i)

        # Use fread to read the file. It is very fast.
        dt <- fread(i)[, c(1,3), with=FALSE]
        # Create a dummy column which we will sum eventually
        dt[, VAL := 1]
        #dt<-merge(dt,df1, by=c('V1','V3'),all.x=TRUE)

        # Store the table in the list to build the big data.table in the end
        V[[cnt]] <- dt
        cnt <- cnt + 1

        # You don't need a for loop to merge line by line.
        #for(j in 1:1000){
            #df1<- data.table(input[j,c(1,3)])
            #count<-merge(df1,dt, c("V1", "V3")) # dt is a data frame that includes all 103 data frames
            #df1["count"]<-nrow(count)
        #}
    }

    # Create a big data.table
    V <- rbindlist(V)

    # Aggregate on V1 and V3: the sum of VAL is the number of rows per pair
    V[, lapply(.SD, sum, na.rm=TRUE), by=c('V1','V3')]
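
Since no sample files were posted, here is a small sketch of the same idea on made-up data. The tables `a` and `b` and the names `all_rows`, `pair_counts` and `a_counted` are hypothetical stand-ins, not part of the original files. It uses `.N` instead of a dummy column, which should give the same row count per pair, and then merges the counts back onto one of the frames.

```r
library(data.table)

# Hypothetical toy data standing in for two of the 103 files
a <- data.table(V1 = c(1, 1, 2), V3 = c("abc", "abd", "zxc"))
b <- data.table(V1 = c(1, 2, 2), V3 = c("abc", "zxc", "zxc"))

# Stack the frames, then count rows per (V1, V3) pair in one pass
all_rows <- rbindlist(list(a, b))
pair_counts <- all_rows[, .(count = .N), by = .(V1, V3)]

# Attach the counts back to the rows of one frame
a_counted <- merge(a, pair_counts, by = c("V1", "V3"), all.x = TRUE)
print(a_counted)
```

The grouping is done once over the stacked table rather than once per row, which is where the time saving over the nested loops would come from.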

I hope this helps. Otherwise, if you could upload a sample file, that would make things easier.
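
One caveat: the question asks about occurrences in the *other* data frames, while the aggregation above also counts each frame's own rows. If that distinction matters, a possible variation is to tag every row with its source and subtract the within-frame count. This is only a sketch on invented data; `frames`, `src`, `n_here`, `n_total` and `n_elsewhere` are names I made up for illustration.

```r
library(data.table)

# Hypothetical toy frames; in practice these would come from fread() on each file
frames <- list(
  f1 = data.table(V1 = c(1, 1, 2), V3 = c("abc", "abc", "zxc")),
  f2 = data.table(V1 = c(1, 2), V3 = c("abc", "zxc")),
  f3 = data.table(V1 = c(1, 2), V3 = c("abc", "fgc"))
)

# idcol records which list element each row came from
stacked <- rbindlist(frames, idcol = "src")

# Rows per pair within each frame, then summed across all frames
per_frame <- stacked[, .(n_here = .N), by = .(src, V1, V3)]
per_frame[, n_total := sum(n_here), by = .(V1, V3)]

# Occurrences in the other frames = total minus this frame's own rows
per_frame[, n_elsewhere := n_total - n_here]
print(per_frame)
```

Each row of `per_frame` then describes one (frame, V1, V3) combination, with `n_elsewhere` holding the count restricted to the remaining frames.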

Thanks
