2

I have a dataset with a lot of entries. Each of these entries belongs to a certain ID (belongID), the entries are unique (with uniqID), but multiple entries can come from the same source (sourceID). It is also possible that multiple entries from the same source have a the same belongID. For the purposes of the research I need to do on the dataset I have to get rid of the entries of a single sourceID that occur more than 5 times for 1 belongID. The maximum of 5 entries that need to be kept are the ones with the highest 'Time' value.

To illustrate this I have the following example dataset:

   belongID   sourceID uniqID   Time     
   1           1001     101       5            
   1           1002     102       5        
   1           1001     103       4        
   1           1001     104       3       
   1           1001     105       3     
   1           1005     106       2        
   1           1001     107       2       
   1           1001     108       2       
   2           1005     109       5                
   2           1006     110       5        
   2           1005     111       5        
   2           1006     112       5        
   2           1005     113       5      
   2           1006     114       4        
   2           1005     115       4        
   2           1006     116       3       
   2           1005     117       3                
   2           1006     118       3       
   2           1005     119       2        
   2           1006     120       2        
   2           1005     121       1      
   2           1007     122       1        
   3           1010     123       5        
   3           1480     124       2  

The example in the end should look like this:

   belongID   sourceID uniqID   Time     
   1           1001     101       5            
   1           1002     102       5        
   1           1001     103       4        
   1           1001     104       3       
   1           1001     105       3     
   1           1005     106       2        
   1           1001     107       2           
   2           1005     109       5                
   2           1006     110       5        
   2           1005     111       5        
   2           1006     112       5        
   2           1005     113       5      
   2           1006     114       4        
   2           1005     115       4        
   2           1006     116       3       
   2           1005     117       3                
   2           1006     118       3           
   2           1007     122       1        
   3           1010     123       5        
   3           1480     124       2     

There are a lot more columns with data entries in the file, but the selection has to be purely based on time. As shown in the example it can also occur that the 5th and 6th entry of a sourceID with the same belongID have the same time. In this case only 1 has to be chosen, because max=5.

The dataset here is nicely ordered on belongID and time for illustrative purposes, but in the real dataset this is not the case. Any idea how to tackle this problem? I have not come across something similar yet..

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Max van der Heijden
  • 1,095
  • 1
  • 8
  • 16

5 Answers5

1

Say your data is in df. The ordered (by uniqID) output is obtained after this:

tab <- tapply(df$Time, list(df$belongID, df$sourceID), length)
bIDs <- rownames(tab)
sIDs <- colnames(tab)
for(i in bIDs)
{
    if(all(is.na(tab[bIDs == i, ])))next
    ids <- na.omit(sIDs[tab[i, sIDs] > 5])
    for(j in ids)
    {
        cond <- df$belongID == i & df$sourceID == j
        old <- df[cond,]
        id5 <- order(old$Time, decreasing = TRUE)[1:5]
        new <- old[id5,]
        df <- df[!cond,]
        df <- rbind(df, new)
    }
}
df[order(df$uniqID), ]
danas.zuokas
  • 4,551
  • 4
  • 29
  • 39
1

A solution in two lines using the plyr package:

library(plyr)
x <- ddply(dat, .(belongID, sourceID), function(x)tail(x[order(x$Time), ], 5))
xx <- x[order(x$belongID, x$uniqID), ]

The results:

   belongID sourceID uniqID Time
5         1     1001    101    5
6         1     1002    102    5
4         1     1001    103    4
2         1     1001    104    3
3         1     1001    105    3
7         1     1005    106    2
1         1     1001    108    2
10        2     1005    109    5
16        2     1006    110    5
11        2     1005    111    5
17        2     1006    112    5
12        2     1005    113    5
15        2     1006    114    4
9         2     1005    115    4
13        2     1006    116    3
8         2     1005    117    3
14        2     1006    118    3
18        2     1007    122    1
19        3     1010    123    5
20        3     1480    124    2
Andrie
  • 176,377
  • 47
  • 447
  • 496
  • When using this method my output (same as the results you present) is the same as the input. The constraint of the maximum of the 5 most recent entries of a sourceID per belongID is not applied? – Max van der Heijden May 29 '12 at 12:07
  • 1
    @MaxvanderHeijden My mistake. Code fixed. – Andrie May 29 '12 at 12:13
1

if dat is your dataframe:

do.call(rbind, 
        by(dat, INDICES=list(dat$belongID, dat$sourceID), 
           FUN=function(x) head(x[order(x$Time, decreasing=TRUE), ], 5)))
Matthew Plourde
  • 43,932
  • 7
  • 96
  • 113
  • Nice and effectively written solution. Works perfectly on my example dataset. I will not try all methods on my complete dataset and do a little benchmark after which I will choose the fastest option as the accepted solution. Will report back asap! – Max van der Heijden May 29 '12 at 12:27
0

The dataset on which this method is going to be used has 170.000+ entries and almost 30 columns

Benchmarking each of the three provided solutions by danas.zuokas, mplourde and Andrie with the use of my dataset, provided the following outcomes:

danas.zuokas' solution:

   User     System  Elapsed 
   2829.569   0     2827.86

mplourde's solution:

   User     System  Elapsed 
   765.628  0.000   763.908

Aurdie's solution:

   User     System  Elapsed 
   984.989  0.000   984.010

Therefore I will use mplourde's solution. Thank you all!

Max van der Heijden
  • 1,095
  • 1
  • 8
  • 16
0

This should be faster, using data.table :

DT = as.data.table(dat)

DT[, .SD[tail(order(Time),5)], by=list(belongID, sourceID)]

Aside : suggest to count the number of times the same variable name is repeated in the various answers to this question. Do you ever have a lot of long or similar object names?

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224