
I have data as follows:

PERMNO date DLSTCD
    10 1983     NA 
    10 1985    250 
    10 1986     NA
    10 1986     NA 
    10 1987    240 
    10 1987     NA  
    11 1984     NA  
    11 1984     NA  
    11 1985     NA  
    11 1987     NA 
    12 1984    240 

I need to filter rows based on the following criteria:

  1. For each PERMNO, sort the data by date
  2. Go through the sorted data and delete the rows after a company gets delisted (i.e. DLSTCD != NA)
  3. If the first row corresponds to the company getting delisted, do not include any rows for that company

Based on these criteria, the following is my expected output:

PERMNO date DLSTCD
    10 1983     NA 
    10 1985    250 
    11 1984     NA  
    11 1984     NA  
    11 1985     NA  
    11 1987     NA 

I am using data.table in R to work with this data. The example above is an oversimplified version of my actual data, which contains around 3M rows corresponding to 30k PERMNOs.
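For convenience, the toy example above can be rebuilt as a data.table like this (values taken straight from the table shown):

library(data.table)

dt <- data.table(
  PERMNO = c(10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12),
  date   = c(1983, 1985, 1986, 1986, 1987, 1987, 1984, 1984, 1985, 1987, 1984),
  DLSTCD = c(NA, 250, NA, NA, 240, NA, NA, NA, NA, NA, 240)
)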

I implemented three different methods for doing this, as can be seen here:
r-fiddle: http://www.r-fiddle.org/#/fiddle?id=4GapqSbX&version=3

Below I compare my implementations using a small dataset of 50k rows. Here are my results:

Time Comparison

system.time(dt <- filterbydelistingcode(dt))   # 39.962 seconds
system.time(dt <- filterbydelistcoderowindices(dt))   # 39.014 seconds
system.time(dt <- filterbydelistcodeinline(dt))   # 114.3 seconds

As you can see, all my implementations are extremely inefficient. Can someone please help me implement a much faster version? Thank you.

Edit: Here is a link to a sample dataset of 50k rows that I used for time comparison: https://ufile.io/q9d8u

Also, here is a customized read function for this data:

readdata = function(filename){
    data = read.csv(filename, header = TRUE, colClasses = c(date = "Date"))
    PRCABS = abs(data$PRC)                  # absolute price
    mcap = PRCABS * data$SHROUT             # market cap proxy: |price| * shares outstanding
    hpr = data$RET
    HPR = as.numeric(levels(hpr))[hpr]      # factor -> numeric; non-numeric levels become NA
    HPR[HPR == ""] = NA
    data = cbind(data, PRCABS, mcap, HPR)
    return(data)
}

data <- readdata('fewdata.csv')
dt <- as.data.table(data)
DenaG
  • Because you have a benchmark-centric question, please provide easily copy-pastable code to generate a test data set of relevant size. Cheers. – Henrik Jun 06 '17 at 00:15
  • @Henrik: Thank you for your comment. I am a new user of R. My concern is I am not using data.table and R the correct way. I'm sure there is a much better implementation for doing this. I'll try to upload the data someplace in the meanwhile. – DenaG Jun 06 '17 at 01:20
  • See [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on how to use e.g. `sample` to generate toy data. It's often preferable to provide code to generate the data in the post, rather than to rely on downloading it from external sites. – Henrik Jun 06 '17 at 01:25
  • @Henrik: I have updated the question with data. I have also updated the r-fiddle. – DenaG Jun 06 '17 at 01:31
  • @Frank: Thank you for your comment. But it is not possible for me to generate a synthetic dataset with the randomness inherent in the actual data. However, the link will allow you to download that data and then I have provided the read functions to read the data. – DenaG Jun 06 '17 at 03:01
  • @Frank: `cbind` is being used in the `readdata` function, which is working for me fast enough to not fiddle further. I am more concerned about the actual problem that I have posted – DenaG Jun 06 '17 at 03:02
  • Ok. Fwiw, thelatemail below shows how to construct an example data set. It may not perfectly mirror your real data, but long-term reproducibility in a question is more important. There's some more guidance here if you're interested: [mcve] – Frank Jun 06 '17 at 13:04

2 Answers


Here's an attempt in data.table:

dat[
  dat[order(date),
      {
        # pos: position of the first non-NA DLSTCD within the date-sorted group
        pos <- match(TRUE, !is.na(DLSTCD))
        # TRUE for rows to keep: everything up to and including the first delisting
        # (unless the delisting is the group's first row), or all rows if the group
        # never delists (pos is NA)
        (.I <= .I[pos] & pos != 1) | is.na(pos)
      },
      by = PERMNO]$V1   # V1 is the per-row logical flag; use it to subset dat
]

#   PERMNO date DLSTCD
#1:     10 1983     NA
#2:     10 1985    250
#3:     11 1984     NA
#4:     11 1984     NA
#5:     11 1985     NA
#6:     11 1987     NA

Testing it on 2.5 million rows, 400,000 of which have a delisting date:

set.seed(1)
dat <- data.frame(PERMNO=sample(1:22000,2.5e6,replace=TRUE), date=1:2.5e6)
dat$DLSTCD <- NA
dat$DLSTCD[sample(1:2.5e6, 400000)] <- 1
setDT(dat)

system.time({
  dat[
    dat[order(date),
        {
          pos <- match(TRUE, !is.na(DLSTCD))
          (.I <= .I[pos] & pos != 1) | is.na(pos)
        },
        by = PERMNO]$V1
  ]
})
#   user  system elapsed 
#   0.74    0.00    0.76 

Less than a second - not bad.

thelatemail
  • Thank you. This indeed is blazingly fast. I was able to run it on 3M rows in just 2 seconds as opposed to the 40 mins my implementation was taking. I am, however, not able to completely understand the `j` argument in your code. Can you please elaborate on that? – DenaG Jun 06 '17 at 05:01

Building on @thelatemail's answer, here are two more variations on the same theme.

In both cases, running setkey() first makes things easier to reason about:

setkey(dat,PERMNO,date)  # sort by PERMNO, then by date within PERMNO

Option 1: stack the data you want (if any) from each group

system.time(
  ans1 <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (!length(w)) .SD
    else if (w>1) .SD[seq_len(w)]
  }, keyby=PERMNO]
)
   user  system elapsed 
  2.604   0.000   2.605 

That's quite slow: allocating and populating many small bits of memory for each group's result, only to stack them into one single result at the end, takes time and memory.

Option 2 (closer to the way you phrased the question): find the row numbers to delete, then delete them.

system.time({
  todelete <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (length(w)) .I[seq.int(from=if (w==1) 1 else w+1, to=.N)]
  }, keyby=PERMNO]

  ans2 <- dat[ -todelete$V1 ]
})
   user  system elapsed 
  0.160   0.000   0.159

That's faster because it only stacks the row numbers to delete, followed by a single bulk operation that removes the required rows. Since it groups by the first column of the key, it uses the key to make the grouping faster (each group's rows are contiguous in RAM).
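As a quick sanity check (a sketch, assuming `ans1` and `ans2` from the two options above are still in the workspace), both options should return exactly the same rows:

# compare contents only; the key attribute may differ between the two results
all.equal(ans1, ans2, check.attributes = FALSE)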

More info about `.SD` and `.I` can be found in the data.table manual (see `?data.table`).
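As a quick illustration of the difference (a minimal sketch using the `dat` built above): `.SD` holds each group's subset of the data, while `.I` holds that group's row locations in `dat`:

dat[, .SD[1], by = PERMNO]                          # .SD: the group's rows; here, its first row
dat[, .(first_row = .I[1], n = .N), by = PERMNO]    # .I: row numbers in dat; .N: rows per group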

You can inspect and debug what is happening inside each group just by adding a call to browser() and having a look as follows.

> ans1 <- dat[, {
     browser()
     w = first(which(!is.na(DLSTCD)))
     if (!length(w)) .SD
     else if (w>1) .SD[seq_len(w)]
   }, keyby=PERMNO]
Browse[1]> .SD      # type .SD to look at it
        date DLSTCD
  1:   21679     NA
  2:   46408      1
  3:   68378     NA
  4:   75362     NA
  5:   77690     NA
 ---               
111: 2396559      1
112: 2451629     NA
113: 2461958     NA
114: 2484403     NA
115: 2485217     NA
Browse[1]> w   # doesn't exist yet because browser() before that line
Error: object 'w' not found
Browse[1]> w = first(which(!is.na(DLSTCD)))  # copy and paste line
Browse[1]> w
[1] 2
Browse[1]> if (!length(w)) .SD else if (w>1) .SD[seq_len(w)]
    date DLSTCD
1: 21679     NA
2: 46408      1
Browse[1]> # that is what is returned for this group
Browse[1]> n   # or type n to step to next line
debug at #3: w = first(which(!is.na(DLSTCD)))
Browse[2]> help  # for browser commands

Let's say you find a problem or bug with one particular PERMNO. You can make the call to browser conditional as follows.

> ans1 <- dat[, {
     if (PERMNO==42) browser()
     w = first(which(!is.na(DLSTCD)))
     if (!length(w)) .SD
     else if (w>1) .SD[seq_len(w)]
  }, keyby=PERMNO]
Browse[1]> .SD
        date DLSTCD
  1:   31018     NA
  2:   35803      1
  3:   37494     NA
  4:   50012     NA
  5:   52459     NA
 ---               
128: 2405818     NA
129: 2429995     NA
130: 2455519     NA
131: 2478605      1
132: 2497925     NA
Browse[1]> 
Matt Dowle
  • Thanks a lot for your nice answer, not the least the thorough explanation! Do you mind elaborating on why you use `keyby` here, instead of `by`? Isn't the data duly sorted already after the `setkey` step? Cheers. – Henrik Jun 08 '17 at 22:41
  • I got within 1 order of magnitude of the package author's solution. I give myself a pass grade ;-) – thelatemail Jun 09 '17 at 02:10
  • @Henrik Yes, probably `keyby=` is identical to `by=` in this case since the grouping is by the `setkey` column. I just wanted to encourage use of `keyby=` by default, really. – Matt Dowle Jun 09 '17 at 02:56