1

Using the data.table package for R, I would like to be able to pull a specific record based on multiple criteria. In reality, I am dealing with a table with millions or tens of millions of records, and I want to extract the penultimate entries of a given field for records which have terminating 0 entries in that field. I can identify which records are of interest, and I can identify the dates of 0 entry, so with some creative use of lubridate I can create a list of ID and "Date I should be looking for". How can I take this two-column, 100 row list and use it to return the specific 100 values I need from a table of 10 million records?

Here is a simple example of what I am trying to do.

Given the following data table A:

    Name Date Amount
 1:    A    1    100
 2:    A    2    100
 3:    A    3    100
 4:    A    4     99
 5:    A    5     98
 6:    A    6     97
 7:    A    7     96
 8:    A    8     95
 9:    A    9     94
10:    A   10     93
11:    A   11     92
12:    A   12      0
13:    B    2    200
14:    B    3    200
15:    B    4    190
16:    B    5    180
17:    B    6    170
18:    B    7    160
19:    B    8    150
20:    B    9      0
21:    C    2    100
22:    C    3     95
23:    C    4     90
24:    C    5     90
25:    C    6     85
26:    C    7     80
27:    C    8      0

What I would like to do is pull the last non-zero Amount for each record. What I can do is create a table B:

   Name Date
1:    A   11
2:    B    8
3:    C    7

What I want is the Amount in A for each Name & Date in B. The answer should be c(92, 150, 80).

Another approach would be to extract the subset of all records such that the Name and Date field pairs are valid. I can probably concatenate the two and search that way, but in actuality, Name is a long alphanumeric string and Date is converted into POSIX, so that could get ugly.

Furthermore, I may be doing this the wrong way, if there is a simpler way to return the penultimate record by a specific field, that may be all I need, something that would function like: A[Name %in% X, second-to-last record, by = Name] where X is the list of records in which I am interested.

Thank you.

Code

A <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "C", 
"C", "C", "C", "C", "C", "C"), Date = c(1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 11L, 12L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 2L, 
3L, 4L, 5L, 6L, 7L, 8L), Amount = c(100L, 100L, 100L, 99L, 98L, 
97L, 96L, 95L, 94L, 93L, 92L, 0L, 200L, 200L, 190L, 180L, 170L, 
160L, 150L, 0L, 100L, 95L, 90L, 90L, 85L, 80L, 0L)), .Names = c("Name", 
"Date", "Amount"), row.names = c(NA, -27L), class = c("data.table", 
"data.frame"))

B <- data.table(Name = c('A', 'B', 'C'), Date = c(11, 8, 7))
Community
  • 1
  • 1
Avraham
  • 1,655
  • 19
  • 32
  • 3
    `A[Amount!=0, tail(.SD,1), Name]` – Jaap Mar 21 '16 at 19:14
  • 1
    @ProcrastinatusMaximus And good old `unique(A[Amount!=0], by="Name", fromLast=TRUE)` ... I guess there is a dupe, since I've typed that at least three times in the last three weeks. – Frank Mar 21 '16 at 19:29
  • @ProcrastinatusMaximus Thank you, that works beautifully if I pre-extract the thousands of records which I need to look at from the millions of records in the table. If you write that up as an answer, I'll accept it, as it solves my problem 8-). However, would you know the answer to the more general question? Let's say the Amount value I wanted corresponded to various dates, not necessarily the penultimate? Thanks again! – Avraham Mar 21 '16 at 19:32
  • Here's one dupe target (just for the last-row question not the OP's new generalization): http://stackoverflow.com/q/35704769/1191259 – Frank Mar 21 '16 at 19:33
  • 1
    @Avraham If you want to extend your question, I think it would be fine to edit it, since no one's answered yet. Personally, I don't know what you mean by "corresponded to various dates" without more explanation. – Frank Mar 21 '16 at 19:34
  • @Frank I mean let's say I know I want the amounts corresponding to A 5, B 4, and C 6. – Avraham Mar 21 '16 at 19:38
  • 1
    Oh ok. That's different. I would do `mytab = rbindlist(list(list(Name = "A", Date = 5), list("B", 4), list("C", 6)))` and then `A[mytab,on=names(mytab)]` – Frank Mar 21 '16 at 19:43
  • @Frank If it were three, that would be easy. I have about 8 million records to process and pull the 98 pairs of names and dates 8-) If it turns out I have to do that, I'll ask a new question, as the specific one I asked was answered here. Thank you! – Avraham Mar 21 '16 at 19:46

1 Answers1

4

There is no need to create the seperate B data.table. You can just filter out the zero values and then select the last observation. There are several ways of doing this:

# method 1:
A[Amount!=0, tail(.SD,1), by = Name]

# method 2:
A[!!Amount, .SD[.N], by = Name]

# method 3:
A[Amount!=0, lapply(.SD, last), by = Name]

# method 4:
A[Amount!=0][!duplicated(Name, fromLast = TRUE)]

# method 5 (as proposed by @Frank in the comments):
unique(A[Amount!=0], by = "Name", fromLast = TRUE)

# method 6:
A[A[Amount!=0, .I[.N], Name]$V1]

which all give:

   Name Date Amount
1:    A   11     92
2:    B    8    150
3:    C    7     80

Notes:

  1. A[Amount!=0, last(.SD), by = Name] will not work and result in an error message.
  2. Method 6 may seem overcomplicated, but especially on very large datasets the use of .I seems to be the fastest method. For benchmarks, see this answer.

With regard to your second question (as posed in the comments), you could use a join as follows to get the values of specific dates:

B <- data.table(Name = c('A', 'B', 'C'), Date = c(5, 4, 6))

A[B, on = c('Name','Date')]
# or as proposed by @Frank:
A[B, on = names(B)]

which gives you:

   Name Date Amount
1:    A    5     98
2:    B    4    190
3:    C    6     85
Community
  • 1
  • 1
Jaap
  • 81,064
  • 34
  • 182
  • 193