Using the data.table package for R, I would like to be able to pull a specific record based on multiple criteria. In reality, I am dealing with a table with millions or tens of millions of records, and I want to extract the penultimate entries of a given field for records which have terminating 0 entries in that field. I can identify which records are of interest, and I can identify the dates of 0 entry, so with some creative use of lubridate I can create a list of ID and "Date I should be looking for". How can I take this two-column, 100 row list and use it to return the specific 100 values I need from a table of 10 million records?
Here is a simple example of what I am trying to do.
Given the following data table A
:
Name Date Amount
1: A 1 100
2: A 2 100
3: A 3 100
4: A 4 99
5: A 5 98
6: A 6 97
7: A 7 96
8: A 8 95
9: A 9 94
10: A 10 93
11: A 11 92
12: A 12 0
13: B 2 200
14: B 3 200
15: B 4 190
16: B 5 180
17: B 6 170
18: B 7 160
19: B 8 150
20: B 9 0
21: C 2 100
22: C 3 95
23: C 4 90
24: C 5 90
25: C 6 85
26: C 7 80
27: C 8 0
What I would like to do is pull the last non-zero Amount
for each record. What I can do is create a table B
:
Name Date
1: A 11
2: B 8
3: C 7
What I want is the Amount
in A
for each Name
& Date
in B
. The answer should be c(92, 150, 80)
.
Another approach would be to extract the subset of all records such that the Name
and Date
field pairs are valid. I can probably concatenate the two and search that way, but in actuality, Name
is a long alphanumeric string and Date
is converted into POSIX, so that could get ugly.
Furthermore, I may be doing this the wrong way, if there is a simpler way to return the penultimate record by a specific field, that may be all I need, something that would function like: A[Name %in% X, second-to-last record, by = Name]
where X
is the list of records in which I am interested.
Thank you.
Code
A <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", "C",
"C", "C", "C", "C", "C", "C"), Date = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 12L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 2L,
3L, 4L, 5L, 6L, 7L, 8L), Amount = c(100L, 100L, 100L, 99L, 98L,
97L, 96L, 95L, 94L, 93L, 92L, 0L, 200L, 200L, 190L, 180L, 170L,
160L, 150L, 0L, 100L, 95L, 90L, 90L, 85L, 80L, 0L)), .Names = c("Name",
"Date", "Amount"), row.names = c(NA, -27L), class = c("data.table",
"data.frame"))
B <- data.table(Name = c('A', 'B', 'C'), Date = c(11, 8, 7))