In R, I have a seed table that looks like this:
seed_table
|========|================|
| date | classification |
|========|================|
| 201501 | A |
| 201501 | A |
| 201501 | A |
| 201502 | B |
| 201502 | B |
| 201502 | B |
| ... | ... |
And a data table that looks like this
data:
|========|================|===========|================|
| ID | Create_Date | End_Date | classification |
|========|================|===========|================|
| 1 | 201501 | 201601 | A |
| 2 | 201501 | 201605 | B |
| 3 | 201502 | 201601 | B |
| 4 | 201412 | 201501 | A |
| 5 | 201412 | 201502 | B |
| 6 | 201502 | 201503 | A |
| ... | ... | ... | ... |
I am writing the following code to get the number of "active observations" for each month and classification in the seed table. An active observation is an observation whose Created_Date <= month
of the row in the seed table and whose End_Date >= month
of the row in the seed table:
n <- nrow(seed_table)
num_obs <- numeric(n)
for (row in 1:n) {
num_obs[row] <- (sum(
data$Created_Date >= seed_table[row, "date"] &
data$End_Date <= seed_table[row, "date"] &
data$classification == seed_table[row, "classification"]))
cat(n - row)
}
However the code is extremely slow. I have 2054 rows in the seed table (~13 months, 158 classification levels month)
Is there any way to make this performant?