
In R, I have a seed table that looks like this:

seed_table

|========|================|
| date   | classification |
|========|================|
| 201501 | A              |
| 201501 | A              |
| 201501 | A              |
| 201502 | B              |
| 201502 | B              |
| 201502 | B              |
| ...    | ...            |

And a data table that looks like this:

data:

|========|================|===========|================|
| ID     | Create_Date    | End_Date  | classification |
|========|================|===========|================|
| 1      | 201501         | 201601    | A              |
| 2      | 201501         | 201605    | B              |
| 3      | 201502         | 201601    | B              |
| 4      | 201412         | 201501    | A              |
| 5      | 201412         | 201502    | B              |
| 6      | 201502         | 201503    | A              |
| ...    | ...            | ...       | ...            |

I am writing the following code to get the number of "active observations" for each month and classification in the seed table. An active observation is an observation whose Created_Date <= month of the row in the seed table and whose End_Date >= month of the row in the seed table:

n <- nrow(seed_table)
num_obs <- numeric(n)
for (row in 1:n) {
    num_obs[row] <- (sum(
        data$Created_Date >= seed_table[row, "date"] &
            data$End_Date <= seed_table[row, "date"] &
            data$classification == seed_table[row, "classification"]))
    cat(n - row)
}  

However, the code is extremely slow. I have 2054 rows in the seed table (~13 months, 158 classification levels per month).

Is there any way to make this performant?

Alam
  • Please submit a minimal reproducible example. See: [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). Specifically, you should share your (minimal reproducible) data using `dput()` – Eric Fail Jan 27 '16 at 18:20
  • Read about [merge](http://stackoverflow.com/questions/1299871) and aggregate: [R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate](http://stackoverflow.com/questions/3505701), [Aggregate multiple variables simultaneously](http://stackoverflow.com/questions/9723208), [How to sum a variable by group?](http://stackoverflow.com/q/1660124) – zx8754 Jan 27 '16 at 18:23
  • [How to create a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) in R – Jaap Jan 27 '16 at 19:05

1 Answer


As @eric-fail suggested, you should use dput() to share your data. For example:

seed_table <- structure(list(
  date = c(201501L, 201501L, 201502L), 
  classification = structure(
    c(1L, 1L, 2L), .Label = c("A", "B"), class = "factor")), 
  .Names = c("date", "classification"), 
  row.names = c(1L, 2L, 4L), class = "data.frame")
data <- structure(list(
  ID = 1:6, 
  Create_Date = c(201501L, 201501L, 201502L, 201412L, 201412L, 201502L), 
  End_Date = c(201601L, 201605L, 201601L, 201501L, 201502L, 201503L), 
  classification = structure(c(1L, 2L, 2L, 1L, 2L, 1L), 
    .Label = c("A", "B"), class = "factor")), 
  .Names = c("ID", "Create_Date", "End_Date", "classification"), 
  class = "data.frame", row.names = c(NA, -6L))

I did not do a speed comparison, but getting rid of the for() loop and using the outer() function instead might speed up your calculations. Give this a try:

m1 <- outer(seed_table$date, data$Create_Date, ">=")
m2 <- outer(seed_table$date, data$End_Date, "<=")
m3 <- outer(seed_table$classification, data$classification, "==")
m <- m1 & m2 & m3
num_obs <- apply(m, 1, sum)
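
As a quick sanity check, the sketch below compares the outer() approach with a plain loop that applies the question's stated definition (Create_Date <= month and End_Date >= month) to the sample data. The function names count_active_loop and count_active_outer are made up for illustration, classification is stored as character rather than factor to avoid any factor-level mismatch, and rowSums() stands in for apply(m, 1, sum). Treat it as an illustration rather than a benchmark.

```r
# Sample data mirroring the dput() output above (classification as character)
seed_table <- data.frame(
  date = c(201501L, 201501L, 201502L),
  classification = c("A", "A", "B"),
  stringsAsFactors = FALSE
)
data <- data.frame(
  ID = 1:6,
  Create_Date = c(201501L, 201501L, 201502L, 201412L, 201412L, 201502L),
  End_Date = c(201601L, 201605L, 201601L, 201501L, 201502L, 201503L),
  classification = c("A", "B", "B", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one pass over the seed rows, counting matches per row
count_active_loop <- function(seed, obs) {
  vapply(seq_len(nrow(seed)), function(i) {
    sum(obs$Create_Date <= seed$date[i] &
          obs$End_Date >= seed$date[i] &
          obs$classification == seed$classification[i])
  }, integer(1))
}

# Hypothetical helper: the same three conditions as seed-by-observation matrices
count_active_outer <- function(seed, obs) {
  m <- outer(seed$date, obs$Create_Date, ">=") &
    outer(seed$date, obs$End_Date, "<=") &
    outer(seed$classification, obs$classification, "==")
  rowSums(m)  # number of TRUE cells in each seed row
}

loop_counts <- count_active_loop(seed_table, data)
outer_counts <- count_active_outer(seed_table, data)

# Both approaches should agree on every seed row
stopifnot(all(loop_counts == outer_counts))
print(outer_counts)
```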

Note that you had some errors in your code. You referred to Created_Date instead of Create_Date, and (I believe) you had your inequalities (>= and <=) reversed.
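
If the three matrices built by outer() become too large (each has nrow(seed_table) * nrow(data) cells), one alternative, in the spirit of the merge suggestion in the comments, is to join on classification first and then filter by date. This is only a sketch on the sample data: the seed_row column and the num_obs_merge name are assumptions added for illustration, classification is kept as character, and I have not timed it against outer().

```r
# Sample data mirroring the dput() output above (classification as character)
seed_table <- data.frame(
  date = c(201501L, 201501L, 201502L),
  classification = c("A", "A", "B"),
  stringsAsFactors = FALSE
)
data <- data.frame(
  ID = 1:6,
  Create_Date = c(201501L, 201501L, 201502L, 201412L, 201412L, 201502L),
  End_Date = c(201601L, 201605L, 201601L, 201501L, 201502L, 201503L),
  classification = c("A", "B", "B", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Tag each seed row so counts can be mapped back after the join
# (seed_row is a hypothetical helper column)
seed_table$seed_row <- seq_len(nrow(seed_table))

# Pair every seed row with the observations sharing its classification
joined <- merge(seed_table, data, by = "classification")

# Keep only the pairs where the observation is active in the seed row's month
active <- joined[joined$Create_Date <= joined$date &
                   joined$End_Date >= joined$date, ]

# Count active observations per seed row; rows without a match get 0
num_obs_merge <- tabulate(active$seed_row, nbins = nrow(seed_table))
print(num_obs_merge)
```

The join only pairs rows within the same classification, so the intermediate table is usually much smaller than the full seed-by-observation matrix when there are many classification levels.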

Jean V. Adams