I have a long-format balanced data frame (df1) with 7 columns:
df1 <- structure(list(Product_ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3), Product_Category = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("A", "B"), class = "factor"),
Manufacture_Date = c(1950, 1950, 1950, 1950, 1950, 1960,
1960, 1960, 1960, 1960, 1940, 1940, 1940, 1940, 1940), Control_Date = c(1961L,
1962L, 1963L, 1964L, 1965L, 1961L, 1962L, 1963L, 1964L, 1965L,
1961L, 1962L, 1963L, 1964L, 1965L), Country_Code = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ABC",
"DEF", "GHI"), class = "factor"), Var1 = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Var2 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
15L), class = "data.frame")
Each Product_ID in this data set is linked to a unique combination of Product_Category, Country_Code and Manufacture_Date, and is followed over time (Control_Date). Product_Category has two possible values (A or B); Country_Code and Manufacture_Date have 190 and 90 unique values, respectively. There are 400,000 unique Product_IDs, which are followed over a period of 50 years (Control_Date from 1961 to 2010). This means that df1 has 20,000,000 rows. The last two columns of this data frame start out as NA and have to be filled using the data available in another data frame (df2):
df2 <- structure(list(Product_ID = 1:6, Product_Category = structure(c(1L,
2L, 1L, 1L, 1L, 2L), .Label = c("A", "B"), class = "factor"),
Manufacture_Date = c(1950, 1960, 1940, 1950, 1940, 2000),
Country_Code = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("ABC",
"DEF", "GHI"), class = "factor"), Year_1961 = c(5, NA, 10,
NA, 6, NA), Year_1962 = c(NA, NA, 4, 5, 3, NA), Year_1963 = c(8,
6, NA, 5, 6, NA), Year_1964 = c(NA, NA, 9, NA, 10, NA), Year_1965 = c(6,
NA, 7, 4, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
This second data frame contains another type of information on the exact same 400,000 products, in wide format. Each row represents a unique product (Product_ID) accompanied by its Product_Category, Manufacture_Date and Country_Code. There are 50 other columns (one for each year from 1961 to 2010), each containing a measured value (or NA) for each product in that year.
Now what I would like to do is fill in the Var1 and Var2 columns of the first data frame by doing a calculation on the data available in the second data frame. More precisely, for each row of the first data frame (i.e. a product at Control_Date "t"), the last two columns are defined as follows:
Var1: the total number of products in df2 with the same Product_Category, Manufacture_Date and Country_Code that have a non-NA value in Year_t;
Var2: the total number of products in df2 with a different Product_Category but the same Manufacture_Date and Country_Code that have a non-NA value in Year_t.
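To make the definitions concrete, here is a worked example using the sample df2 above, for the df1 row with Product_ID = 1 and Control_Date = 1961 (so Category = "A", Manufacture_Date = 1950, Country_Code = "ABC"); the names `same` and `other` are just illustrative:

```r
# Products sharing all three attributes with product 1: Product_IDs 1 and 4
same  <- df2$Product_Category == "A" & df2$Manufacture_Date == 1950 &
         df2$Country_Code == "ABC"
# Products with the opposite category but the same date and country: none here
other <- df2$Product_Category == "B" & df2$Manufacture_Date == 1950 &
         df2$Country_Code == "ABC"
sum(!is.na(df2$Year_1961[same]))   # Var1 = 1 (product 1 has 5, product 4 has NA)
sum(!is.na(df2$Year_1961[other]))  # Var2 = 0 (no matching products)
```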
My initial solution, with nested for-loops, is as follows:
for (i in unique(df1$Product_ID)) {
  # Look up the (unique) attributes of product i
  Category <- unique(df1[which(df1$Product_ID == i), "Product_Category"])
  Opposite_Category <- ifelse(Category == "A", "B", "A")
  Manufacture <- unique(df1[which(df1$Product_ID == i), "Manufacture_Date"])
  Country <- unique(df1[which(df1$Product_ID == i), "Country_Code"])
  # Products in df2 sharing all three attributes, and those sharing all but the category
  ID_Similar_Product <- df2[which(df2$Product_Category == Category &
                                  df2$Manufacture_Date == Manufacture &
                                  df2$Country_Code == Country), "Product_ID"]
  ID_Quasi_Similar_Product <- df2[which(df2$Product_Category == Opposite_Category &
                                        df2$Manufacture_Date == Manufacture &
                                        df2$Country_Code == Country), "Product_ID"]
  for (j in unique(df1$Control_Date)) {
    # Count the non-NA entries of the Year_j column within each group
    df1[which(df1$Product_ID == i & df1$Control_Date == j), "Var1"] <-
      sum(!is.na(df2[which(df2$Product_ID %in% ID_Similar_Product), paste0("Year_", j)]))
    df1[which(df1$Product_ID == i & df1$Control_Date == j), "Var2"] <-
      sum(!is.na(df2[which(df2$Product_ID %in% ID_Quasi_Similar_Product), paste0("Year_", j)]))
  }
}
The problem with this approach is that it takes far too long to run on the full data. So I would like to know if anybody could suggest a vectorized version that would do the job in less time.