
I have a long-format balanced data frame (df1) that has 7 columns:

df1 <- structure(list(Product_ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 
3, 3, 3, 3), Product_Category = structure(c(1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("A", "B"), class = "factor"), 
    Manufacture_Date = c(1950, 1950, 1950, 1950, 1950, 1960, 
    1960, 1960, 1960, 1960, 1940, 1940, 1940, 1940, 1940), Control_Date = c(1961L, 
    1962L, 1963L, 1964L, 1965L, 1961L, 1962L, 1963L, 1964L, 1965L, 
    1961L, 1962L, 1963L, 1964L, 1965L), Country_Code = structure(c(1L, 
    1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ABC", 
    "DEF", "GHI"), class = "factor"), Var1 = c(NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Var2 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 
15L), class = "data.frame")

Each Product_ID in this data set is linked to a single Product_Category, Country_Code and Manufacture_Date, and is followed over time (Control_Date). Product_Category has two possible values (A or B); Country_Code and Manufacture_Date have 190 and 90 unique values, respectively. There are 400,000 unique Product_IDs, which are followed over a period of 50 years (Control_Date from 1961 to 2010). This means that df1 has 20,000,000 rows. The last two columns of this data frame start out as NA and have to be filled in using the data available in another data frame (df2):

df2 <- structure(list(Product_ID = 1:6, Product_Category = structure(c(1L, 
2L, 1L, 1L, 1L, 2L), .Label = c("A", "B"), class = "factor"), 
    Manufacture_Date = c(1950, 1960, 1940, 1950, 1940, 2000), 
    Country_Code = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("ABC", 
    "DEF", "GHI"), class = "factor"), Year_1961 = c(5, NA, 10, 
    NA, 6, NA), Year_1962 = c(NA, NA, 4, 5, 3, NA), Year_1963 = c(8, 
    6, NA, 5, 6, NA), Year_1964 = c(NA, NA, 9, NA, 10, NA), Year_1965 = c(6, 
    NA, 7, 4, NA, NA)), row.names = c(NA, 6L), class = "data.frame")

This second data frame contains another type of information on the exact same 400,000 products, in wide format. Each row represents a unique product (Product_ID) accompanied by its Product_Category, Manufacture_Date and Country_Code. There are 50 further columns (one for each year from 1961 to 2010) that contain a measured value (or NA) for each product in each of those years.

Now what I would like to do is fill in the Var1 and Var2 columns of the first data frame, based on a calculation over the data available in the second data frame. More precisely, for each row of the first data frame (i.e. a product at Control_Date "t"), the last two columns are defined as follows:

Var1: the total number of products in df2 with the same Product_Category, Manufacture_Date and Country_Code that have a non-NA value in Year_t;

Var2: the total number of products in df2 with a different Product_Category but the same Manufacture_Date and Country_Code that have a non-NA value in Year_t.
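
To make these definitions concrete, here is a quick base-R check (just an illustration on the small example data above, not the intended solution) for a single row of df1, namely Product_ID 1, i.e. the triple (A, 1950, ABC), at Control_Date 1961:

same  <- df2$Product_Category == "A" & df2$Manufacture_Date == 1950 & df2$Country_Code == "ABC"
other <- df2$Product_Category == "B" & df2$Manufacture_Date == 1950 & df2$Country_Code == "ABC"
sum(!is.na(df2$Year_1961[same]))   # Var1: products 1 and 4 share the triple, only product 1 is non-NA, so 1
sum(!is.na(df2$Year_1961[other]))  # Var2: no product has category B with 1950 and ABC, so 0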

My initial solution with nested for-loops is as follows:

for (i in unique(df1$Product_ID)){

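    # Each product has a single category, manufacture date and country, so unique() returns one value per product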
    Category <- unique(df1[which(df1$Product_ID==i),"Product_Category"])
    Opposite_Category <- ifelse(Category=="A","B","A")
    Manufacture <- unique(df1[which(df1$Product_ID==i),"Manufacture_Date"])
    Country <- unique(df1[which(df1$Product_ID==i),"Country_Code"])

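    # Products in df2 sharing the full triple (same category) vs. those sharing only the date/country pair (opposite category)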
    ID_Similar_Product <- df2[which(df2$Product_Category==Category & df2$Manufacture_Date==Manufacture & df2$Country_Code==Country),"Product_ID"]
    ID_Quasi_Similar_Product <- df2[which(df2$Product_Category==Opposite_Category & df2$Manufacture_Date==Manufacture & df2$Country_Code==Country),"Product_ID"]

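    # For each control year, count the non-NA measurements of those products in the corresponding Year_ column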
    for (j in unique(df1$Control_Date)){
        df1[which(df1$Product_ID==i & df1$Control_Date==j),"Var1"] <- length(which(!is.na(df2[which(df2$Product_ID %in% ID_Similar_Product),paste0("Year_",j)])))
        df1[which(df1$Product_ID==i & df1$Control_Date==j),"Var2"] <- length(which(!is.na(df2[which(df2$Product_ID %in% ID_Quasi_Similar_Product),paste0("Year_",j)])))
    }
}

The problem with this approach is that it takes a very long time to run, so I would like to know if anybody could suggest a vectorized version that would do the job in less time.

Sohrab
  • Hey! Please, give us a small sample of your data. Post the output of `dput(yourDataframe[1:20,])`. – Georgery Feb 19 '20 at 09:39
  • please [make your question reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Wimpel Feb 19 '20 at 09:39
  • Thanks for your reply Georgery and Wimpel. I just edited my question and added a reproducible example to it. – Sohrab Feb 19 '20 at 10:24
  • Hi Sohrab! You said that the second data frame contains 400,000 records where each row represents a unique product (Product_ID) accompanied by its Product_Category, Manufacture_Date and Country_Code. But the number of unique combinations of Product_Category (2), Manufacture_Date (90) and Country_Code (190) is only 34,200, not 400,000. Can you clarify that? Is there another variable that is increasing the size of the data? – Edward Feb 20 '20 at 05:28
  • Hello @Edward! Thank you for your comment. The triple (Product_Category, Manufacture_Date, Country_Code) is not necessarily unique in the second data frame. What each row of the second data frame uniquely defines is the quadruple (Product_ID, Product_Category, Manufacture_Date, Country_Code), i.e. two different products (with different IDs) may have exactly the same Product_Category, Manufacture_Date and Country_Code. That is why the number of rows is not restricted to 34,200. Hope this clarifies! – Sohrab Feb 21 '20 at 08:24
  • Hi Sohrab. See my answer below. Let me know if it works. I assumed that Var1 and Var2 are the _number_ of non-missing measurements, not the total _sum_ of the actual values for each triplet combination. – Edward Feb 21 '20 at 15:58

1 Answer


See if this does what you want. I'm using the data.table package since you have a rather large (20M-row) dataset.

library(data.table)

setDT(df1)
setDT(df2)

# Set keys on the "triplet" to speed up everything
setkey(df1, Product_Category, Manufacture_Date, Country_Code)
setkey(df2, Product_Category, Manufacture_Date, Country_Code)

# Omit the Var1 and Var2 from df1
df1[, c("Var1", "Var2") := NULL]

# Reshape df2 to long form
df2.long <- melt(df2, measure=patterns("^Year_"))

# Split "variable" at the "_" to extract 4-digit year into "Control_Date" and delete leftovers.
df2.long[, c("variable","Control_Date") := tstrsplit(variable, "_", fixed=TRUE)][
  , variable := NULL]
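# Note: tstrsplit() returns character, so Control_Date in df2.long is "1961", "1962", ...
# If a later step had to join on year, it could be converted first, e.g. df2.long[, Control_Date := as.integer(Control_Date)],
# but the counts below only use the Product_Category / Manufacture_Date / Country_Code grouping.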

# Group by triplet, Var1=count non-NA in value, join with... 
#   (Group by doublet, N=count non-NA), update Var2=N-Var1.
df2_N <- df2.long[, .(Var1 = sum(!is.na(value))), 
                   by=.(Product_Category, Manufacture_Date, Country_Code)][
                     df2.long[, .(N = sum(!is.na(value))), 
                              by=.(Manufacture_Date, Country_Code)], 
                     Var2 := N - Var1, on=c("Manufacture_Date", "Country_Code")]

# Update join: df1 with df2_N
df1[df2_N, c("Var1","Var2") := .(i.Var1, i.Var2), 
           on = .(Product_Category, Manufacture_Date, Country_Code)]

df1
   Product_ID Product_Category Manufacture_Date Control_Date Country_Code Var1 Var2
 1:          3                A             1940         1961          GHI    4    0
 2:          3                A             1940         1962          GHI    4    0
 3:          3                A             1940         1963          GHI    4    0
 4:          3                A             1940         1964          GHI    4    0
 5:          3                A             1940         1965          GHI    4    0
 6:          1                A             1950         1961          ABC    6    0
 7:          1                A             1950         1962          ABC    6    0
 8:          1                A             1950         1963          ABC    6    0
 9:          1                A             1950         1964          ABC    6    0
10:          1                A             1950         1965          ABC    6    0
11:          2                B             1960         1961          DEF   NA   NA
12:          2                B             1960         1962          DEF   NA   NA
13:          2                B             1960         1963          DEF   NA   NA
14:          2                B             1960         1964          DEF   NA   NA
15:          2                B             1960         1965          DEF   NA   NA

df2
   Product_ID Product_Category Manufacture_Date Country_Code Year_1961 Year_1962 Year_1963 Year_1964 Year_1965
1:          5                A             1940          DEF         6         3         6        10        NA
2:          3                A             1940          GHI        10         4        NA         9         7
3:          1                A             1950          ABC         5        NA         8        NA         6
4:          4                A             1950          ABC        NA         5         5        NA         4
5:          2                B             1940          DEF        NA        NA         6        NA        NA
6:          6                B             2000          GHI        NA        NA        NA        NA        NA
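
As an optional sanity check (just a sketch reusing df2.long and df2_N from above): since Var2 was computed as the doublet total minus Var1, the two counts should add up to the total number of non-NA measurements within each (Manufacture_Date, Country_Code) pair.

check <- df2.long[, .(total = sum(!is.na(value))), by = .(Manufacture_Date, Country_Code)]
df2_N[check, on = c("Manufacture_Date", "Country_Code")][, all(Var1 + Var2 == total)]  # should be TRUE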
Edward