I am very new to R, but have a huge data set to work on and create certain variables. So far it has been going well, but I can't seem to find the right code for this problem. My data set looks something like this (only in way longer):
company product_type product_id release_year
A games 123 2001
A clothing 122 2002
B games 112 2000
C interior 124 2001
... ... ... ...
data1 <- structure(list(company = c("A", "A", "A", "B", "B", "A", "A", "A", "A", "C"),
product_type = c("Clothing", "Books", "Music", "Interior", "Games",
"Appliances", "Appliances", "Books", "Books", "Books"),
product_id = c(1064, 1636, 2064, 486, 646, 2159, 2195, 1528, 1538, 3611),
release_year = c(2007, 2009, 2007, 2008, 2007, 2002, 2004, 2002, 2004, 2004)),
row.names = c(NA, 10L), class = "data.frame")
I would like to create a new variable that, for every company and every year, tells me the product concentration (PC) The first step would be to find out the product type share every company has. I did it like this (I have split my data up into data frames for each company to make it easier)
games_2000 <- sum(data$product_type == "games" & data$release_year == "2000")
A_games_2000 <- nrow(A[A$product_type == "games" & data$release_year == 2000,])
A_share_games_2000 <- (as.numeric(A_games_2000)/as.numeric(games_2000))*100
I then did this for every genre and year and on each company dataframe, summing the previous years together as the products will be on the shelf for longer than a year.
Now I want to create the concentration within each company and used the following code
A$PC_2000 <- (as.numeric(A_share_games_2000)*as.numeric(A_share_games_2000)) + (as.numeric(A_share_clothing_2000)*as.numeric(A_share_clothing_2000)) + (as.numeric(A_share_interior_2000)*as.numeric(A_share_interior_2000))
A$PC <- rowMeans(A[, c("PC_2002", "PC_2003", "PC_2004", "PC_2005", "PC_2006", "PC_2007", "PC_2008", "PC_2009", "PC_2010")], na.rm=TRUE)
As you can see, this is a very long way to do it and it ended up giving me NaN for the HHI variable, which I cannot seem to fix.
I am sure that there must be a way more elegant way to put this whole formula into one function and loop it through, but I could not make it work, especially the summation of the years (2000 is the first year so it is very straight forward but in the variables for the following years it will still take the previous years into account and therefore the code will get longer with each row).