0

I am very new to R, but have a huge data set to work on and create certain variables. So far it has been going well, but I can't seem to find the right code for this problem. My data set looks something like this (only in way longer):

company product_type product_id release_year  
A       games        123        2001
A       clothing     122        2002
B       games        112        2000
C       interior     124        2001
...     ...          ...        ...
data1 <- structure(list(company = c("A", "A", "A", "B", "B", "A", "A", "A", "A", "C"), 
               product_type = c("Clothing", "Books", "Music", "Interior", "Games", 
                                "Appliances", "Appliances", "Books", "Books", "Books"), 
               product_id = c(1064, 1636, 2064, 486, 646, 2159, 2195, 1528, 1538, 3611), 
               release_year = c(2007, 2009, 2007, 2008, 2007, 2002, 2004, 2002, 2004, 2004)), 
          row.names = c(NA, 10L), class = "data.frame")

I would like to create a new variable that, for every company and every year, tells me the product concentration (PC) The first step would be to find out the product type share every company has. I did it like this (I have split my data up into data frames for each company to make it easier)

games_2000 <- sum(data$product_type == "games" & data$release_year == "2000")
A_games_2000 <- nrow(A[A$product_type == "games" & data$release_year == 2000,])
A_share_games_2000 <- (as.numeric(A_games_2000)/as.numeric(games_2000))*100

I then did this for every genre and year and on each company dataframe, summing the previous years together as the products will be on the shelf for longer than a year.

Now I want to create the concentration within each company and used the following code

A$PC_2000 <- (as.numeric(A_share_games_2000)*as.numeric(A_share_games_2000)) + (as.numeric(A_share_clothing_2000)*as.numeric(A_share_clothing_2000)) + (as.numeric(A_share_interior_2000)*as.numeric(A_share_interior_2000)) 

A$PC <- rowMeans(A[, c("PC_2002", "PC_2003", "PC_2004", "PC_2005", "PC_2006", "PC_2007", "PC_2008", "PC_2009", "PC_2010")], na.rm=TRUE)

As you can see, this is a very long way to do it and it ended up giving me NaN for the HHI variable, which I cannot seem to fix.

I am sure that there must be a way more elegant way to put this whole formula into one function and loop it through, but I could not make it work, especially the summation of the years (2000 is the first year so it is very straight forward but in the variables for the following years it will still take the previous years into account and therefore the code will get longer with each row).

Jule
  • 1
  • 1
  • Hey, Jule. If you could provide some reproducible data (```dput()``` or sample data), I would be grateful to help you. How to have your question 99% asked, see here - https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – rg4s Aug 26 '20 at 08:15
  • Hey, thanks for your fast reply! I did create the reproducible data and added it into my original question, however the code I used on it will probably be too long to put it in here... – Jule Aug 26 '20 at 09:26
  • https://drive.google.com/file/d/1oSEZNHPQ90ePyM1K4-eR4sY1cD0d9Qdl/view?usp=sharing here is a link to the sample R Script, if you would like to look at it in detail – Jule Aug 26 '20 at 09:33
  • Thank you, Jule. But, to be honest, I did not get the notion of what you want. Can you, please, clarify your question? For example, why do you divide the number of Games, released in 2000, by number of rows of A company? May be it is better to divide Games released in 2000 by everything else released in 2000 to discover the share? I don't know the ultimate goal. However, I would highly recommend you to familiarize with ```tidyverse``` library documentation. Functions from this lib will make your code neat and tidy. – rg4s Aug 26 '20 at 10:26
  • Can you show expected output for the data shared (`data1`) ? – Ronak Shah Aug 26 '20 at 11:16
  • What I basically want to do with the dividing is to receive the market share but only regarding the specific product types. For example, company A has 50% of the Clothing market. Afterwards I want to do a formula similar to the HHI, summing all the squared market shares of each company so I know how concentrated their product range is. – Jule Aug 26 '20 at 15:10
  • The expected output would look like the data above it, but with 10 rows instead of 4, if that is what you mean? – Jule Aug 26 '20 at 15:11

0 Answers0