
I am currently working on an Amazon dataset with many rows, which makes it hard to spot issues in the data. My goal is to see whether certain products have a higher variance in star ratings than others. I have a variable indicating the product ID (asin) and a variable indicating the star rating (overall), and I want to create a variance variable.

I have thus used dplyr's group_by function in combination with mutate. Even though none of the input variables contain NAs, my output variable does. I have looked for a solution, but only found advice on what to do when the input has NAs.

See my code attached:

library(dplyr)

any(is.na(data$asin))
# [1] FALSE
any(is.na(data$overall))
# [1] FALSE

# create a variable holding the variance of the ratings, grouped by product ID (asin)
data <- data %>% 
  group_by(asin) %>% 
  mutate(ProductVariance = var(overall))

any(is.na(data$ProductVariance))
# [1] TRUE
sum(is.na(data$ProductVariance))
# [1] 289

I would much appreciate your help! Even though the number of NAs is small relative to the number of reviews, I would still like to get accurate means (NAs hinder the use of tapply) and be as precise as possible in follow-up analyses.

Thank you in advance!

Yvonne
    Generally, it would be ideal to include some of your data in your question, so that your code and results are reproducible. You can do that with dput: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – desval May 11 '20 at 15:24

2 Answers


var returns NA if the input has length one, because the sample variance divides by n - 1. So any ASIN that appears only once in your data will have an NA variance. Depending on what you're doing with it, you may find it convenient to change those NAs to 0s:

var(1)
# [1] NA

... 
mutate(ProductVariance = coalesce(var(overall), 0))
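
To check whether single-review products really account for the NAs, it can help to put the group size next to the variance. The following is a minimal sketch on made-up data: the `reviews` data frame and the columns `n_reviews` and `ProductVarianceFilled` are hypothetical names used only for illustration, not part of the original data.

```r
library(dplyr)

# Hypothetical toy data: product "B" has exactly one review
reviews <- data.frame(
  asin    = c("A", "A", "A", "B", "C", "C"),
  overall = c(5, 3, 4, 2, 1, 5)
)

result <- reviews %>%
  group_by(asin) %>%
  mutate(
    n_reviews             = n(),                        # number of reviews per product
    ProductVariance       = var(overall),               # NA when the group has one row
    ProductVarianceFilled = coalesce(var(overall), 0)   # same, with NA replaced by 0
  ) %>%
  ungroup()

# Rows with an NA variance should all come from products with a single review
result %>% filter(is.na(ProductVariance))
```

If every row returned by the last line has `n_reviews == 1`, the NAs come from singleton groups rather than from missing input. Whether 0 is a sensible replacement depends on the follow-up analysis; keeping the NA and using `na.rm = TRUE` when averaging is another option.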
Gregor Thomas

Is it possible that what you're seeing is that "empty" groups are not showing up? You can change the default with the .drop argument of group_by.
When .drop = TRUE (the default), empty groups are dropped.
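
As a rough illustration of what .drop does, here is a small sketch on made-up data. It assumes asin is stored as a factor with an unused level, and the `reviews` data frame is hypothetical. Note that empty groups contain no rows, so this mainly affects summaries per group, not the row-wise output of mutate.

```r
library(dplyr)

# Hypothetical toy data: "C" is a factor level with no reviews
reviews <- data.frame(
  asin    = factor(c("A", "A", "B"), levels = c("A", "B", "C")),
  overall = c(5, 3, 4)
)

# Default (.drop = TRUE): the empty level "C" does not appear
dropped <- reviews %>%
  group_by(asin) %>%
  summarise(n = n())

# .drop = FALSE: "C" is kept as a group with zero rows
kept <- reviews %>%
  group_by(asin, .drop = FALSE) %>%
  summarise(n = n())
```

Here `dropped` has one row per product that actually occurs, while `kept` also contains a row for "C" with a count of 0.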

Chuck P