How to calculate the p-value of each column in a data frame with NA values using shapiro.test in R?

Question

This is what I have tried so far. It works, but it only gives me a p-value for the columns that have no NAs. Much of my data has NA values in a few places, up to a third of the data.

normal <- apply(cor_phys, 2, function(x) shapiro.test(x)$p.value)

I tried adding na.rm to the function, but it's not working. Help?

`normal <- apply(cor_phys, 2, function(x) shapiro.test(x[!is.na(x)])$p.value)` There is no argument you can pass to `shapiro.test` to remove `NA` values; you just need to subset the vector itself to exclude `NA`, then supply that to `shapiro.test`. — caldwellst, Feb 27 '20 at 19:26
@caldwellst `shapiro.test` has this line `x <- sort(x[complete.cases(x)])` which removes NAs so there must be something else causing the problem — rawr, Feb 27 '20 at 19:36
Ah, nice catch. Could be coming from this in the documentation of `shapiro.test` then. "Missing values are allowed, but the number of non-missing values must be between 3 and 5000." — caldwellst, Feb 27 '20 at 19:40
After adding in the `[!is.na(x)]`, the output is still the same. Any columns with even one `NA` are omitted in the output. — Madison, Feb 27 '20 at 19:46
@MadisonPope can you add the output of `dput(head(cor_phys))` to your question — rawr, Feb 27 '20 at 19:47
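
The points raised in the comments above can be put together in a small sketch. It is not the asker's data or a confirmed fix: `toy` and the helper `shapiro_p` are hypothetical names made up for illustration. It drops `NA`s explicitly, returns `NA` instead of failing when a column has fewer than 3 or more than 5000 non-missing values (the limit quoted from the documentation) or when all remaining values are identical, and uses `sapply` rather than `apply`, since `apply` first converts the data frame to a matrix, which can change column types if any column is non-numeric.

```r
# Toy data standing in for cor_phys (hypothetical values)
set.seed(42)
toy <- data.frame(
  a = rnorm(20),
  b = c(rnorm(14), rep(NA, 6)),   # roughly a third missing
  c = c(1.5, 2.5, rep(NA, 18))    # only two non-missing values
)

# Hypothetical helper: Shapiro-Wilk p-value, or NA when the test cannot run
shapiro_p <- function(x) {
  x <- x[!is.na(x)]                       # drop missing values explicitly
  if (length(x) < 3 || length(x) > 5000 || length(unique(x)) < 2) {
    return(NA_real_)                      # shapiro.test() would stop with an error here
  }
  shapiro.test(x)$p.value
}

# One p-value per column; sapply keeps each column's own type
normal <- sapply(toy, shapiro_p)
normal
```

With this shape, a column that cannot be tested shows up as `NA` in the result instead of disappearing or stopping the whole call, which makes it easier to see which columns are affected.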

score 0 · Accepted Answer · answered Mar 02 '20 at 17:50

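# packages providing %>%, mutate, rownames_to_column, gather and map_chr
library(dplyr)
library(tibble)
library(tidyr)
library(purrr)

# calculate the correlations between all variables
corres <- cor_phys %>%                  # cor_phys is my data
  as.matrix %>%
  cor(use = "complete.obs") %>%         # complete.obs drops every row that contains an NA
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1)            # long format: one row per (var1, var2) pair

# remove duplicate correlations (keep only one of a-b and b-a)
corres <- corres %>%
  mutate(var_order = paste(var1, var2) %>%
           strsplit(split = ' ') %>%
           map_chr(~ sort(.x) %>%
                     paste(collapse = ' '))) %>%   # order-independent key for each pair
  mutate(cnt = 1) %>%
  group_by(var_order) %>%
  mutate(cumsum = cumsum(cnt)) %>%
  filter(cumsum != 2) %>%                          # drop the second occurrence of each pair
  ungroup %>%
  select(-var_order, -cnt, -cumsum)                # remove the helper columns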

I did not write this myself, but it is the answer I used, and it worked for my needs. It comes from the page "How to compute correlations between all columns in R and detect highly correlated variables".
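
As a side note on the `use = "complete.obs"` setting in the code above, here is a minimal sketch of how it differs from `use = "pairwise.complete.obs"` when `NA`s are scattered across columns. The data frame `toy` is hypothetical and only serves as an illustration; which option is appropriate depends on the data.

```r
# Toy data (hypothetical), with NAs in different rows of different columns
set.seed(1)
toy <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
toy$a[1:3]  <- NA
toy$c[8:10] <- NA

# Casewise deletion: only rows 4 to 7 are complete, so every correlation uses 4 rows
cor_complete <- cor(toy, use = "complete.obs")

# Pairwise deletion: each pair of columns uses every row where both values are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_complete["b", "c"]  # based on rows 4 to 7
cor_pairwise["b", "c"]  # based on rows 1 to 7
```

When up to a third of the values are missing, casewise deletion can leave very few rows, so it is worth checking how many complete rows remain before relying on the result.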
