I have a data frame that I want to run some statistical tests on. However, I want to group the data based on one of the columns first.

Here's an example data frame:

CATEGORY   ITEM     SHOP1 STOCK   SHOP2 STOCK
 Fruit    Orange         5             9
 Fruit    Apple         12            32
 Fruit     Pear         17             6
  Veg    Carrots        59            72
  Veg    Potatoes        6            57
  Veg   Courgette       43            22
  Veg    Parsnips        5             9
  ...      ...         ...           ...

For this example, I want to run a chi-squared test across categories rather than items, so I want to reduce the data to a table like this:

          SHOP1 SHOP2
   FRUIT    34    47
     VEG   113   160

Where the table shows the sum of the stock for each category for each shop (this is a very simplified version - the data that I have runs to 37 categories over a few hundred rows), and no longer specifies the item, just the category.

So I thought I could group_by(CATEGORY) and then run the chi squared test on the grouped data, but that doesn't seem to work. I think I need to add up the two columns with numbers in, but I don't know how to do that in conjunction with the chi squared testing. I've been faffing with this for some time now with no luck, so I'd really appreciate your help!

Rose

2 Answers

In the future, it would be helpful if you wrote the code that wasn't working and its output. From what I understand, you are trying to create that table based on the data frame. Is that correct?

This has already been answered pretty well by a previous post: How to sum a variable by group?

From that post, it seems the answer would be:

df %>% group_by(CATEGORY) %>% summarise(SHOP1 = sum(SHOP1), SHOP2 = sum(SHOP2))
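
To check that this gives the table in the question, here is a small self-contained sketch using only the seven rows shown there. The column names SHOP1 and SHOP2 are an assumption, since the question prints the headers as "SHOP1 STOCK" and "SHOP2 STOCK"; adjust them to match the real data frame.

```r
library(dplyr)

# Example rows copied from the question.
# The column names SHOP1 and SHOP2 are assumed, not confirmed by the asker.
df <- data.frame(
  CATEGORY = c("Fruit", "Fruit", "Fruit", "Veg", "Veg", "Veg", "Veg"),
  ITEM     = c("Orange", "Apple", "Pear", "Carrots", "Potatoes",
               "Courgette", "Parsnips"),
  SHOP1    = c(5, 12, 17, 59, 6, 43, 5),
  SHOP2    = c(9, 32, 6, 72, 57, 22, 9)
)

# One row per category, with the stock summed within each shop column.
totals <- df %>%
  group_by(CATEGORY) %>%
  summarise(SHOP1 = sum(SHOP1), SHOP2 = sum(SHOP2))

print(totals)
```

The result still carries the CATEGORY column alongside the two numeric columns, which matters if it is passed straight to a test that expects counts only.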

Community
  • Thanks for your response. I was attempting to make the table based on the dataframe, and to then run chi squared on it. The answer in the link you gave me made the table, but running chi squared on the table then gave the error `all entries of 'x' must be nonnegative and finite`. – Rose Sep 29 '16 at 06:54
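
The comment does not show the call that failed, so this is only a guess at the cause: if the summarised table is passed to chisq.test with the CATEGORY column still in it, the input is no longer purely numeric, and that can trigger this kind of error. A minimal sketch, assuming the summarised table looks like the one in the question, keeps only the count columns before testing:

```r
# Summarised table as in the question (values taken from the question's example).
totals <- data.frame(
  CATEGORY = c("Fruit", "Veg"),
  SHOP1    = c(34, 113),
  SHOP2    = c(47, 160)
)

# Keep only the numeric count columns; use the category as row names instead.
counts <- as.matrix(totals[, c("SHOP1", "SHOP2")])
rownames(counts) <- totals$CATEGORY

# chisq.test now receives a numeric contingency table.
result <- chisq.test(counts)
print(result)
```

If the error persists on the real data, it may be worth checking the count columns for NA or negative values, since the message names exactly those conditions.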

We can use dplyr to summarise the data and the tidy function from the broom package to return the results of chisq.test in a data frame:

library(broom)
library(dplyr)

df %>% group_by(CATEGORY) %>%
  summarise_at(vars(matches("SHOP")), sum) %>%
  do(tidy(chisq.test(.[, grep("SHOP", names(.))])))

     statistic p.value parameter                                                       method
1 2.566931e-30       1         1 Pearson's Chi-squared test with Yates' continuity correction
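
In dplyr 1.0 and later, summarise_at() and do() are superseded, so the same idea can also be written with across(). This is a sketch rather than a drop-in replacement: it rebuilds the question's example rows, and the column names SHOP1 and SHOP2 are assumed.

```r
library(dplyr)
library(broom)

# Example rows from the question; column names SHOP1 and SHOP2 are assumed.
df <- data.frame(
  CATEGORY = c("Fruit", "Fruit", "Fruit", "Veg", "Veg", "Veg", "Veg"),
  SHOP1    = c(5, 12, 17, 59, 6, 43, 5),
  SHOP2    = c(9, 32, 6, 72, 57, 22, 9)
)

# Sum every column whose name starts with "SHOP" within each category.
totals <- df %>%
  group_by(CATEGORY) %>%
  summarise(across(starts_with("SHOP"), sum))

# Drop CATEGORY so only counts reach chisq.test, then tidy the result.
counts <- as.matrix(select(totals, starts_with("SHOP")))
res <- tidy(chisq.test(counts))

print(res)
```

Splitting the summary and the test into two steps also leaves the totals table available for inspection before the test is run.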
eipi10