I have a data frame that I want to run some statistical tests on. However, I want to group the data based on one of the columns first.
Here's an example data frame:
CATEGORY  ITEM       SHOP1 STOCK  SHOP2 STOCK
Fruit     Orange               5            9
Fruit     Apple               12           32
Fruit     Pear                17            6
Veg       Carrots             59           72
Veg       Potatoes             6           57
Veg       Courgette           43           22
Veg       Parsnips             5            9
...       ...                ...          ...
So for this example, I want to run a chi-squared test, but across categories rather than individual items, so I want to reduce the data to a table like this:
       SHOP1  SHOP2
FRUIT     34     47
VEG      113    160
The table shows the sum of the stock for each category in each shop, and no longer specifies the item, just the category. (This is a very simplified version: the data that I have runs to 37 categories over a few hundred rows.)
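To make the target concrete, here is a minimal base R sketch of the reduction I have in mind. The data frame name `stock` and the shortened column names `SHOP1` and `SHOP2` are my own stand-ins for this example, not the names in my real data, and I have only typed in the seven rows shown above.

```r
# Toy data copied from the example table; `stock`, SHOP1 and SHOP2 are
# illustrative names standing in for the real data frame and its columns.
stock <- data.frame(
  CATEGORY = c("Fruit", "Fruit", "Fruit", "Veg", "Veg", "Veg", "Veg"),
  ITEM     = c("Orange", "Apple", "Pear", "Carrots",
               "Potatoes", "Courgette", "Parsnips"),
  SHOP1    = c(5, 12, 17, 59, 6, 43, 5),
  SHOP2    = c(9, 32, 6, 72, 57, 22, 9)
)

# Sum both stock columns within each category; ITEM is dropped because it
# does not appear in the formula.
totals <- aggregate(cbind(SHOP1, SHOP2) ~ CATEGORY, data = stock, FUN = sum)

# Fruit should come out as 34 / 47 and Veg as 113 / 160, as in the table above.
print(totals)
```

I believe the dplyr route would be group_by(CATEGORY) followed by summarise() with a sum() for each shop column, but I have not got that working together with the test.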
So I thought I could use group_by(CATEGORY) and then run the chi-squared test on the grouped data, but that doesn't seem to work. I think I need to sum the two numeric columns within each category, but I don't know how to do that in conjunction with the chi-squared test. I've been faffing with this for some time now with no luck, so I'd really appreciate your help!
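For reference, this is roughly the test I am hoping to end up running, sketched on the two-row summary table from my example. The matrix is typed in by hand here, and the name `counts` is just for illustration; with my real data it would have 37 rows. My understanding is that chisq.test() wants a plain matrix (or table) of counts rather than a grouped data frame, which may be where I am going wrong.

```r
# Category-by-shop totals taken from the summary table in the question,
# entered by hand as a matrix of counts (illustrative only).
counts <- matrix(
  c(34, 47,
    113, 160),
  nrow = 2, byrow = TRUE,
  dimnames = list(CATEGORY = c("FRUIT", "VEG"),
                  SHOP     = c("SHOP1", "SHOP2"))
)

# Chi-squared test of independence between category and shop.
# For a 2 x 2 table R applies Yates' continuity correction by default.
result <- chisq.test(counts)

print(result)
print(result$expected)  # expected counts under independence
```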