
I am working with a large dataset (10 million+ cases) where each case represents a shop's monthly transactions for a given product (there are 17 products). As such, each shop is potentially represented across 204 cases (12 months * 17 products; note that not all shops sell all 17 products throughout the year).

I need to restructure the data so that there is one case per shop for each product. This would result in each shop being represented by at most 17 cases.

Ideally, I would like to create the mean value of the transactions over the 12 months.

To be more specific, the dataset currently has 5 variables:

  • Shop Location — A unique 6 digit sequence
  • Month — 2013_MM (data is only from 2013)
  • Number of Units sold
  • Total Profit (£)
  • Product Type - 17 Different product types (this is a String Variable)

I am working in R. It would be ideal to save this restructured dataset into a data frame.

I'm thinking an if/for loop could work, but I'm unsure how to get this to work.

Any suggestions or ideas are greatly appreciated. If you need further information, please just ask!

Kind regards,

R

RMAkh
    Please provide a **minimal, self contained example**. Check these links for general ideas, and how to do it in R: [**here**](http://stackoverflow.com/help/mcve), [**here**](http://www.sscce.org/), [**here**](http://adv-r.had.co.nz/Reproducibility.html), and [**here**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). Please also show us the [**code you have tried**](http://mattgemmell.com/what-have-you-tried/) and explain why it didn't meet your needs. – Henrik Jan 22 '15 at 14:31
  • To add to Henrik's comment, insert the output of `dput(head(YOUR_DATA_SET))` into your question. It sounds like you're just looking to perform a simple `group_by` – maloneypatr Jan 22 '15 at 14:58

1 Answer


There really wasn't much here to work with, but this is what my interpretation leads to: you're looking to summarise your data set, grouped by `shop_location` and `product_type`.

# install.packages('dplyr')
library(dplyr)

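# xxx is a placeholder: assign your own data frame here
your_data_set <- xxx

# One row per shop and product, averaged over that pair's monthly records
your_data_set %>%
  group_by(shop_location, product_type) %>%
  summarise(profit = sum(total_profit),   # total profit over all months
            count = n(),                  # number of monthly records
            avg_profit = profit/count)    # mean monthly profit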
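
Since the question did not include sample data, here is a minimal sketch of the same idea on a made-up data frame. The column names (`shop_location`, `month`, `product_type`, `units_sold`, `total_profit`), the product labels and all of the values are assumptions for illustration only, so rename them to match your real data. It uses `mean()` directly and stores the result as a plain data frame, as asked for in the question.

```r
library(dplyr)

# Hypothetical toy data: 2 shops x 2 products x 3 months (all values made up)
toy_sales <- data.frame(
  shop_location = rep(c("100001", "100002"), each = 6),
  month         = rep(c("2013_01", "2013_02", "2013_03"), times = 4),
  product_type  = rep(rep(c("A", "B"), each = 3), times = 2),
  units_sold    = c(10, 12, 14, 5, 5, 5, 20, 22, 24, 1, 2, 3),
  total_profit  = c(100, 120, 140, 50, 50, 50, 200, 220, 240, 10, 20, 30),
  stringsAsFactors = FALSE
)

# Collapse the monthly rows into one row per shop and product
shop_product_means <- toy_sales %>%
  group_by(shop_location, product_type) %>%
  summarise(
    months_observed = n(),                # how many monthly rows were averaged
    mean_units      = mean(units_sold),
    mean_profit     = mean(total_profit)
  ) %>%
  ungroup() %>%
  as.data.frame()                         # plain data frame rather than a tibble

print(shop_product_means)
```

Note that the mean here is taken over the months that actually appear for a shop and product. If a missing month should count as zero sales, you would need to divide by 12 instead, which is a different calculation.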
maloneypatr
  • This looks spot on, thanks. I'm not at my computer right now to try it, but I was wondering what you mean by the "%>%" operators? – RMAkh Jan 22 '15 at 17:53
  • It's a chaining operator in the `dplyr` library. Basically, it allows you to write one continuous chain of function calls as opposed to declaring variables at every step along the way. Take a look here: http://cran.r-project.org/web/packages/dplyr/dplyr.pdf – maloneypatr Jan 22 '15 at 18:54
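
To illustrate the last comment, here is a small sketch comparing the piped form with ordinary nested calls. The `%>%` operator comes from the `magrittr` package and is re-exported by `dplyr`; `x %>% f(y)` is evaluated as `f(x, y)`. The data frame `toy` and its values are made up for this example.

```r
library(dplyr)

# Hypothetical toy data (values made up)
toy <- data.frame(
  shop_location = c("100001", "100001", "100002"),
  total_profit  = c(100, 140, 30),
  stringsAsFactors = FALSE
)

# Without the pipe: nested calls, read from the inside out
nested <- summarise(group_by(toy, shop_location),
                    avg_profit = mean(total_profit))

# With the pipe: the same two steps, read from top to bottom
piped <- toy %>%
  group_by(shop_location) %>%
  summarise(avg_profit = mean(total_profit))

# Both forms should produce the same result
all.equal(as.data.frame(nested), as.data.frame(piped))
```

The two versions do the same work; the pipe only changes how the code reads, which tends to matter more as the number of steps grows.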