1

Suppose I have this data frame df

df <- data.frame("City" = c("Boston", "Boston","Boston", "Boston","Boston", "Boston",
                        "Boston", "Boston", "Boston", "New York", "New York", "New York" ),
             "Store_ID" = c("00002", "00002", "00002", "00002", "00004", "00004",
                            "00004", "00004", "00004", "00011", "00011", "00011"),
             "Customer_ID" = c("10001", "10001", "10001", "23847", "17823", "17823",
                               "17823", "17823", "17823", "24232", "24232", "27381"),
             "Product_ID" = c ("00013", "00013", "00058", "00013", "00899", "00847",
                               "00065", "00065", "00065", "00096", "00085", "00175"),
             "Payment" = c("Cash", "Cash", "Cash", "Card", "Card", "Card", "Card", 
                           "Card", "Card", "Card", "Card", "Cash"))

Let's say I want to know how many products were sold in each city; then I'd use this code

df2 <- df %>% group_by(City) %>% summarise(Quantity = (n))

Or if I want to know the quantity of products sold in each store I can expand the previous code, such as:

df2 <- df %>% group_by(City, Store_ID) %>% summarise(Quantity = (n))

However, this further splits the data frame and now I cannot see total number of products sold in each city. Is it possible to create a new data frame that contains counts of different groups but is only grouped by a more encompassing variable such as City or Store.

An example output that I'm looking for Store00002 only would be like this:

Store  Total_Sales   Customer10001_purchases    Customer23847_purchases   Cash% (ratio of items paid in cash)
00002           4                         3                          1      0.75

Is it possible to do this through dplyr? I'm also open to any other suggestions. Really appreciate the assistance!

memokerobi
  • 143
  • 10

2 Answers2

4

If you want to keep the original data and just add the sums, use mutate instead of summarise. So it might look something like:

df1 <- df %>% 
  group_by(City) %>% 
  mutate(sales_city = n())%>% 
  ungroup() %>% 
  group_by(City, Store_ID) %>% 
  mutate(sales_store = n()) %>% 
  ungroup() %>% 
  group_by(City,Store_ID, Customer_ID) %>% 
  mutate(sales_by_customer = n()) %>% 
  ungroup() %>% 
  select(-Product_ID, -Payment) %>% 
  distinct()

Note that I did use your examples of number of products sold by city and store. However, your final dataframe had some more columns so I added one extra example of how it might look. Whether these (the n() everywhere) are the exact statistics you want to have is something for you to decide.

This will give the following result:

  City     Store_ID Customer_ID sales_city sales_store sales_by_customer
1 Boston   00002    10001                9           4                 3
2 Boston   00002    23847                9           4                 1
3 Boston   00004    17823                9           5                 5
4 New York 00011    24232                3           3                 2
5 New York 00011    27381                3           3                 1

I do not think you need the ungroup every time, as your as doing a similar grouping. But it always feels safer to just include it.

Annet
  • 846
  • 3
  • 14
0

Your required output has three sets of information:

  1. Total sales,
  2. Individual customer purchases (in wide format),
  3. Proportion (ratio?) of cash sales.

I believe this requires three separate "queries" on the data, with the results joined on Store_ID (or City if you like, but my code does it for Store_ID).

A small problem is that different stores, and therefore different cities, have different customers, so you can't really have a tidy output like you show for all stores. You can, but there will be lots of NAs.

I'll do this one step at a time:


SALES <- df %>% 
  group_by(Store_ID) %>%
  summarise(Total_sales = n())

Edit: New query to include Stores who had no cash sales:

CASH <- df %>%
  group_by(Store_ID, Payment) %>%
  summarise(n = n()) %>%
  pivot_wider(names_from=Payment, values_from=n, 
              values_fill = list(n = 0)) %>%
  mutate(`Cash%` = Cash/(Card+Cash)) %>%
  select(-Card, -Cash)

Old query which ignores stores with no Cash sales:

CASH_wrong <- df %>% 
  group_by(Store_ID, Payment) %>%
  summarise(n = n()) %>%
  mutate(`Cash%` = n/sum(n)) %>%
  filter(Payment=="Cash") %>%
  select(-n, -Payment)

CUSTOMERS <- df %>% 
  group_by(Store_ID, Customer_ID) %>%
  summarise(Total_sales = n()) %>%
  pivot_wider(names_from = Customer_ID, values_from=Total_sales, names_prefix="customer")

Join all 3 on Store_ID to get your desired output.

inner_join(SALES, CUSTOMERS, by="Store_ID") %>%
  inner_join(CASH, by="Store_ID") %>%
  filter(Store_ID=="00002") %>%  # Choose the store here.
  select_if(!is.na(.))

# A tibble: 1 x 6
  Store_ID Total_sales customer10001 customer23847 `Cash%`
  <fct>          <int>         <int>         <int>   <dbl>
1 00002              4             3             1    0.75

inner_join(SALES, CUSTOMERS, by="Store_ID") %>%
  inner_join(CASH, by="Store_ID") %>%
  filter(Store_ID=="00004") %>%  # Choose the store here.
  select_if(!is.na(.))

# A tibble: 1 x 4
  Store_ID Total_sales customer17823 `Cash%`
  <fct>          <int>         <int>   <dbl>
1 00004              5             5       0
Edward
  • 10,360
  • 2
  • 11
  • 26