-3

This is the basically same problem I had in Excel a few days ago (Excel - find nth largest value based on criteria), but this time in R (the data set contains half a million entries and that is more than Excel seems to be able to handle).

I have a table that looks like this that I have imported from Excel:

Country Region      Code Product name Year Value
Sweden  Stockholm   123  Apple        1991 244   
Sweden  Kirruna     123  Apple        1987 100
Japan   Kyoto       543  Pie          1987 544
Denmark Copenhagen  123  Apple        1998 787
Denmark Copenhagen  123  Apple        1987 100
Denmark Copenhagen  543  Pie          1991 320
Denmark Copenhagen  126  Candy        1999 200
Sweden  Gothenburg  126  Candy        2013 300
Sweden  Gothenburg  157  Tomato       1987 150
Sweden  Stockholm   125  Juice        1987 250
Sweden  Kirruna     187  Banana       1998 310
Japan   Kyoto       198  Ham          1987 157
Japan   Kyoto       125  Juice        1987 550
Japan   Tokyo       125  Juice        1991 100

What I want to do is to make a code that can give me the sum of the nth largest value of products that have been sold in a specific country. For instance, the most sold product in Sweden is Apple so I want to code to find that apple is the most sold product (in total, which is what I am interested in) and then summaries all the values of the sold apples in the country Sweden, 344.

I also want to be able to find the nth largest value based on both country and year. That is, if I am looking for the most sold product in Sweden in the year 2013, it should return the product Candy and the value 300.

KGB91
  • 630
  • 2
  • 6
  • 24
  • 2
    Have you tried anything for yourself so far? – LAP Sep 17 '18 at 07:26
  • 2
    This seems to be two separate questions. [Please ask only one question per post](https://meta.stackexchange.com/questions/222735/can-i-ask-only-one-question-per-post) – Ronak Shah Sep 17 '18 at 07:26
  • Could you provide a bigger data example? Right now the data does not include enough rows to select the nth most sold, as there are at max 2 per country. – LAP Sep 17 '18 at 07:37
  • @KGB91 The most sold product in `Sweden` is `Candy`. Please find my answer. I think, your example is not clear. – Sal-laS Sep 17 '18 at 08:06
  • I will check your answer, thanks! But no, `Apple` is the most sold product in `Sweden`. `344` vs `300`. – KGB91 Sep 17 '18 at 08:11

1 Answers1

2

Solution for your first question (find most sold product per country, summarise value for this product) using dplyr:

library(tidyverse)

df %>%
  group_by(Country, Product_name) %>%
  summarise(sum_value = sum(Value, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(Country) %>%
  filter(sum_value == max(sum_value))

# A tibble: 3 x 3
# Groups:   Country [3]
  Country Product_name sum_value
   <fctr>       <fctr>     <int>
1 Denmark        Apple       887
2   Japan        Juice       650
3  Sweden        Apple       344

Solution for second question (show nth most sold products per country and year, sum value):

df %>%
  group_by(Country, Product_name, Year) %>%
  summarise(sum_value = sum(Value, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(Country, Year) %>%
  arrange(desc(sum_value), .by_group = TRUE) %>%
  slice(., 1:2)

Had to change the data a bit to get a decent output, so here's the output with all years set to 1987 (change the 2 in the 1:2 within the last row for a different n):

# A tibble: 6 x 4
# Groups:   Country, Year [3]
  Country Product_name  Year sum_value
   <fctr>       <fctr> <int>     <int>
1 Denmark        Apple  1987       887
2 Denmark          Pie  1987       320
3   Japan        Juice  1987       650
4   Japan          Pie  1987       544
5  Sweden        Apple  1987       344
6  Sweden       Banana  1987       310
LAP
  • 6,605
  • 2
  • 15
  • 28
  • It returns: ´Error in df %>% group_by(Country, Product_name) %>% summarise(sum_value = sum(Value)) %>% : could not find function "%>%"´ (I tried to install the package) – KGB91 Sep 17 '18 at 07:53
  • Sorry, install the package `tidyverse`. I'll edit it in. Working on the second question right now. – LAP Sep 17 '18 at 07:54
  • I ran ´install.packages("tidyverse")´ and the your code. It still says: Error in library("tidyverse") : there is no package called ‘tidyverse’ Error in df %>% group_by(Country, Product_name) %>% summarise(sum_value = sum(Value)) %>% : could not find function "%>%" Thanks a lot again for all your effort! – KGB91 Sep 17 '18 at 07:58
  • Did it successfully install after you ran the `install.packages("tidyverse")`? – LAP Sep 17 '18 at 07:59
  • It only says `Binaries will be installed` and then trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/BH_1.66.0-1.zip' Content type 'application/zip' length 17880019 bytes (17.1 MB) downloaded 17.1 MB – KGB91 Sep 17 '18 at 08:01
  • 1
    Hm, maybe try restarting your R session. The code above (now with solutions for both problems) will work when you get `tidyverse` running. – LAP Sep 17 '18 at 08:04
  • Thanks! Once I get it installed, I'll accept your answer. Did you change the data in Excel but rearranging the order and adding the labels `` and ``? – KGB91 Sep 17 '18 at 08:08
  • The `` and `` is just the column classes which are presented by `tidyverse` (`factor` and `integer`). – LAP Sep 17 '18 at 08:13
  • The package seems to work now. But I get `Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "function"` – KGB91 Sep 17 '18 at 08:43
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/180185/discussion-between-lap-and-kgb91). – LAP Sep 17 '18 at 08:46
  • Solved by LAP's code! Ty everyone for the help :) – KGB91 Sep 17 '18 at 09:25