2

What I'm trying to achieve in R is the following: given a table (data frame in my case) - I want to be get the lowest price for each unique combination of two columns.

For example, given the following table:

+-----+-----------+-------+----------+----------+
| Key | Feature1  | Price | Feature2 | Feature3 |
+-----+-----------+-------+----------+----------+
| AAA |         1 |   100 | whatever | whatever |
| AAA |         1 |   150 | whatever | whatever |
| AAA |         1 |   200 | whatever | whatever |
| AAA |         2 |   110 | whatever | whatever |
| AAA |         2 |   120 | whatever | whatever |
| BBB |         1 |   100 | whatever | whatever |
+-----+-----------+-------+----------+----------+

I want a result that looks like:

+-----+-----------+-------+----------+----------+
| Key | Feature1  | Price | Feature2 | Feature3 |
+-----+-----------+-------+----------+----------+
| AAA |         1 |   100 | whatever | whatever |
| AAA |         2 |   110 | whatever | whatever |
| BBB |         1 |   100 | whatever | whatever |
+-----+-----------+-------+----------+----------+

So I'm working on a solution along the lines of:

s <- lapply(split(data, list(data$Key, data$Feature1)), function(chunk) { 
        chunk[which.min(chunk$Price),]})

But the result is a 1 x n matrix - so I need to unsplit the result. Also - it seems very slow. How can I improve this logic? I've seen solutions pointing in the directions of the data.table package. Should I re-write using that package?

Update

Great answers guys - thanks! However - my original dataframe contains more columns ( Feature2 ... ) and I need them all back after the filtering. The rows that do not have the lowest price ( for the combination of Key/Feature1 ) can be discarded, so I'm not interested in their values for Feature2 / Feature3

Jochen van Wylick
  • 5,303
  • 4
  • 42
  • 64
  • With what logic the values of the other columns are to be taken? If for instance `Feature2` has different values for the same key-feature1, which value must be included in the output? – nicola Jul 10 '15 at 15:40
  • The value belonging to the lowest price. So this thing needs to work as a row filter. So the 'whatever' of AAA-1, AAA-2, BBB-1. Rest of the rows can be discarded. – Jochen van Wylick Jul 10 '15 at 15:43

4 Answers4

3

You can use the dplyr package:

library(dplyr)

data %>% group_by(Key, Feature1) %>%
         slice(which.min(Price))
jeremycg
  • 24,657
  • 5
  • 63
  • 74
  • Great works - but I need to get all columns back in the result. I simplified the example a bit. In reality the data contains more columns, which I need in the result. – Jochen van Wylick Jul 10 '15 at 15:32
3

Since you referred to data.table package, I provide here the solution using that package:

library(data.table)
setDT(df)[,.(Price=min(Price)),.(Key, Feature1)] #initial question
setDT(df)[,.SD[which.min(Price)],.(Key, Feature1)] #updated question

df is your sample data.frame.

Update: Test using mtcars data

df<-mtcars
library(data.table)
setDT(df)[,.SD[which.min(mpg)],by=am]
   am  mpg cyl disp  hp drat   wt  qsec vs gear carb
1:  1 15.0   8  301 335 3.54 3.57 14.60  0    5    8
2:  0 10.4   8  472 205 2.93 5.25 17.98  0    3    4
user227710
  • 3,164
  • 18
  • 35
1

The base R solution would be aggregate(Price ~ Key + Feature1, data, FUN = min)

maccruiskeen
  • 2,748
  • 2
  • 13
  • 23
  • Very elegant - but I need to get all columns back in the result. I simplified the example a bit. In reality the data contains more columns, which I need in the result. – Jochen van Wylick Jul 10 '15 at 15:32
  • Do you mean you want the minimum value back in your original dataframe? If that's the case use `ave(data$Price, data$Key, data$Feature, FUN = min)`. – maccruiskeen Jul 10 '15 at 15:36
  • No - see the updated question - I only want the row with the lowest value ( for the unique combination of Key + Feature1 ) - but with all the original values. I tried your code and it returns only 3 columns: Key, Feature1 and Price - but I also need all the other original columns. – Jochen van Wylick Jul 10 '15 at 15:38
  • Are the data in other columns identical within each combination of Key and Feature1? If not do you want their mean? – maccruiskeen Jul 10 '15 at 15:45
  • It's irrelevant. If the row does not contain the minimum price ( for the combination of Key + Feature1 ) the row can be discareded. It needs to work as a filter on the rows. - thanks for all your effort btw. – Jochen van Wylick Jul 10 '15 at 15:51
  • 1
    Ah I see. Well the `dplyr` solution by jeremycg looks good. A `data.table` one would to do the same would be `setDT(data)[, lapply(.SD, min), by = list(Key, Feature1)]`, provided you did `data <- data.table(data)`. – maccruiskeen Jul 10 '15 at 15:54
  • Sorry for being unclear - I didn't expect my simplification of the problem would result in not giving enough information to come up with a solution. – Jochen van Wylick Jul 10 '15 at 15:57
  • @choff: I already posted a solution from `data.table` package. If you think that is not working, please let me know. – user227710 Jul 10 '15 at 16:05
  • 1
    @user227710 I see your update now. Your solution works. – maccruiskeen Jul 10 '15 at 16:07
0

Using R base aggregate

> aggregate(Price~Key+Feature1, min, data=data)
  Key Feature1 Price
1 AAA        1   100
2 BBB        1   100
3 AAA        2   110

See this post for other alternatives.

Community
  • 1
  • 1
Jilber Urbina
  • 58,147
  • 10
  • 114
  • 138