Selecting a unique value from an R data frame

Question

If I have a table like this:

| FileName | Category| Value | Number |
|:--------:|:-------:|:-----:|:------:|
| File1    | Time    | 123   | 1      |
| File1    | Size    | 456   | 1      |
| File1    | Final   | 789   | 1      |
| File2    | Time    | 312   | 2      |
| File2    | Size    | 645   | 2      |
| File2    | Final   | 978   | 2      |
| File3    | Time    | 741   | 1      |
| File3    | Size    | 852   | 1      |
| File3    | Final   | 963   | 1      |
| File1    | Time    | 369   | 2      |
| File1    | Size    | 258   | 2      |
| File1    | Final   | 147   | 2      |
| File3    | Time    | 741   | 2      |
| File3    | Size    | 734   | 2      |
| File3    | Final   | 942   | 2      |
| File1    | Time    | 997   | 3      |
| File1    | Size    | 245   | 3      |
| File1    | Final   | 985   | 3      |
| File2    | Time    | 645   | 3      |
| File2    | Size    | 285   | 3      |
| File2    | Final   | 735   | 3      |
| File3    | Time    | 198   | 3      |
| File3    | Size    | 165   | 3      |
| File3    | Final   | 753   | 3      |

What means could I use in an R script to declare a variable that is the Value for each FileName where Number is minimum and Category is Time?

(EDIT: It should be noted that there are null entries in the Value column. Therefore, this code should be constructed to treat null entries as though they didn't exist so New Column doesn't end up filled with NA values.)

Then I'd like to merge this to form a new column on the existing table so that it now looks like this:

| FileName | Category | Value | Number | New Column |
|:--------:|:--------:|:-----:|:------:|------------|
| File1    | Time     | 123   | 1      | 123        |
| File1    | Size     | 456   | 1      | 123        |
| File1    | Final    | 789   | 1      | 123        |
| File2    | Time     | 312   | 2      | 312        |
| File2    | Size     | 645   | 2      | 312        |
| File2    | Final    | 978   | 2      | 312        |
| File3    | Time     | 741   | 1      | 741        |
| File3    | Size     | 852   | 1      | 741        |
| File3    | Final    | 963   | 1      | 741        |
| File1    | Time     | 369   | 2      | 369        |
| File1    | Size     | 258   | 2      | 369        |
| File1    | Final    | 147   | 2      | 369        |
| File3    | Time     | 741   | 2      | 741        |
| File3    | Size     | 734   | 2      | 741        |
| File3    | Final    | 942   | 2      | 741        |
| File1    | Time     | 997   | 3      | 997        |
| File1    | Size     | 245   | 3      | 997        |
| File1    | Final    | 985   | 3      | 997        |
| File2    | Time     | 645   | 3      | 645        |
| File2    | Size     | 285   | 3      | 645        |
| File2    | Final    | 735   | 3      | 645        |
| File3    | Time     | 198   | 3      | 198        |
| File3    | Size     | 165   | 3      | 198        |
| File3    | Final    | 753   | 3      | 198        |

Read the basic R documentation. Functions like `subset`, `min` and `==` will be of use in approaching this problem. — Frank, Jul 09 '15 at 16:51
@Frank I'm familiar with those. However, I can't seem to get the logic worked out with all of the different conditions. — JohnN, Jul 09 '15 at 16:53
@BondedDust: I think OP means category Time of a variable Measure. — user227710, Jul 09 '15 at 17:11
Still scratching my head. So Measure column just gets renamed to Time? Or is there some data to merge to? — IRTFM, Jul 09 '15 at 17:15
@BondedDust Measure as the column name is something I forgot to fix when I used Category instead. That was simply an error. — JohnN, Jul 09 '15 at 17:25

Chris · Accepted Answer · 2015-07-10T13:25:21.003

3

Using data.table:

(Edited to reflect @Frank's comments)

DT[, Benchmark := Value[Category == "Time"][which.min(Number[Category == "Time"])], by = FileName]

Breaking this down:

Number[Category == "Time"]

Take all Number where Category == Time

which.min(^^^)

Find which one is the minimum

Benchmark := Value[Category == "Time"][^^^]

Set the new column of benchmark to the value at this minimum

by = FileName

Do this by group

edited Jul 10 '15 at 13:25

answered Jul 09 '15 at 18:48

Chris

6,302
1
27
54

This error message came back: `Error: unexpected symbol in: "Testdf[, Benchmark := Value[which.min(Number[Measure == "Batch Time"])], by = group` – JohnN Jul 09 '15 at 18:58
1

I am testing using this data.table: `DT <- data.table(Value = 1:10,Measure = c("Batch Time","Other"), Number = 1:5, group = letters[1:3])`, and then `DT[, Benchmark := Value[which.min(Number[Measure == "Batch Time"])], by = group]` works for me. Is the later code exactly what is being run though R? – Chris Jul 09 '15 at 19:03
The tables above are entirely example data sets. The real data is much more varied in terms of how many `Category` entries there might be per `FileName` and where a `FileName` might not have any entries for a particular `Number`. – JohnN Jul 09 '15 at 19:05
1

@David Fair, but the error you are getting has nothing to do with the contents of your data.table. See: http://stackoverflow.com/questions/25889234/error-unexpected-symbol-input-string-constant-numeric-constant-special-in-my-co. Because I am not getting that error on my side, it is likely that there was an error transferring into your rscript – Chris Jul 09 '15 at 19:06
I tested it by removing everything but the code that calls the SQL table and the code you provided. Now when running it, the code doesn't appear to finish executing because the console goes from `>` to `+` – JohnN Jul 09 '15 at 19:07
1

@David you missed the close bracket `]` on copy then – Chris Jul 09 '15 at 19:08
Haha, i did. Well at least it was a stoopid mistake – JohnN Jul 09 '15 at 19:09
Okay so running that I get `Error in '[.data.frame'(Testdf, , ':='(Benchmark, Value[which.min(Number[Category== : unused argument (by = group)` – JohnN Jul 09 '15 at 19:11
1

@David Testdf is not a data.table then. you need to used `library(data.table)` and then `setDT(Testdf)`. – Chris Jul 09 '15 at 19:13
`Error in eval(expr, envir, enclos) : object 'group' not found` – JohnN Jul 09 '15 at 19:15
What is group? That jumped out as strange because that wasn't found in the original data – JohnN Jul 09 '15 at 19:15
1

sorry should be FileName – Chris Jul 09 '15 at 19:16
eureka! It works! :) I really appreciate it. Now I've got to figure out why my chart is still not working. Thanks! – JohnN Jul 09 '15 at 19:19
Any idea why `data.table` would cause a part of the data frame to print (unintentionally) on a PDF for a .Rnw file. – JohnN Jul 09 '15 at 19:45
@David As we talked about before, this seems to be a new problem. Search on SO first, and then if you can't find ask a new question. – Chris Jul 09 '15 at 19:46
Just reading the code, it doesn't make sense to me. I think you should have `Value[Category == "Time"][which.min(Number[Category == "Time"])]` instead. Alternately, you could do `Value[which.min(Number*(Category == "Time"))]`, assuming the number is nonnegative. – Frank Jul 09 '15 at 23:08
@Frank when I use my test data.table, I get equivalent solutions. Is there a case where mine would break down? It is also the quickest of the three in a microbenchmark test – Chris Jul 09 '15 at 23:23
1

@Chris Suppose we have `V = 1:4; X = c(2,2,2,1); CatTime = c(T,F,T,T)` There is a difference between `V[which.min(X[CatTime])] # 3` & `V[CatTime][which.min(X[CatTime])] # 4`. I think the latter is the correct value, but I didn't read the question carefully (though I can't imagine a question for which the first version would be right). Oh, I was wrong about multiplying them being okay...I'm too used to taking the max. – Frank Jul 10 '15 at 00:33
1

@Frank my mistake. My method worked for a few test cases, and I (very incorrectly) thought that data.table would subset the way I wanted. Your method is the desired outcome, I've edited to reflect – Chris Jul 10 '15 at 13:24
@Chris I've noticed that there is a point where this breaks down. I have a File entry where there is no entries until `Number` gets to 13. This leads to `NA` values in the `Benchmark` column for that file. – JohnN Jul 10 '15 at 19:18
@David did you use my fixed code above (current as of this morning)? The code should work for all possible values of number. Is the value in category for that row "Time" – Chris Jul 10 '15 at 21:06
@Chris I had a thought (that seems really obvious and stupid that I didn't mention it) that might be a source of breakdown in the codes usability. Will this code work when there are entries: `Measure == 'Time' | Value == null | Number == 1` `Measure == 'Time' | Value == 256| Number == 2` Will the code choose `Value == 256` for the `Benchmark` column or will it try to use `null`? – JohnN Jul 13 '15 at 13:50
should pull null. to fix this, use `[is.numeric(Value) & Category == "Time"]` instead of `[Category == "Time"]` in both locations above – Chris Jul 13 '15 at 14:22

Benjamin · Answer 2 · 2015-07-09T17:27:50.307

0

Untested, but should get you started:

Ref <- Table1 %>%
  mutate(Category2 = factor(Category, c("Time", "Size", "Final"),
    FileNumber = as.numeric(sub("File", "", FileName)),
    FilePrefix = "File") %>%
  arrange(FilePrefix, FileNumber, Category2, Value) %>%
  group_by(FilePrefix, FileNumber, Category2) %>%
  mutate(NewColumn = Value[1])

edited Jul 09 '15 at 17:27

answered Jul 09 '15 at 17:01

Benjamin

16,897
6
45
65

It's your first table. You didn't give it a name, so I named it Table 1. – Benjamin Jul 09 '15 at 17:18
okay that's what I assumed. I wanted to make sure. When you factor `Category` by listing the different category entries, will it make a difference if there are non equal distributions of Category names? For example, File1 might have `Time`, `Size`, `Final` where as File2 might have only `Time` and `Final` and File3 could have `Time`, `Size`, `Space`, `Final`. – JohnN Jul 09 '15 at 17:24
I think that depends on if you want it to matter. Any level that isn't named is going to be assigned missing. When the data frame sorts, it will push the missing values to the end. so if you don't care about retaining the value, then no, it doesn't matter. If you want that information, then it's a problem. A safer approach (that I should have mentioned initially) would be to do `mutate(Category2 = factor(Category......)`. Then you still have the original data. I'll edit the answer. – Benjamin Jul 09 '15 at 17:27
When I run this, I get `Error: could not find function "%>%"`. I thought I had seen that used before but I can't find any documentation on what that operator is. – JohnN Jul 09 '15 at 17:34
Also, when I look at the object `ref`, its only a dataframe with the FileName and NewColumn. – JohnN Jul 09 '15 at 17:36
Oops. Sorry. use `library(dplyr)` to run this code. Hopefully that takes care of it. – Benjamin Jul 09 '15 at 17:55
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/82861/discussion-between-david-and-benjamin). – JohnN Jul 09 '15 at 18:12
Its still giving me a data frame with only two columns – JohnN Jul 09 '15 at 18:14

Selecting a unique value from an R data frame

2 Answers2

Linked