I am fairly new to R and am trying to summarise a data frame called df into a data.table. My data frame looks like this:

    preds          ground_truth  group
1   0.0008786491   0             1
2   0.0009080505   1             1
3   0.0009118593   0             1
4   0.0009121987   1             2
6   0.0009514780   0             2
7   0.0009572834   1             3
8   0.0009645682   0             4
9   0.0009721006   1             4
10  0.0009761475   0             5
11  0.0009835458   0             5

There are several pieces of information I want to extract from this, most of which I have managed successfully.

For each unique group, I want the mean of preds, the mean of ground_truth, the number of preds values, and the range of preds. I have managed to get all of these, but the range produces two rows per group (one for the min and one for the max) instead of a single row.

I have tried using list(), c() and as.character(), but nothing has worked.

The output looks like this, with the first row of each group holding the min and the second row holding the max:

    Group_number                range    N   predicted_mean   actual_mean
 1:            1    0.479342132806778 6492        0.55383       0.715
 2:            1    0.855185627937317 6492        0.55383       0.715
 3:            2    0.407937824726105 6492        0.44054       0.532
 4:            2    0.479312479496002 6492        0.44054       0.532

I want the range column to hold both values in a single row, in any format:

        Group_number                range                         N   predicted_mean   actual_mean
     1:            1    (0.479342132806778, 0.855185627937317)   6492        0.55383       0.715

My solution so far has been this:

group_results <- data.table(Group_number = numeric(), range=numeric(), N=numeric(), 
                          predicted_mean=numeric(), actual_mean=numeric())
for (i in unique(df$group)){
  pred <- df$preds[df['group'] == i]
  actual <- df$ground_truth[df['group'] == i]
  predicted_mean <- sum(pred)/length(pred)
  actual_mean <- sum(actual)/length(actual)
  range <- c(min(pred), max(pred))
  N <- length(pred)
  group_results <- rbind(group_results, list(i, range, N, round(predicted_mean, 5),
                                           round(actual_mean, 3)))
}

Can someone please tell me how to get the range onto a single row in a data.table?

Thanks

geds133

1 Answer

Don't use a for loop for tasks like this. These are grouped operations, and several libraries provide functions for them. For example, in data.table you can do:

library(data.table)

setDT(df)[, .(preds = mean(preds), 
              ground_truth = mean(ground_truth), 
              N = .N, 
              range = list(range(preds))), group]

#   group    preds ground_truth N             range
#1:     1 0.000900        0.333 3 0.000879,0.000912
#2:     2 0.000932        0.500 2 0.000912,0.000951
#3:     3 0.000957        1.000 1 0.000957,0.000957
#4:     4 0.000968        0.500 2 0.000965,0.000972
#5:     5 0.000980        0.000 2 0.000976,0.000984

Since you want only one row per group, the two range values are stored together in a list column.
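If a list column is awkward to work with later, two other layouts also keep one row per group: separate min and max columns, or a single formatted string like the one shown in the question. The following is only a sketch on the question's sample data; the column names range_min, range_max and range_label are made up for illustration.

```r
library(data.table)

# Sample data from the question
df <- data.frame(
  preds = c(0.0008786491, 0.0009080505, 0.0009118593, 0.0009121987,
            0.000951478, 0.0009572834, 0.0009645682, 0.0009721006,
            0.0009761475, 0.0009835458),
  ground_truth = c(0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L),
  group = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 5L)
)

res <- as.data.table(df)[, .(
  predicted_mean = mean(preds),
  actual_mean    = mean(ground_truth),
  N              = .N,
  range_min      = min(preds),   # numeric, easy to filter or plot
  range_max      = max(preds),
  # character column, for display only (hypothetical name)
  range_label    = paste0("(", min(preds), ", ", max(preds), ")")
), by = group]

res
```

The numeric columns are the better choice if the bounds are needed for further computation; the string form is only convenient for printing.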


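The same can be done in dplyr. Note that summarise() evaluates its arguments in order, so range must be computed before preds is overwritten with its mean:

library(dplyr)
df %>%
  group_by(group) %>%
  # range comes first: after preds = mean(preds), later expressions
  # would see the group mean instead of the original values
  summarise(range = list(range(preds)),
            preds = mean(preds), 
            ground_truth = mean(ground_truth), 
            N = n())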
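
One thing to watch in dplyr: summarise() evaluates its arguments in order, and an output whose name matches an input column replaces that column for the expressions that follow. Giving the outputs distinct names avoids the issue regardless of order. A minimal sketch on a small made-up data frame (toy is hypothetical; predicted_mean and actual_mean are the names used in the question):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  preds        = c(0.1, 0.3, 0.2, 0.6),
  ground_truth = c(0L, 1L, 1L, 1L),
  group        = c(1L, 1L, 2L, 2L)
)

res <- toy %>%
  group_by(group) %>%
  summarise(predicted_mean = mean(preds),        # new name, preds is untouched
            actual_mean    = mean(ground_truth),
            N              = n(),
            range          = list(range(preds))) # still sees the original preds

res$range[[1]]  # min and max of preds in group 1: 0.1 0.3
```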

data

df <- structure(list(preds = c(0.0008786491, 0.0009080505, 0.0009118593, 
0.0009121987, 0.000951478, 0.0009572834, 0.0009645682, 0.0009721006, 
0.0009761475, 0.0009835458), ground_truth = c(0L, 1L, 0L, 1L, 
0L, 1L, 0L, 1L, 0L, 0L), group = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 
4L, 5L, 5L)), class = "data.frame", row.names = c(NA, -10L))
Ronak Shah
  • Really great answer. Is it possible to name the columns myself? The average of `preds` is just called `preds` when I would like to name it `predicted_average` etc. – geds133 Aug 28 '20 at 09:05
  • Sure, you can change `.(preds = mean(preds),` to `.(predicted_average = mean(preds),` and do the same for the other values. – Ronak Shah Aug 28 '20 at 09:10
  • Great, thanks. One other problem not stated in the question is that my actual group labels look like this: `(-129,6.49e+03]`. The data.table looks exactly as I want, but is there a way to simply replace the group column with numbers, so the first row becomes group 1, the second becomes group 2? – geds133 Aug 28 '20 at 09:15
  • I guess you use `cut` to get such group names. In your `cut` call you can specify `labels = FALSE` and it will give you numbers. – Ronak Shah Aug 28 '20 at 09:20
  • Worked as expected. Many thanks – geds133 Aug 28 '20 at 10:00
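
The `cut` suggestion from the comments can be sketched as follows. The values and break points here are hypothetical, picked only so that the first interval resembles the label quoted above; the asker's real breaks are not shown in the thread.

```r
# Hypothetical values and break points, for illustration only
scores <- c(-50, 10, 7000, 12000)
breaks <- c(-129, 6490, 13000)

# Default: a factor whose levels are interval labels
as_factor <- cut(scores, breaks)

# labels = FALSE: plain integer codes, one per interval
as_codes <- cut(scores, breaks, labels = FALSE)
as_codes  # 1 1 2 2
```

The integer codes follow the order of the intervals, so the lowest interval becomes group 1, the next group 2, and so on.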