Outputting percentiles by filtering a data frame

Question

Note that, as requested in the comments, this question has been revised.

Consider the following example:

df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)

I would like to, for each value of FILTER, create a data frame which contains the 1st, 2nd, ..., 99th percentiles of VALUE. The final product should be

PERCENTILE  df_1  df_2  ...  df_10
1           [first percentiles]
2           [second percentiles]

etc., where df_i is based on FILTER == i.

Note that FILTER, although it contains numbers, is actually categorical.

The way I have been doing this is by using dplyr:

nums <- 1:10
library(dplyr)
for (i in nums) {
  df_temp <- filter(df, FILTER == i)$VALUE
  assign(paste0("df_", i), quantile(df_temp, probs = (1:99)/100))
}

and then I would have to cbind these (with 1:99 in the first column), but I would rather not type in every single df name. I have considered using a loop on the names of these data frames, but this would involve using eval(parse()).

Probably `mget` and `do.call`, but of course the _real_ problem here is the fact that you don't have all the `df_1, df_2`, etc in a list to begin with. — joran, Jul 06 '16 at 15:30
For the best possible answer, could you please tell us how you generated those data.frames? Not in detail, but did you use a loop to read in csvs using `assign`? — sebastian-c, Jul 06 '16 at 15:31
@sebastian-c Basically - I'm not sure if this is relevant to the problem at hand; I imagine it's not - I have to filter based on `i` from a data frame generated by `sqlQuery()`. For obvious reasons, I can't post that data on this website. The data frames, whose names contain `i`, are all dfs containing percentiles of the filtered data frame. They all have the same number of rows and have only one column, as in this example, and I would like to `cbind` them using a loop. — Clarinetist, Jul 06 '16 at 15:35
See gregor's answer in the [following post](http://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames) for some of the advantages in following joran's advice. — lmo, Jul 06 '16 at 15:35
You never need to post your _exact_ data, just data in the same format. I disagree with your assessment that these background issues don't matter. Experienced R programmers read your description and recognize the spot you are in because they have been there before and know that it was caused by a suboptimal decision _in the creation of the `df` objects_ in the first place. We're trying to help, honest. My guess is that earlier on you'd have been better off using `split`, but it's impossible to tell without more information. — joran, Jul 06 '16 at 15:43
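The `mget` and `do.call` route mentioned in the comments can be sketched as follows. This is a minimal sketch that assumes the `df_1`, ..., `df_10` objects already exist in the current environment, as created by the loop in the question; the names `pieces` and `combined` are illustrative only.

```r
# Recreate the question's setup: one named quantile vector per FILTER level.
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
nums <- 1:10
for (i in nums) {
  assign(paste0("df_", i),
         quantile(df$VALUE[df$FILTER == i], probs = (1:99) / 100))
}

# mget() looks the objects up by name and returns them as a named list.
pieces <- mget(paste0("df_", nums))

# do.call() passes each list element to cbind() as a separate argument,
# so no object name has to be typed out and eval(parse()) is not needed.
combined <- do.call(cbind, pieces)
dim(combined)
```

This works, but as the comments point out, building the list directly avoids the `assign`/`mget` round trip altogether.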

Anton · Answer 1 · 2016-07-06T16:27:02.523

I suggest that you use a list.

list_of_dfs <- list()
nums <- 1:10
for (i in nums){
  list_of_dfs[[i]] <- nums*i
}

df <- as.data.frame(do.call("cbind", list_of_dfs))
colnames(df) <- paste0("df_",1:10)

You'll get the result you want:

   df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
1     1    2    3    4    5    6    7    8    9    10
2     2    4    6    8   10   12   14   16   18    20
3     3    6    9   12   15   18   21   24   27    30
4     4    8   12   16   20   24   28   32   36    40
5     5   10   15   20   25   30   35   40   45    50
6     6   12   18   24   30   36   42   48   54    60
7     7   14   21   28   35   42   49   56   63    70
8     8   16   24   32   40   48   56   64   72    80
9     9   18   27   36   45   54   63   72   81    90
10   10   20   30   40   50   60   70   80   90   100

Please note, as requested in the comments, I've revised my question. — Clarinetist, Jul 06 '16 at 16:00
Thank you for the suggestion: I've replaced the second loop with do.call. However, how would you assign the list_of_dfs using lapply? — Anton, Jul 06 '16 at 16:27
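On the `lapply` question in the last comment, one possible sketch is shown below. It uses the same toy values as this answer (`nums * i`); the name `out` is illustrative and not part of the original answer.

```r
nums <- 1:10

# lapply() returns the list directly, so there is no need to create an
# empty list first and fill it by index inside a for loop.
list_of_dfs <- lapply(nums, function(i) nums * i)

# Naming the list elements gives the columns their final names up front.
names(list_of_dfs) <- paste0("df_", nums)

# A named list of equal-length vectors converts straight to a data frame.
out <- as.data.frame(list_of_dfs)
```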

score 1 · Accepted Answer · answered Jul 06 '16 at 17:17

Here's a basic outline of a possibly smoother approach. I have not included every single aspect of your desired output, but the modification should be fairly straightforward.

df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
df_s <- lapply(split(df, df$FILTER),
               FUN = function(x) quantile(x$VALUE, probs = c(0.25, 0.5, 0.75)))
out <- do.call(cbind,df_s)
colnames(out) <- paste0("df_",colnames(out))

> out
    df_1  df_2  df_3  df_4  df_5  df_6  df_7  df_8  df_9 df_10
25% 3.25 13.25 23.25 33.25 43.25 53.25 63.25 73.25 83.25 93.25
50% 5.50 15.50 25.50 35.50 45.50 55.50 65.50 75.50 85.50 95.50
75% 7.75 17.75 27.75 37.75 47.75 57.75 67.75 77.75 87.75 97.75

I did this for just 3 quantiles to keep things simple, but it obviously extends. And you can add the 1:99 column afterwards as well.
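For reference, here is a sketch of how the same idea might extend to all 99 percentiles with the leading PERCENTILE column the question asks for. The names `probs`, `q_list`, and `result` are illustrative, and splitting `df$VALUE` directly (rather than the whole data frame) is a small variation on the code above.

```r
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
probs <- (1:99) / 100

# Split VALUE by FILTER, then compute all 99 percentiles for each group.
q_list <- lapply(split(df$VALUE, df$FILTER), quantile, probs = probs)

# Bind the per-group vectors into a 99 x 10 matrix and prefix the column names.
out <- do.call(cbind, q_list)
colnames(out) <- paste0("df_", colnames(out))

# Add the PERCENTILE column; row.names = NULL drops the "1%", "2%", ... labels.
result <- data.frame(PERCENTILE = 1:99, out, row.names = NULL)
head(result, 3)
```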

I appreciate your help on this question! Thank you - and I can't believe I didn't know about `split` or `do.call` until now! :) — Clarinetist, Jul 06 '16 at 17:23

score 0 · Answer 3 · answered Jul 06 '16 at 15:36

How about using get?

df <- data.frame(1:10)

for (i in nums) {
  df <- cbind(df, get(paste0("df_", i)))
}

# get rid of first useless column
df <- df[, -1]

# get names
names(df) <- paste0("df_", nums)
df

Choubi

I understand the reason for not recommending it, but then again, his question is literally: "I would like to cbind them using a loop." – Choubi Jul 06 '16 at 15:53
Please note, as requested in the comments, I've revised my question. – Clarinetist Jul 06 '16 at 16:00
