R find top n results of column operation on aggregate operation per column over dataframe

Question

Say I have a dataframe called RaM that holds cumulative return values. In this case, they literally are just a single row of cumulative return values along with column headers, but I would like to apply the logic to not just single row dataframes.

Say I want to sort by the max cumulative return value of each column, or even the average, or the sum of each column.

So each column would be re-ordered so that the max cumulative returns for each column is compared and the highest return becomes the 1st column with the min being the last column

then say I want to derive either the top 10 (1st 10 columns after they are rearranged), or even the top 10%.

I know how to derive the column averages, but I don't know how to effectively do the remaining operations. There is an order function, but when I used it, it stripped my column names, which I need. I could easily then cut the 1st say 10 columns, but is there a way that preserves the names? I don't think I can easily extract the names from the unordered original dataframe and apply it with my sorted by aggregate dataframe. My goal is to extract the column names of the top n columns (in dataframe RaM) in terms of a column aggregate function over the entire dataframe.

something like

top10 <- getTop10ColumnNames(colSums(RaM))

that would then output a dataframe of the top 10 columns in terms of their sum from RaM

Here's output off RaM

> head(RaM,2)
                 ABMD        ACAD       ALGN        ALNY        ANIP      ASCMA        AVGO        CALD        CLVS       CORT
2013-01-31 0.03794643 0.296774194 0.13009009  0.32219178  0.13008130 0.02857604  0.13014640 -0.07929515  0.23375000  0.5174825
2013-02-28 0.14982079 0.006633499 0.00255102 -0.01823456 -0.05755396 0.07659708 -0.04333138  0.04066986 -0.04457953 -0.2465438
                  CPST         EA         EGY        EXEL FCSC       FOLD        GNC         GTT      HEAR          HK         HZNP
2013-01-31 -0.05269663 0.08333333 -0.01849711  0.01969365    0  0.4179104 0.07992677 0.250000000 0.2017417  0.10404624 -0.085836910
2013-02-28  0.15051595 0.11443102 -0.04475854 -0.02145923    0 -0.2947368 0.14079036 0.002857143 0.4239130 -0.07068063 -0.009389671
                  ICON        IMI        IMMU        INFI        INSY        KEG        LGND        LQDT         MCF        MU
2013-01-31  0.07750896 0.05393258 -0.01027397 -0.01571429 -0.05806459 0.16978417 -0.03085824 -0.22001958  0.01345609 0.1924290
2013-02-28 -0.01746362 0.03091684 -0.20415225  0.19854862  0.36849503 0.05535055  0.02189055  0.06840289 -0.09713487 0.1078042
                NBIX      NFLX       NVDA       OREX       PFPT          PQ       PRTA         PTX        RAS        REXX        RTRX
2013-01-31 0.2112299 0.7846467 0.00000000 0.08950306 0.06823721  0.03838384 -0.1800819  0.04387097 0.23852335 0.008448541  0.34328358
2013-02-28 0.1677704 0.1382251 0.03888981 0.04020979 0.06311787 -0.25291829  0.0266223 -0.26328801 0.05079882 0.026656512 -0.02222222
                  SDRL      SHOS         SSI        STMP         TAL         TREE        TSLA      TTWO        UVE       VICL
2013-01-31  0.07826093 0.2023956 -0.07788381  0.07103175 -0.14166875 -0.030504714  0.10746974 0.1053588  0.0365299  0.2302405
2013-02-28 -0.07585546 0.1384419  0.08052150 -0.09633197  0.08009728 -0.002860412 -0.07144761 0.2029581 -0.0330408 -0.1061453
                   VSI       VVUS        WLB
2013-01-31  0.06485356 -0.0976155 0.07494647
2013-02-28 -0.13965291 -0.1156069 0.04581673

Please provide some sample data or use an example built-in dataset so we can understand what you want. We don't know what your return values look like and it's unclear what happens with multiple rows. See [how to make a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — Calum You, Aug 01 '18 at 22:24

Calum You · Accepted Answer · 2018-08-02T00:07:42.747

Here's one way using the first section of your sample data to illustrate. You can gather up all the columns so that we can do summary calculations more easily, calculate all the summaries by group that you want, and then sort with arrange. Here I ordered with the highest sums first, but you could do whatever order you wanted.

library(tidyverse)
ram <- read_table2(
  "ABMD        ACAD       ALGN        ALNY        ANIP      ASCMA        AVGO        CALD        CLVS       CORT
0.03794643 0.296774194 0.13009009  0.32219178  0.13008130 0.02857604  0.13014640 -0.07929515  0.23375000  0.5174825
0.14982079 0.006633499 0.00255102 -0.01823456 -0.05755396 0.07659708 -0.04333138  0.04066986 -0.04457953 -0.2465438"
)
summary <- ram %>%
  gather(colname, value) %>%
  group_by(colname) %>%
  summarise_at(.vars = vars(value), .funs = funs(mean = mean, sum = sum, max = max)) %>%
  arrange(desc(sum))
summary
#> # A tibble: 10 x 4
#>    colname    mean     sum    max
#>    <chr>     <dbl>   <dbl>  <dbl>
#>  1 ALNY     0.152   0.304  0.322 
#>  2 ACAD     0.152   0.303  0.297 
#>  3 CORT     0.135   0.271  0.517 
#>  4 CLVS     0.0946  0.189  0.234 
#>  5 ABMD     0.0939  0.188  0.150 
#>  6 ALGN     0.0663  0.133  0.130 
#>  7 ASCMA    0.0526  0.105  0.0766
#>  8 AVGO     0.0434  0.0868 0.130 
#>  9 ANIP     0.0363  0.0725 0.130 
#> 10 CALD    -0.0193 -0.0386 0.0407

If you then want to reorder your original data frame, you can get the order from this summary output and index with it:

ram[summary$colname]
#> # A tibble: 2 x 10
#>      ALNY    ACAD   CORT    CLVS   ABMD    ALGN  ASCMA    AVGO    ANIP
#>     <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
#> 1  0.322  0.297    0.517  0.234  0.0379 0.130   0.0286  0.130   0.130 
#> 2 -0.0182 0.00663 -0.247 -0.0446 0.150  0.00255 0.0766 -0.0433 -0.0576
#> # ... with 1 more variable: CALD <dbl>

Created on 2018-08-01 by the reprex package (v0.2.0).

this worked. I modified my code seamleslly inserting a data.frame(RaM) because I didn't have a data frame haha. Thank you so much, add this as the answer and I'll accept it. I am on my way to some good portolio investing — thistleknot, Aug 02 '18 at 00:39
yes, thank you. I've been knee deep in r for the past few days, this helped tremendeously! I owe you infinite internet gratitude and I truly appreciate it. I was doing the operation by hand and was wondering how for the life of me, I was able to do exactly what I needed and even feed the list of names back into a portfolio optimizer. — thistleknot, Aug 02 '18 at 06:00

R find top n results of column operation on aggregate operation per column over dataframe

1 Answers1