2

I have a dataframe in R (p2.df) that has aggregated a range of values into the following (there are many more columns this is just an abridge version):

genre       rating  cc      dd      ee
Adventure   FAILURE 140393  20865   358806
Adventure   SUCCESS 197182  32872   492874
Fiction     FAILURE 140043  14833   308602
Fiction     SUCCESS 197725  28848   469879
Sci-fi      FAILURE 8681    1682    24259
Sci-fi      SUCCESS 7439    1647    22661

I want to get the net values of the proportions for each column, which I can get in a spreadsheet but can't in R studio.

The formula in the spreadsheet follows the pattern:

net_cc = (cc(success)/(cc(success)+dd(success)+ee(success)) - (cc(fail)/(cc(fail)+dd(fail)+ee(fail))

What I want to get out in R is this table that I can get from the spreadsheet:

genre       net_cc          net_dd          net_ee
Adventure   0.002801373059  0.005350579467  -0.008151952526
Fiction     -0.01825346696  0.009417699223  0.008835767735
Sci-fi      -0.01641517271  0.003297091109  0.0131180816

Any ideas how? If it's any use I created the p2.df by summarising a previous table as:

library(dplyr)

p2.df<- s2.df %>% group_by(genre,rating) %>% summarise_all(sum)  
DTYK
  • 1,098
  • 1
  • 8
  • 33
JRUK
  • 67
  • 7
  • Thanks all, I selected Moody's as the answer as it was the simplest (I couldn't get utubun's neater one to work) but MKR's also worked. – JRUK Jun 24 '18 at 10:47
  • ...and then it stopped working. I think it's because I used 'summarise_at' to get the above dataframe and it doesn't like working with groups. – JRUK Jun 24 '18 at 12:57
  • that's probably because you created your data set by `data.frame()` or read it by `read.csv()` which by default convert strings to factors. I wrote my example using data with `rating` and `genre` converted to `character`, that's a default for `tible` and `read_csv` from `readr`. Please see the data @MKR used in his answer (last row - `stringsAsFactors = FALSE`). – utubun Jun 24 '18 at 15:59
  • 1
    Thanks yes you were right, the table had groupings so I added as.data.frame() that fixed it. – JRUK Jun 26 '18 at 18:51

3 Answers3

2

It's always better to work on data in long format. But if OP doesnt want to transform data in long format due to any constraint (e.g. number of columns are more which will lead to large number of rows in long format etc) then a solution in using dplyr::summarise_at can be achieved as:

library(dplyr)

df %>% mutate(rowSum = rowSums(.[,names(df)[3:5]])) %>%
  group_by(genre) %>%
  summarise_at(vars(names(df)[3:5]),
              funs(net = .[rating == "SUCCESS"]/rowSum[rating == "SUCCESS"] - 
                         .[rating == "FAILURE"]/rowSum[rating == "FAILURE"] )) %>%
  as.data.frame()

#       genre       cc_net      dd_net       ee_net
# 1 Adventure  0.002801373 0.005350579 -0.008151953
# 2   Fiction -0.018253467 0.009417699  0.008835768
# 3    Sci-fi -0.016415173 0.003297091  0.013118082

Data:

df <- read.table(text="
genre       rating  cc      dd      ee
Adventure   FAILURE 140393  20865   358806
Adventure   SUCCESS 197182  32872   492874
Fiction     FAILURE 140043  14833   308602
Fiction     SUCCESS 197725  28848   469879
Sci-fi      FAILURE 8681    1682    24259
Sci-fi      SUCCESS 7439    1647    22661",
header = TRUE, stringsAsFactors = FALSE)
MKR
  • 19,739
  • 4
  • 23
  • 33
  • it's a neat intuitive solution, but you can clean it a bit further, you could use just `rowSums(.[,3:5])` on 1st line and then `summarise_at(3:5,...` – moodymudskipper Jun 24 '18 at 11:13
  • @Moody_Mudskipper Thats elegant suggestion. I had done same at first. But, the problem was that for `summarise_at` it was expected to be as `2:4` since one column was out for grouping. Hence, I thought it would be easier to relate if I use `3:5` at both places. – MKR Jun 24 '18 at 11:29
  • It could be something related to different versions, for me it works with `3:5` and returns an error with `2:4` . I'm using dplyr_0.7.5 – moodymudskipper Jun 24 '18 at 11:38
  • see: https://stackoverflow.com/questions/45883513/using-dplyr-summarise-at-with-column-index/51009642#51009642 – moodymudskipper Jun 24 '18 at 11:49
  • Thanks for your help on this, you say it's better to work in long format which I'm happy to do as these solutions seem to be behaving temperamentally on me - when I expand them into my own summary table sometimes it works and sometimes it'll tell me objects (column names) are not found or Column `rowSum` must be length 2 (the group size) or one, not 16. Perhaps I should try the long form rather than from the summary table. – JRUK Jun 24 '18 at 12:36
  • Thanks both for your help. I'm now trying to get the overall net value for each column (ie not splitting by genre but doing similar maths). I have adapted this to the following but I get zero (sorry I can't even get a linebreak in this comment. Realise I may have to post a new query) . df %>% mutate(rowSum = rowSums(.[,names(df)[3:5]])) %>% group_by(rating) %>% summarise_at(vars(names(df)[3:5]), funs(net = .["rating"]/rowSum["rating"] )) %>% as.data.frame() – JRUK Jun 27 '18 at 19:46
  • @JRUK In that case you dont have to even use `group_by`. Just use `df %>% summarise_at(vars(names(df)[3:5]), funs(net = sum(.))) %>% as.data.frame() `. Please let me know if you need to have relative net values. – MKR Jun 27 '18 at 19:58
  • @Moody_Mudskipper Sorry. I had missed your comments earlier. I'm not sure about this change in `dplyr`. I'm using `0.7.4` version of `dplyr` though. – MKR Jun 27 '18 at 20:02
  • Thanks @MKR that didn't quite give what I was after, I didn't explain myself well so I have clarified and [posted as a new question](https://stackoverflow.com/questions/51090065/getting-net-values-as-a-proportion-from-a-dataframe-in-r-part-2). – JRUK Jun 28 '18 at 19:29
1

using tidyverse:

library(tidyverse)
df %>% gather(,,3:5) %>%
  spread(rating,value) %>%
  group_by(genre) %>%
  transmute(key,net = SUCCESS/sum(SUCCESS) - FAILURE/sum(FAILURE)) %>%
  ungroup %>%
  spread(key,net)

# # A tibble: 3 x 4
# genre           cc      dd       ee
#   <chr>        <dbl>   <dbl>    <dbl>
# 1 Adventure  0.00280 0.00535 -0.00815
# 2 Fiction   -0.0183  0.00942  0.00884
# 3 Sci-fi    -0.0164  0.00330  0.0131 
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
1

My answer is very close to @MKR answer, however, I just wish to point out, that we can make use of decoded rating (SUCESS = 1 and FAILURE = -1`) variable to avoid subsetting in the last part:

df %>% 
  mutate(rating = (rating == "SUCCESS")*2 - 1, denom = rowSums(.[3:5])) %>%
  group_by(genre) %>%
  summarise_at(vars(cc:ee), funs(sum(rating * . / denom)))

   #   A tibble: 3 x 4
   #   genre           cc      dd       ee
   #   <chr>        <dbl>   <dbl>    <dbl>
   # 1 Adventure  0.00280 0.00535 -0.00815
   # 2 Fiction   -0.0183  0.00942  0.00884
   # 3 Sci-fi    -0.0164  0.00330  0.0131 
utubun
  • 4,400
  • 1
  • 14
  • 17