2

I have a problem with a dataframe with individuals with same ID but different expenses at varius categories like supermarkets, healt, cars, etc. My dataframe is like this:

Base=data.frame(ID=c("CED1","CED2","CED3","CED1","CED1","CED3","CED3","CED2","CED2","CED4"),Value=c(10,20,10,30,50,10,10,20,30,30),Categorie=c("Markets","Markets","Health","Cars","Cars","Health","Cars","Health","Cars","Markets"))

    ID   Value Categorie
1  CED1    10   Markets
2  CED2    20   Markets
3  CED3    10    Health
4  CED1    30      Cars
5  CED1    50      Cars
6  CED3    10    Health
7  CED3    10      Cars
8  CED2    20    Health
9  CED2    30      Cars
10 CED4    30   Markets

How you can see I have different ID and categories. I would like indicators from this data frame in a new one, and have someone like this:

ID   Total.Value   Max.Value  Min.Value  Average.Value  %Markets  %Health  %Cars
CED1     90           50         10           30           11%       0%      89%
CED2     70           30         20           23.33        28.5%    28.5%   42.8%
CED3     30           10         10           10           33.3%    33.3%   33.3%
CED4     30           30         30           30           100%      0%       0%

I am trying to develop this dataframe using plyr but I don't get correct indicators. Thanks for your help.

Duck
  • 39,058
  • 13
  • 42
  • 84
  • How are the percentages `11, 0 and 89%` for `CED1`? – Arun Mar 18 '13 at 15:48
  • What exactly is *wrong* with this? Can you post a) the code you are using and b) your expected output. Always post [a small reproducible example](http://stackoverflow.com/q/5963269/1478381) – Simon O'Hanlon Mar 18 '13 at 15:51
  • @Arun - because CED1 doesn't spend in health. Only in cars (twice) and markets – Simon O'Hanlon Mar 18 '13 at 15:51
  • @SimonO101, yes, but how does the value 11 and 89 come up? I thought it should be 33 and 67 %? – Arun Mar 18 '13 at 15:53
  • @Arun The values are 10 and {30 + 50}, with a total of 90. Smallest is 10, average is 90/3 and largest is 50. % spend in markets is 10/90 and % spend in cars is 80/90. It is correct?! What am I missing? – Simon O'Hanlon Mar 18 '13 at 15:55

2 Answers2

3

Here's a ddply solution.

library(plyr)
ddply(Base, .(ID), summarise, Total = sum(Value),
      Max.Value = max(Value),
      Min.Value = min(Value),
      Average.Value = mean(Value),
      "%Markets" = sum(Value[Categorie == "Markets"])/sum(Value) * 100,
      "%Health" = sum(Value[Categorie == "Health"])/sum(Value) * 100,
      "%Cars" = sum(Value[Categorie == "Cars"])/sum(Value) * 100)

The result:

    ID Total Max.Value Min.Value Average.Value  %Markets  %Health    %Cars
1 CED1    90        50        10      30.00000  11.11111  0.00000 88.88889
2 CED2    70        30        20      23.33333  28.57143 28.57143 42.85714
3 CED3    30        10        10      10.00000   0.00000 66.66667 33.33333
4 CED4    30        30        30      30.00000 100.00000  0.00000  0.00000
Sven Hohenstein
  • 80,497
  • 17
  • 145
  • 168
1

Here's a data.table solution:

require(data.table)
dt <- data.table(Base, key="ID")
dt[, as.list(c(total=sum(Value), max=max(Value), 
    min=min(Value), mean=mean(Value), 
    tapply(Value, Categorie, sum)/sum(Value) * 100)), 
by=ID]
#      ID total max min     mean     Cars   Health   Markets
# 1: CED1    90  50  10 30.00000 88.88889       NA  11.11111
# 2: CED2    70  30  20 23.33333 42.85714 28.57143  28.57143
# 3: CED3    30  10  10 10.00000 33.33333 66.66667        NA
# 4: CED4    30  30  30 30.00000       NA       NA 100.00000

Here, you can replace NA with 0. If you insist on getting 0 directly instead of NA, then:

dt[, {tt <- tapply(Value, Categorie, sum)/sum(Value); ## compute ratio for percentage
      tt[is.na(tt)] <- 0; 
      as.list(c(total=sum(Value),                     ## total
          summary(Value)[c(6,1,4)],                   ## max, min and mean
          tt* 100))                                   ## percentages
     }, 
by=ID]

#      ID total Max. Min.  Mean     Cars   Health   Markets
# 1: CED1    90   50   10 30.00 88.88889  0.00000  11.11111
# 2: CED2    70   30   20 23.33 42.85714 28.57143  28.57143
# 3: CED3    30   10   10 10.00 33.33333 66.66667   0.00000
# 4: CED4    30   30   30 30.00  0.00000  0.00000 100.00000

Here, I've also shown how you could use summary function to get some values instead of writing them one by one.

Arun
  • 116,683
  • 26
  • 284
  • 387