I have a dataset that looks like this:

   UserID    Query     Asthma    Stroke    
   142       abc dr    0         0
   142       asthma    1         0
   142       stroke    0         1
   145       stroke    0         1
   145       pizza     0         0

There are hundreds of thousands of UserIDs and each user submitted a variable number of queries. In order to do further analysis, I need to sum "Asthma" and "Stroke" for each UserID. Any advice? Can you recommend resources for dealing with this type of dataset?

Thank you in advance... I'm very new to this.

andrly
  • `tapply` might do nicely. `tapply(Asthma, INDEX=list(UserID), sum)`. If that's not what you want, you might want to include more details in your question. – Jota Jul 01 '13 at 20:53
  • Surely a duplicate and many times over, with one of several answers being `aggregate(dfrm[, c("Asthma", "Stroke")], list(UserID = dfrm$UserID), sum)`, where `sum` is passed as the aggregating function. – IRTFM Jul 01 '13 at 20:53
  • @DWin, :). That's a "broad" duplicate :D – Arun Jul 01 '13 at 20:57
  • I admit I didn't put a lot of effort into finding a narrow duplicate, but I didn't think the OP put much effort into searching for an answer either. Feel free to find a better one and post it. Unless you really think this is a "new" question of course? – IRTFM Jul 01 '13 at 21:00
  • @Dwin, it's a pointer in a direction, at least. Thank you for that. – andrly Jul 01 '13 at 21:01
  • @DWin, I didn't mean it in a wrong way. Just bad humour, I suppose. Certainly not a "new" question. I'll see if I can find a closer duplicate. – Arun Jul 01 '13 at 21:19
  • @user2535082, In general, have a look at [**these questions**](https://encrypted.google.com/search?{google:acceptedSuggestion}oq=aggregate+multiple+columns+R+stackoverflow&sourceid=chrome&ie=UTF-8&q=aggregate+multiple+columns+R+stackoverflow). – Arun Jul 01 '13 at 21:25

1 Answer

You can use the `ddply` function from the plyr package for that.

Assuming your dataset is stored in a data frame called `sample`:

install.packages("plyr")  # only needed once per R installation
library(plyr)
# Sum Asthma and Stroke within each UserID
ddply(sample, .(UserID), summarize, sumAsthma = sum(Asthma), sumStroke = sum(Stroke))

Note: If you have several numeric columns to sum, `numcolwise(sum)` applies `sum` to every numeric column in each group, so you don't have to name them one by one.

ddply(sample, .(UserID), numcolwise(sum))
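
If you would rather not install a package, the same per-user totals can probably be obtained with base R's `aggregate`. The following is only a minimal sketch: the data frame name `queries` is made up for illustration, and it contains just the five example rows from the question.

```r
# Rebuild the five example rows from the question.
# The name `queries` is hypothetical; use your own data frame instead.
queries <- data.frame(
  UserID = c(142, 142, 142, 145, 145),
  Query  = c("abc dr", "asthma", "stroke", "stroke", "pizza"),
  Asthma = c(0, 1, 0, 0, 0),
  Stroke = c(0, 0, 1, 1, 0)
)

# Sum both indicator columns within each UserID, using base R only.
totals <- aggregate(cbind(Asthma, Stroke) ~ UserID, data = queries, FUN = sum)
print(totals)
```

On these rows, user 142 ends up with 1 for Asthma and 1 for Stroke, and user 145 with 0 and 1. The non-numeric `Query` column is simply left out of the formula, so it does not get in the way of the sum.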
Metrics