0

I am having trouble grouping and summing the follwing data in R:

category freq
1    C1     9
2    C2    39
3    C3     3
4    A1    38
5    A2     2
6    A3    29
7    B1   377
8    B2   214
9    B3   790
10   B4   724
11   D1   551
12   D2   985
13   E5    19
14   E4    28

to look like this:

category freq
1    A    69
2    B    2105
3    C    51
4    D    1536
5    E    47

I usually use ddply to aggregate data by an attribute but this just adds all values rows with the same attribute in a given column. I need to be able to specify multiple attributes that should be lumped into one category.

ekad
  • 14,436
  • 26
  • 44
  • 46
Rick
  • 61
  • 5
  • 12
  • Please take a look at my modified answer. I have included a full example, as well as what the output would look like. – Mayou Aug 21 '13 at 20:02

1 Answers1

2

Why don't you add a column to your dataframe, that would be the letter part of your "Category" column. Then, you could use ddply.

Example:

 df = data.frame(id = c(1,2,3,4,5), category = c("AB1", "AB2", "B1", "B2", "B3"), freq = c(50,51,2,26))
 df$new = as.factor(gsub("\\d", "", df$category))

You could then use ddply based on the new column, as follows:

 library(plyr)
 aggregate <- ddply(df, .(new), summarize, freq = sum(freq))

You get the following result:

#  new freq
#1  AB  101
#2   B   31

This would work only if you intend to group all the categories with similar "alphabetical" substring under the same umbrella category.

If, HOWEVER, you wish to group custom categories under one category, (your example: KG, XM and L4 would be part of the same category), you could define new "super" categories, and assign each sub-category to the appropriate "super" category. One way that I can think of is the switch function. Please see example below:

 df = data.frame(id = c(1,2,3,4,5), category = c("A", "B", "KG", "XM", "L4"), freq = c(50,51,3,2,26))

 fct <- function(cat) {switch(cat, "A" = "CAT1", "B" = "CAT2", "KG" = "CAT3", "XM" = "CAT3", "L4"="CAT3")}
 df$new = as.factor(unlist(lapply(df$category, fct)))

 aggregate <- ddply(df, .(new), summarize, freq = sum(freq))

This will give you:

 #   new freq
 #1 CAT1   50
 #2 CAT2   51
 #3 CAT3   31
Mayou
  • 8,498
  • 16
  • 59
  • 98
  • The data I listed was just an example but I will actually need to define cases where something like KG, XM, and L4 will be aggregated into one category. – Rick Aug 21 '13 at 20:07
  • Not a problem. Let me modify my answer accordingly and get back to you. – Mayou Aug 21 '13 at 20:07
  • @Rick Please check my modified answer. I have replaced the `substr` function by `gsub`, which would extract the "alphabetical" portion of the column, no matter what the length of each category's name is. – Mayou Aug 21 '13 at 20:14
  • @Rick I just realized that you wish to customize your categories, beyond a classification based on similar "alphabetical" portions of the categories' names. I have included one possible way of achieving what you are looking for. Please take a look at my edited answer. Hope this helps. – Mayou Aug 21 '13 at 20:33
  • Again, this is a very simplified way of doing this. If you could provide a detailed description of how many subcategories and "super" categories you have, I could help you find a more efficient way of getting the result you want. – Mayou Aug 21 '13 at 20:41
  • 1
    Worked perfectly!! Thank you so much. – Rick Aug 26 '13 at 17:11