0

I have a data frame like this

data.frame(age=c("(0,5]", "(5,10]", "(10,15]", "(15,20]", "(20,25]", "(25,30]"),
           C1=c(0, 0, 0, 0, 0, 0),
           C2=c(0, 0, 0, 0, 0, 0),
           C3=c(0, 270, 30, 4, 0, 0),
           C4=c(0, 30, 30, 4, 0, 0))

Just that the columns starting with C are +50. I'm going to use https://stackoverflow.com/a/10139458/792066 to create a pareto chart with the C columns, but the sheer amount of labels makes the chart pretty worthless. The usual solution is to create a new column called "others" with those that aren't top 5~10. I suppose I'm looking for what summarize() does for factor columns with categorical variables. How can I sum all columns into a new column if their sum isn't in the range of the top X?

alistaire
  • 42,459
  • 4
  • 77
  • 117
Braiam
  • 1
  • 11
  • 47
  • 78

1 Answers1

0

Here's a base R approach using colSums and rowSums:

df <- data.frame(age = c("(0,5]", "(5,10]", "(10,15]", "(15,20]", "(20,25]", "(25,30]"),
                 C1 = c(0, 0, 0, 0, 0, 0),
                 C2 = c(0, 0, 0, 0, 0, 0),
                 C3 = c(0, 270, 30, 4, 0, 0),
                 C4 = c(0, 30, 30, 4, 0, 0))

others <- names(sort(-colSums(df[-1]))[-1:-2])

df$others <- rowSums(df[others])

df_lumped <- df[!names(df) %in% others]

df_lumped
#>       age  C3 C4 others
#> 1   (0,5]   0  0      0
#> 2  (5,10] 270 30      0
#> 3 (10,15]  30 30      0
#> 4 (15,20]   4  4      0
#> 5 (20,25]   0  0      0
#> 6 (25,30]   0  0      0

You would need to adjust [-1:-2] depending the amount of columns you want to keep. For example [-1:-5] would keep the top 5.

Braiam
  • 1
  • 11
  • 47
  • 78
alistaire
  • 42,459
  • 4
  • 77
  • 117