1

I'm having a dataframe with both numerical and categorical variables.

The data frame has 2 keys where I need group by. When the rows are grouped by keys so & li, then categorical variables need to be selected based on mode & numerical by average.

   SO      LI        A         B
1 2000     20        P          22
2 2000     20        P          40
3 1000     10        Q          80

The output needs to be,

   SO      LI        A         B
1 2000     20        P          31
2 1000     10        Q          80

I used the following code so far.

library(plyr)
groupColumns = c("so","li")
dataColumns = c(colnames(numericalColumns(Planning_DF)))
res = ddply(Planning_DF, groupColumns, function(x) colMeans(x[dataColumns]))
head(res)

so the numerical column grouping & average happening. How to get the categorical variables Mode to this?

user3789200
  • 1,166
  • 2
  • 25
  • 45

1 Answers1

1

It is easier with dplyr

library(dplyr)
groupColumns = c("SO","LI")
Planning_DF %>%
    group_by(across(all_of(groupColumns))) %>%
    summarise(across(where(is.numeric), mean), 
           across(where(is.character), Mode), .groups = 'drop')

-output

# A tibble: 2 × 4
     SO    LI     B A    
  <int> <int> <dbl> <chr>
1  1000    10    80 Q    
2  2000    20    31 P    

where

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

If we need to do this in plyr, use colwise with a custom function that does the mean or Mode based on the column type

f1 <- function(x) if(is.numeric(x)) mean(x, na.rm = TRUE) else Mode(x)
plyr::ddply(Planning_DF, .variables = groupColumns, 
     .fun = plyr::colwise(f1))

-output

    SO LI A  B
1 1000 10 Q 80
2 2000 20 P 31

data

Planning_DF <- structure(list(SO = c(2000L, 2000L, 1000L), LI = c(20L, 20L, 
10L), A = c("P", "P", "Q"), B = c(22L, 40L, 80L)), 
class = "data.frame", row.names = c("1", 
"2", "3"))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Thanks alot Akrun... May i know the use of this please .groups = 'drop' – user3789200 Oct 20 '21 at 17:45
  • 1
    @user3789200 it is from the `summarise` options. Byd efault, it removes the last group attribute (if there is only one, it drops it), but there is a message in the console which may be distracting. So, i used it. You can check [here](https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override/62140681#62140681) for more iinfo – akrun Oct 20 '21 at 17:46