R: summarise multiple column (numeric, character) and remove NAs

Question

I have a data.frame with many columns (~50). Some of them are character, some are numeric and 3 of them I use for grouping.

I need to:

remove NAs from numeric columns
calculate the mean of each of the numeric columns
extract the first element of the character columns

Let's say, we're using modified iris data as below:

data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris[1,] <- NA ## introducing NAs

I have ~50 columns in total, numeric and character mixed together. I've been trying something like:

giris <- group_by(iris, Species, year)
cls <- unlist(sapply(giris, class)) ## find out classes
action <- ifelse(cls == "numeric", "mean", "first")
action <- paste(action)
summarise_each(giris, action)

What I get is means for all columns in a group followed by columns with the first values in respective group. And NAs are not handled... Which is not exactly what I seek...

Help anyone?

talat · Accepted Answer · 2016-01-18T17:03:03.620

You could try this with an if/else in the funs of summarise_each:

iris %>% 
  group_by(Species, year) %>% 
  summarise_each(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))

Since you have some NA's also in grouping columns, you could add a filter statement:

iris %>% 
  filter(!is.na(Species) & !is.na(year)) %>% 
  group_by(Species, year) %>% 
  summarise_each(funs(if(is.numeric(.)) mean(., na.rm = TRUE) else first(.)))
#Source: local data frame [6 x 7]
#Groups: Species [?]
#
#     Species  year Sepal.Length Sepal.Width Petal.Length Petal.Width color
#      (fctr) (dbl)        (dbl)       (dbl)        (dbl)       (dbl) (chr)
#1     setosa  2000        5.025    3.479167       1.4625       0.250   red
#2     setosa  3000        4.984    3.376000       1.4640       0.244   red
#3 versicolor  2000        6.012    2.776000       4.3120       1.344 green
#4 versicolor  3000        5.860    2.764000       4.2080       1.308 green
#5  virginica  2000        6.576    2.928000       5.6400       2.044  blue
#6  virginica  3000        6.600    3.020000       5.4640       2.008  blue

To avoid potential NA's in the color column (or any non-numeric columns), you could modify it to first(na.omit(.)).

You could also try data.table:

library(data.table)
setDT(iris)
iris[!is.na(Species) & !is.na(year), lapply(.SD, function(x) {
     if(is.numeric(x)) mean(x, na.rm = TRUE) else x[!is.na(x)][1L]}), 
     by = list(Species, year)]
#      Species year Sepal.Length Sepal.Width Petal.Length Petal.Width color
#1:     setosa 2000        5.025    3.479167       1.4625       0.250   red
#2:     setosa 3000        4.984    3.376000       1.4640       0.244   red
#3: versicolor 2000        6.012    2.776000       4.3120       1.344 green
#4: versicolor 3000        5.860    2.764000       4.2080       1.308 green
#5:  virginica 2000        6.576    2.928000       5.6400       2.044  blue
#6:  virginica 3000        6.600    3.020000       5.4640       2.008  blue

I think you may want to add `na.omit()` above the `group_by`, but this is the gist of the question. — JasonAizkalns, Jan 18 '16 at 14:45
@JasonAizkalns, I don't think so - this would potentially remove many rows that should be kept. But I agree that a `filter(!is.na(Species) & !is.na(year))` would make sense — talat, Jan 18 '16 at 14:46
This is a really nice and (most important) working solution that also allows to get a glimpse of what is going under the hood of summarise_each. I am really, REALLY grateful for it, Doncendo :-) . At the same time, do you think that there may be a way to speed it up? — rpl, Jan 18 '16 at 16:50
@rpl, glad it is useful for you. I don't think there will be significant performance within dplyr for this code. You could try replacing `first(na.omit(.))` with `.[!is.na(.)][1L]`. I might also add a similar approach using data.table which may be faster. — talat, Jan 18 '16 at 16:58

score 0 · Answer 2 · answered Jan 18 '16 at 14:54

I give it a try:

1. For the first point you mention, I would do something like the following (which isn't necessary for the second point):

na.omit(iris[ , which(sapply(iris, class) == "numeric")])

To separate the columns bei either numeric or character, I use the following:

iris[ , which(sapply(iris, class) == "numeric")]
iris[ , which(sapply(iris, class) == "character")]

2. The second task I combine the above line with colMeans:

colMeans(iris[ , which(sapply(iris, class) == "numeric")], na.rm = TRUE)

3. To extract the first element of the character columns, you could simply do:

iris[1, which(sapply(iris, class) == "character")]

In the mentioned case for the iris data, the first row is completly NA, even the character columns, so I would iterate to find the first-non-NA-row

k <- 1
while(any(is.na(FirstCharacterElement <- iris[k, which(sapply(iris, class) == "character")]))){
  k <- k + 1
}

Be careful about the class factor (which break up the code in the case of the iris data where column Species is of class factor and you maybe expect it to be a character column. You can check this with sapply(iris, class) and change it with e.g.

iris$Species <- as.character(iris$Species) #or with similar column names

When you read in the data you can mention the parameter stringsAsFactors = FALSE of functions read.table, read.csv or similar.

R: summarise multiple column (numeric, character) and remove NAs

2 Answers2