62

I'm trying to get the min/max for each column in a large data frame, as part of getting to know my data. My first try was:

apply(t,2,max,na.rm=1)

It treats everything as a character vector, because the first few columns are character types. So max of some of the numeric columns is coming out as " -99.5".

I then tried this:

sapply(t,max,na.rm=1)

but it complains about max not meaningful for factors. (lapply is the same.) What is confusing me is that apply thought max was perfectly meaningful for factors, e.g. it returned "ZEBRA" for column 1.

BTW, I took a look at Using sapply on vector of POSIXct and one of the answers says "When you use sapply, your objects are coerced to numeric,...". Is this what is happening to me? If so, is there an alternative apply function that does not coerce? Surely it is a common need, as one of the key features of the data frame type is that each column can be a different type.

Darren Cook
  • 2
    I would pass on only the columns that have a meaningful data-type to calculate your statistic. – Roman Luštrik Sep 05 '11 at 09:01
  • @Roman Thanks, that in fact is what I did yesterday, as in this particular case I already had a list of numeric column names. But it can become time-consuming for large data frames. – Darren Cook Sep 06 '11 at 03:59
  • 1
    You can find the columns that are numeric and automate the process. – Roman Luštrik Sep 06 '11 at 06:48
  • @DarrenCook As an approach: if you read the file with stringsAsFactors = FALSE and, before using `apply`, set each column to the class it is supposed to have (e.g. dates via as.POSIXct, numbers as numeric), is that easier than wrangling with coercion inside `sapply`? – vagabond Oct 30 '14 at 22:05
  • This is an excellent question, and there still isn't really a satisfactory method for applying functions to a data.frame with mixed types. The only solution that preserves the type of each column is to use a for loop; there is no lapply method for data.frames. – Ben Rollert Aug 27 '15 at 02:23
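
Roman's suggestion of finding the numeric columns automatically can be sketched as follows. This is only an illustration: the data frame df and its column names are made up for the example and are not from the question.

```r
# Hypothetical example data: one factor column, two numeric columns
df <- data.frame(
  name   = factor(c("ANT", "ZEBRA", "CAT")),
  weight = c(0.5, 300, 4),
  temp   = c(-99.5, 12, NA)
)

# Logical vector marking which columns are numeric
is_num <- vapply(df, is.numeric, logical(1))

# range() returns c(min, max); sapply() binds the per-column results
# into a matrix with one column per numeric variable
ranges <- sapply(df[is_num], range, na.rm = TRUE)
rownames(ranges) <- c("min", "max")
ranges
```

Non-numeric columns are simply skipped here, so they would need separate handling.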

8 Answers

49

If it were an "ordered factor", things would be different. That is not to say I like ordered factors (I don't), only that some relations are defined for ordered factors that are not defined for plain factors. Factors are treated as ordinary, unordered categorical variables. What you are seeing from apply is the natural sort order of the labels as character strings, which is lexical order for your locale. If you want automatic coercion to numeric for every column (dates, factors and all), then try:

sapply(df, function(x) max(as.numeric(x)) )   # not generally a useful result

Or if you want to test for factors first and return as you expect then:

sapply(df, function(x) if (is.factor(x)) {
            # convert the labels, not the integer codes; non-numeric labels become NA
            max(as.numeric(as.character(x)))
            } else { max(x) } )

@Darren's suggestion in the comments does work better:

sapply(df, function(x) max(as.character(x)) )

max does succeed with character vectors, but note that this version compares the numeric columns as strings too.

IRTFM
  • Thanks. The 2nd sapply example works and answers the question perfectly (I found it worked even better if removing the as.numeric() clause, and let max work directly on the character strings) – Darren Cook Sep 06 '11 at 03:54
  • Yes, that would generally be more useful. – IRTFM May 03 '14 at 17:12
21

The reason that max works with apply is that apply is coercing your data frame to a matrix first, and a matrix can only hold one data type. So you end up with a matrix of characters. sapply is just a wrapper for lapply, so it is not surprising that both yield the same error.
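
A small sketch of that coercion, using a made-up two-column data frame (the names df, name and value are assumptions for illustration only):

```r
# Hypothetical data: one factor column and one numeric column
df <- data.frame(
  name  = factor(c("ANT", "ZEBRA", "CAT")),
  value = c(-99.5, 3, 10)
)

# apply() calls as.matrix() on its input; with a non-numeric column
# present, every cell becomes a character string
m <- as.matrix(df)
typeof(m)

# So max() inside apply() compares strings, not numbers
apply_max <- apply(df, 2, max)
apply_max

# lapply()/sapply() instead hand each column to the function with its
# own class intact, which is why max() then rejects the factor column
col_classes <- sapply(df, class)
col_classes
```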

Before R 4.0.0, the default behavior when you create a data frame was for character columns to be stored as factors (since R 4.0.0 the default is stringsAsFactors = FALSE). Unless you specify that a factor is ordered, operations like max and min are undefined for it, since R assumes that you've created an unordered factor.

You can change this behavior by specifying options(stringsAsFactors = FALSE), which will change the default for the entire session, or you can pass stringsAsFactors = FALSE in the data.frame() construction call itself. Note that this just means that min and max will assume "alphabetical" ordering by default.

Or you can manually specify an ordering for each factor, although I doubt that's what you want to do.
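
To illustrate the difference between unordered and ordered factors, here is a minimal sketch. The size vector and its level order are invented for the example.

```r
size <- c("small", "large", "medium", "small")

# Unordered factor: max() is not defined and raises an error,
# which is caught here so the script keeps running
unordered <- factor(size)
result <- tryCatch(max(unordered), error = function(e) conditionMessage(e))
result

# Ordered factor with an explicit level order: max() returns the
# highest level according to that order
ordered_size <- factor(size, levels = c("small", "medium", "large"),
                       ordered = TRUE)
largest <- max(ordered_size)
largest
```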

Regardless, sapply will generally yield an atomic vector, which will entail converting everything to characters in many cases. One way around this is as follows:

library(plyr)

# Some test data
d <- data.frame(v1 = runif(10), v2 = letters[1:10], 
                v3 = rnorm(10), v4 = LETTERS[1:10], stringsAsFactors = TRUE)

d[4,] <- NA

# Similar function to DWin's answer: numeric columns are compared as numbers,
# everything else (including factors) as character strings
fun <- function(x){
    if(is.numeric(x)){max(x, na.rm = TRUE)}
    else{max(as.character(x), na.rm = TRUE)}
}   

# Use colwise from the plyr package
colwise(fun)(d)
         v1 v2       v3 v4
1 0.8478983  j 1.999435  J
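
If plyr is not available, a base R sketch along the same lines is possible. It relies on lapply() returning a list, so each column's result keeps its own type. The helper name col_max and the small data frame are assumptions for this example.

```r
# Hypothetical data with a numeric and a factor column
d <- data.frame(v1 = c(0.2, 0.9, NA), v2 = c("a", "c", "b"),
                stringsAsFactors = TRUE)

col_max <- function(x) {
  if (is.factor(x)) x <- as.character(x)  # compare factor labels as strings
  max(x, na.rm = TRUE)
}

# lapply() returns a list, so each result keeps its own type
res_list <- lapply(d, col_max)
str(res_list)

# sapply() simplifies that list to one atomic vector, here character
res_vec <- sapply(d, col_max)

# A one-row data frame, similar in shape to the colwise() result
res_df <- as.data.frame(res_list, stringsAsFactors = FALSE)
res_df
```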
joran
  • Thanks for the detailed explanation, very helpful. stringsAsFactors = FALSE does make max() work as expected (but then I realized I actually wanted those fields to be factors; so casting the factors into strings when running max() works best for me). – Darren Cook Sep 06 '11 at 03:57
7

If you want to get to know your data, summary(df) provides the min, 1st quartile, median, mean, 3rd quartile and max of the numeric columns, and the frequencies of the most common levels of the factor columns.

Itamar
  • Yes, with hindsight, I should've just used that :-) Its output is a bit ugly (I wanted one field per row, with a column of minimums, a column of maximums, etc.) but I suppose I just have to track down how to reformat table objects. – Darren Cook Sep 06 '11 at 03:36
  • Another thing I would recommend is looking at the code from `summary()`. A lot of times I'll find a base function that does close to what I'm looking for and grab the general ideas for the code from there. – Rob Feb 08 '13 at 17:38
  • sadly, summary() is also not extensible. there is no easy way to add a mean function to it, for example. – ivo Welch Mar 04 '16 at 00:55
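
The layout asked for in the comments (one row per field, with a column of minimums and a column of maximums) can be sketched in base R without reformatting the summary() table. The function name column_summary is made up for this example, and the results are formatted as strings so that numeric and character columns fit in the same output column.

```r
column_summary <- function(df) {
  # Apply f to every column, converting factors to character first and
  # formatting each result as a string
  stat <- function(f) {
    vapply(df, function(x) {
      if (is.factor(x)) x <- as.character(x)
      format(f(x, na.rm = TRUE))
    }, character(1))
  }
  data.frame(
    column = names(df),
    class  = vapply(df, function(x) class(x)[1], character(1)),
    min    = stat(min),
    max    = stat(max),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Hypothetical example data
df <- data.frame(name  = factor(c("ANT", "ZEBRA", "CAT")),
                 value = c(-99.5, 3, 10))
s <- column_summary(df)
s
```

Further statistics could be added as extra columns in the same way, for columns where they are defined.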
2

Building on @Itamar's answer:
Use summary and munge the output into something useful!

library(tidyr)
library(dplyr)

df %>% 
  summary %>% 
  data.frame %>%
  select(-Var1) %>%
  separate(data=.,col=Freq,into = c('metric','value'),sep = ':') %>%
  rename(column_name=Var2) %>%
  mutate(value=as.numeric(value),
         metric = trimws(metric,'both') 
  ) %>%  
  filter(!is.na(value)) -> metrics

It's not pretty and it is certainly not fast but it gets the job done!

hibernado
2

The best way to do this is to avoid base apply(), which coerces the entire data frame to a matrix of a single type, possibly losing information.

If you wanted to apply a function as.numeric to every column, a simple way is using mutate_all from dplyr:

t %>% mutate_all(as.numeric)

Alternatively use colwise from plyr, which will "turn a function that operates on a vector into a function that operates column-wise on a data.frame."

t %>% (colwise(as.numeric))

In the special case of reading in a data table of character vectors and coercing columns into the correct data type, use type.convert or type_convert from readr.


Less interesting answer: we can apply on each column with a for-loop:

for (i in seq_along(t)) { t[[i]] <- readr::parse_guess(t[[i]]) }  # loop over columns, not rows

I don't know of a good way of doing assignment with *apply while preserving data frame structure.
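
One base R idiom that may cover this case is assigning into df[], which replaces the columns while keeping the object a data frame. The sketch below uses type.convert() from base R. The data frame and its column names are invented for illustration.

```r
# Hypothetical data: every column read in as character
df <- data.frame(id    = c("1", "2", "3"),
                 score = c("0.5", "1.5", "2.5"),
                 label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# Assigning into df[] replaces the columns one by one but keeps the
# data frame itself (names, row names, class)
df[] <- lapply(df, type.convert, as.is = TRUE)

# Each column now has its own type
sapply(df, class)
```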

qwr
  • just note that `colwise` does not require the object to be an array to work (anymore), it requires the base type to be `data.frame`. – stucash Jan 04 '20 at 15:51
1

These days loops are just as fast, so this is more than sufficient:

d <- data.frame(a = c("1","2","3"), b = c("1","3","3"))
col_max <- numeric(ncol(d))
for (i in seq_along(d)) {
    # convert each column to numeric before taking its max
    col_max[i] <- max(as.numeric(d[[i]]))
}
0

A solution using retype() from hablar to coerce factors to character or numeric type, depending on feasibility. I'd use dplyr for applying max to each column.

Code

library(dplyr)
library(hablar)

# retype() simplifies each column's type, e.g. it always removes factors
d <- d %>% retype()

# Check max for each column
d %>% summarise_all(max)

Result

Note the new column types.

     v1 v2       v3 v4   
  <dbl> <chr> <dbl> <chr>
1 0.974 j      1.09 J   

Data

# Sample data borrowed from @joran
d <- data.frame(v1 = runif(10), v2 = letters[1:10], 
                v3 = rnorm(10), v4 = LETTERS[1:10],stringsAsFactors = TRUE)
davsjob
0
df <- head(mtcars)
df$string <- c("a","b", "c", "d","e", "f"); df

my.min <- unlist(lapply(df, min))
my.max <- unlist(lapply(df, max))
Seyma Kalay