32

If you use apply over the rows of a data.frame that has both character and numeric columns, apply calls as.matrix internally, which converts the data.frame to a character matrix. But if the numeric column contains numbers with different numbers of digits, as.matrix pads them with leading spaces to match the width of the widest number.

An example:

df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE) 
df
##   id1 id2
## 1   a 100
## 2   a  90
## 3   a   8
as.matrix(df)
##      id1 id2  
## [1,] "a" "100"
## [2,] "a" " 90"
## [3,] "a" "  8"

I would have expected the result to be:

     id1 id2  
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"

Why the extra spaces?

They can create unexpected results when using apply on a data.frame:

myfunc <- function(row){
  paste(row[1], row[2], sep = "")
}
> apply(df, 1, myfunc)
[1] "a100" "a 90" "a  8"

Looping over the rows, by contrast, gives the expected result:

> for (i in 1:nrow(df)){
  print(myfunc(df[i,]))
}
[1] "a100"
[1] "a90"
[1] "a8"

and

> paste(df[,1], df[,2], sep = "")
[1] "a100" "a90"  "a8"  

Are there any situations where the extra spaces added by as.matrix are useful?

flstd
  • Thanks for answers. I now have a better understanding of as.matrix and format and learned a few new tricks. I've updated my question, since I was also looking for a rationale behind the spaces, as they just seem to get in the way. – flstd Mar 25 '13 at 21:47
  • I ran into this exact issue when using `apply` which calls `as.matrix` internally. – qwr Jun 20 '19 at 21:48

5 Answers

24

This is because of the way non-character columns are converted in the as.matrix.data.frame method once the data frame contains any non-numeric column. There is a simple work-around, shown below.

Details

?as.matrix notes that conversion is done via format(), and it is here that the additional spaces are added. Specifically, ?as.matrix has this in the Details section:

 ‘as.matrix’ is a generic function.  The method for data frames
 will return a character matrix if there is only atomic columns and
 any non-(numeric/logical/complex) column, applying ‘as.vector’ to
 factors and ‘format’ to other non-character columns.  Otherwise,
 the usual coercion hierarchy (logical < integer < double <
 complex) will be used, e.g., all-logical data frames will be
 coerced to a logical matrix, mixed logical-integer will give a
 integer matrix, etc.

?format also notes that

Character strings are padded with blanks to the display width of the widest.

Consider this example, which illustrates the behaviour:

> format(df[,2])
[1] "100" " 90" "  8"
> nchar(format(df[,2]))
[1] 3 3 3

format doesn't have to work this way, as it has a trim argument:

trim: logical; if ‘FALSE’, logical, numeric and complex values are
      right-justified to a common width: if ‘TRUE’ the leading
      blanks for justification are suppressed.

e.g.

> format(df[,2], trim = TRUE)
[1] "100" "90"  "8"

but there is no way to pass this argument along to the as.matrix.data.frame method.

Workaround

One way to work around this is to apply format() to each column yourself via sapply, where you can pass in trim = TRUE:

> sapply(df, format, trim = TRUE)
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"

or, using vapply we can state what we expect to be returned (here character vectors of length 3 [nrow(df)]):

> vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"
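As a minimal sketch of how this workaround fits the apply() case from the question: build the character matrix yourself, then apply over its rows. The anonymous function below simply mirrors myfunc from the question; nothing here changes how apply() itself behaves.

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)

# Build the character matrix with trimmed numbers instead of letting
# apply() call as.matrix() on the data frame
m <- vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)

# Same row-wise paste as myfunc in the question
res <- apply(m, 1, function(row) paste(row[1], row[2], sep = ""))
res
# [1] "a100" "a90"  "a8"
```

Because apply() now receives a matrix, it no longer has to convert anything and the padding never appears.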
Gavin Simpson
  • Presumably, the rationale here is that using `format` is simpler than doing different things for character and date columns, right? – joran Mar 25 '13 at 15:38
  • @joran `format` has many methods for different classes out-of-the-box. Hence it *is* doing something different for numeric and Date objects (due to method dispatch on `format`). Once it is determined that there are non-numeric data, the only solution is to produce a character matrix and `format` is the easiest way to do that. – Gavin Simpson Mar 25 '13 at 15:42
  • Would it not be possible to simply change `format(xj)` to `format(xj,...)` in `as.matrix.data.frame`? This would allow us to pass `trim=TRUE` to `format`. – nograpes Mar 25 '13 at 15:48
  • Yeah, I know. I guess what I meant is that I'm not clear on why `format` would be preferred over `as.character` (which also has tons of methods out of the box). – joran Mar 25 '13 at 15:49
  • @Joran - I speculate that this is because `format` has far more methods in general, and possibly for backwards compatibility with S (S-PLUS)? – Gavin Simpson Mar 25 '13 at 15:53
  • @nograpes Not without editing it in the sources and compiling yourself, or updating the method at the start of every session with a suitably located `assignInNamespace` call. – Gavin Simpson Mar 25 '13 at 15:54
  • @GavinSimpson Right. I was thinking that this would be a nice feature addition in base. Maybe I'll ask for it on the mailing list. – nograpes Mar 25 '13 at 15:55
  • @nograpes Oh right - well yes, that would be the way to go. Won't happen for a while though, even if someone on R Core bites, as they are in freeze for R 3.0.0. Make sure you provide a good use-case when you do suggest. Perhaps sound out on R-Devel first as there may be unintended side effects of `...` passing. – Gavin Simpson Mar 25 '13 at 15:57
9

It does seem a little strange. In the manual (?as.matrix) it explains that format is called for the conversion to a character matrix:

The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns.

And you can see that if you call format directly, it does what as.matrix does:

format(df$id2)
[1] "100" " 90" "  8"

What you need to do is pass the trim argument:

format(df$id2,trim=TRUE)
[1] "100" "90"  "8" 

But, unfortunately, the as.matrix.data.frame function doesn't allow you to do that.

else if (non.numeric) {
    for (j in pseq) {
        if (is.character(X[[j]])) 
            next
        xj <- X[[j]]
        miss <- is.na(xj)
        xj <- if (length(levels(xj))) 
            as.vector(xj)
        else format(xj) # This could have ... as an argument
        # else format(xj,...)
        is.na(xj) <- miss
        X[[j]] <- xj
    }
}

So, you could modify as.matrix.data.frame. I think it would be a nice feature addition, however, to include this in base.
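Short of patching base R, the same idea can be sketched as a small wrapper. The function name as_char_matrix is hypothetical (it is not part of base R or any package), and this sketch assumes atomic columns and does not restore NA values the way as.matrix.data.frame does.

```r
# Hypothetical helper, not part of base R: builds a character matrix from a
# data frame and forwards extra arguments (e.g. trim = TRUE) to format().
as_char_matrix <- function(d, ...) {
  cols <- lapply(d, function(x) {
    if (is.character(x)) {
      x                  # character columns are left untouched
    } else if (is.factor(x)) {
      as.character(x)    # use the factor labels, not the integer codes
    } else {
      format(x, ...)     # everything else goes through format()
    }
  })
  do.call(cbind, cols)   # list names become the column names
}

df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)
m <- as_char_matrix(df, trim = TRUE)
m
```

Calling as_char_matrix(df) without trim = TRUE should reproduce the padded output, since format() is then called with its defaults.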

But a quick solution would be simply:

as.matrix(data.frame(lapply(df,as.character)))
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"  
# As mentioned in the comments, this also works:
sapply(df,as.character)
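One difference worth keeping in mind when choosing between as.character and format(..., trim = TRUE): format still gives all numbers in a vector a common number of decimal places, while as.character converts each value on its own. A small illustration with a made-up vector x (not from the question):

```r
x <- c(1.5, 10)

f <- format(x, trim = TRUE)  # common number of decimals, no padding
a <- as.character(x)         # each value converted individually

f
# [1] "1.5"  "10.0"
a
# [1] "1.5" "10"
```

For the whole-number column in the question both approaches give the same strings, so the choice only matters for columns with fractional values.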
Gavin Simpson
nograpes
  • +1 for the work-around. Note it can be simplified to `sapply(df, format, trim = TRUE)` given the nature of the simplifications that `sapply` does. To be extra certain, you could use `vapply` instead and specify the type of returned objects. – Gavin Simpson Mar 25 '13 at 15:47
  • The `as.matrix()` is fully redundant here - `sapply` is returning a matrix. Try: `class(sapply(df,as.character))` – Gavin Simpson Mar 25 '13 at 15:55
6

as.matrix calls format internally:

> format(df$id2)
[1] "100" " 90" "  8"

That's where the extra spaces come from. format has an extra argument trim to remove those:

> format(df$id2, trim = TRUE)
[1] "100" "90"  "8"  

However you cannot supply this argument to as.matrix.

Gavin Simpson
EDi
1

The reason for this behaviour is already explained in previous answers, but I'd like to offer another way of circumventing this:

df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE) 
do.call(cbind,df)
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"  

Note that this doesn't work with stringsAsFactors = TRUE, because cbind then sees the factor's integer codes rather than its labels and returns a numeric matrix.
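A short sketch of that caveat and one possible way around it. The data frame df_f and its values are made up for illustration; converting factors to character before binding is just one option.

```r
# A data frame with an explicit factor column
df_f <- data.frame(id1 = factor(c("x", "y", "x")), id2 = c(100, 90, 8))

# cbind() sees the factor's integer codes, so the result is numeric
m_codes <- do.call(cbind, df_f)
is.numeric(m_codes)
# [1] TRUE

# Convert factors to character first, then bind
df_chr <- df_f
df_chr[] <- lapply(df_f, function(x) if (is.factor(x)) as.character(x) else x)
m_chr <- do.call(cbind, df_chr)
m_chr
```

With the factor converted first, cbind() coerces the numeric column to character without padding, as in the stringsAsFactors = FALSE case.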

Jouni Helske
0

Another solution: trimWhiteSpace(x) from the limma package (Bioconductor) also does the job, if you don't mind installing the package.

install.packages("BiocManager")  # biocLite() has been retired; BiocManager replaces it
BiocManager::install("limma")
library(limma)
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE) 
as.matrix(df)
     id1 id2  
[1,] "a" "100"
[2,] "a" " 90"
[3,] "a" "  8"

trimWhiteSpace(as.matrix(df))
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"
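If installing a package is not desirable, base R's trimws() (available since R 3.2.0) can be used in much the same way. This is a sketch of an alternative, not what the answer above uses; assigning into m[] is a deliberate choice to keep the matrix dimensions and column names intact.

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)

m <- as.matrix(df)   # numbers are padded here: "100" " 90" "  8"
m[] <- trimws(m)     # strip leading/trailing whitespace, keep dim and dimnames
m
```

Note that this trims every cell, so it also removes whitespace that was genuinely part of a character column.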