I want to bin the numeric variables in a data frame. Please have a look at my example code:

x <- -10:10
y <- x^2
parab <- data.frame(x, y)
str(parab)
## 'data.frame':    21 obs. of  2 variables:
##  $ x: int  -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 ...
##  $ y: num  100 81 64 49 36 25 16 9 4 1 ...
cut(parab$x, 3) #works as expected
##  [1] (-10,-3.33]  (-10,-3.33]  (-10,-3.33]  (-10,-3.33]  (-10,-3.33] 
##  [6] (-10,-3.33]  (-10,-3.33]  (-3.33,3.33] (-3.33,3.33] (-3.33,3.33]
## [11] (-3.33,3.33] (-3.33,3.33] (-3.33,3.33] (-3.33,3.33] (3.33,10]   
## [16] (3.33,10]    (3.33,10]    (3.33,10]    (3.33,10]    (3.33,10]   
## [21] (3.33,10]   
## Levels: (-10,-3.33] (-3.33,3.33] (3.33,10]
apply(parab, 2, function(x) cut(x, 3)) #works as expected
##       x              y            
##  [1,] "(-10,-3.33]"  "(66.7,100]" 
##  [2,] "(-10,-3.33]"  "(66.7,100]" 
##  [3,] "(-10,-3.33]"  "(33.3,66.7]"
##  [4,] "(-10,-3.33]"  "(33.3,66.7]"
##  [5,] "(-10,-3.33]"  "(33.3,66.7]"
##  [6,] "(-10,-3.33]"  "(-0.1,33.3]"
##  [7,] "(-10,-3.33]"  "(-0.1,33.3]"
##  [8,] "(-3.33,3.33]" "(-0.1,33.3]"
##  [9,] "(-3.33,3.33]" "(-0.1,33.3]"
## [10,] "(-3.33,3.33]" "(-0.1,33.3]"
## [11,] "(-3.33,3.33]" "(-0.1,33.3]"
## [12,] "(-3.33,3.33]" "(-0.1,33.3]"
## [13,] "(-3.33,3.33]" "(-0.1,33.3]"
## [14,] "(-3.33,3.33]" "(-0.1,33.3]"
## [15,] "(3.33,10]"    "(-0.1,33.3]"
## [16,] "(3.33,10]"    "(-0.1,33.3]"
## [17,] "(3.33,10]"    "(33.3,66.7]"
## [18,] "(3.33,10]"    "(33.3,66.7]"
## [19,] "(3.33,10]"    "(33.3,66.7]"
## [20,] "(3.33,10]"    "(66.7,100]" 
## [21,] "(3.33,10]"    "(66.7,100]"
apply(parab, 2, function(x) if(is.numeric(x)) cut(x, 3) else x) #works as expected
##       x              y            
##  [1,] "(-10,-3.33]"  "(66.7,100]" 
##  [2,] "(-10,-3.33]"  "(66.7,100]" 
##  [3,] "(-10,-3.33]"  "(33.3,66.7]"
##  [4,] "(-10,-3.33]"  "(33.3,66.7]"
##  [5,] "(-10,-3.33]"  "(33.3,66.7]"
##  [6,] "(-10,-3.33]"  "(-0.1,33.3]"
##  [7,] "(-10,-3.33]"  "(-0.1,33.3]"
##  [8,] "(-3.33,3.33]" "(-0.1,33.3]"
##  [9,] "(-3.33,3.33]" "(-0.1,33.3]"
## [10,] "(-3.33,3.33]" "(-0.1,33.3]"
## [11,] "(-3.33,3.33]" "(-0.1,33.3]"
## [12,] "(-3.33,3.33]" "(-0.1,33.3]"
## [13,] "(-3.33,3.33]" "(-0.1,33.3]"
## [14,] "(-3.33,3.33]" "(-0.1,33.3]"
## [15,] "(3.33,10]"    "(-0.1,33.3]"
## [16,] "(3.33,10]"    "(-0.1,33.3]"
## [17,] "(3.33,10]"    "(33.3,66.7]"
## [18,] "(3.33,10]"    "(33.3,66.7]"
## [19,] "(3.33,10]"    "(33.3,66.7]"
## [20,] "(3.33,10]"    "(66.7,100]" 
## [21,] "(3.33,10]"    "(66.7,100]"
apply(parab, 2, function(x) ifelse(T, cut(x, 3), T)) #does not work!
## x y 
## 1 3
parab$z <- rep("test", length(x))
str(parab)
## 'data.frame':    21 obs. of  3 variables:
##  $ x: int  -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 ...
##  $ y: num  100 81 64 49 36 25 16 9 4 1 ...
##  $ z: chr  "test" "test" "test" "test" ...
apply(parab, 2, function(x) if(is.numeric(x)) cut(x, 3) else x) #does not work anymore?!?
##       x     y     z     
##  [1,] "-10" "100" "test"
##  [2,] " -9" " 81" "test"
##  [3,] " -8" " 64" "test"
##  [4,] " -7" " 49" "test"
##  [5,] " -6" " 36" "test"
##  [6,] " -5" " 25" "test"
##  [7,] " -4" " 16" "test"
##  [8,] " -3" "  9" "test"
##  [9,] " -2" "  4" "test"
## [10,] " -1" "  1" "test"
## [11,] "  0" "  0" "test"
## [12,] "  1" "  1" "test"
## [13,] "  2" "  4" "test"
## [14,] "  3" "  9" "test"
## [15,] "  4" " 16" "test"
## [16,] "  5" " 25" "test"
## [17,] "  6" " 36" "test"
## [18,] "  7" " 49" "test"
## [19,] "  8" " 64" "test"
## [20,] "  9" " 81" "test"
## [21,] " 10" "100" "test"

My questions

  1. Why do you have to use if and else instead of ifelse (I think it has to do with ifelse being vectorized?) ...and more importantly
  2. Why does the cut function stop working when another column is not numeric? How can I remedy the situation to get it functional again?
rawr
vonjd

1 Answer

Your problems have nothing to do with cut; they come from how the ifelse and apply functions work.

ifelse returns a result with the same length as its test argument (the first one), so when you do

ifelse(T, cut(x, 3), T)

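the test is T, which has length 1, so you get a result of length 1 for each column. That single value is the integer level code of the first binned element (1 for x, 3 for y), because ifelse does not keep the factor attributes.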
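A minimal sketch of this behaviour, using the same x as in the question. The names binned, short, and full are chosen here for illustration only:

```r
x <- -10:10
binned <- cut(x, 3)                 # factor of length 21

# The test has length 1, so the result has length 1.
short <- ifelse(TRUE, binned, TRUE)
length(short)

# A test as long as the data gives a full-length result, but ifelse
# still drops the factor attributes and returns integer level codes.
full <- ifelse(rep(TRUE, length(x)), binned, NA)
length(full)
is.factor(full)
```

This is why a plain if/else, which returns whichever branch it evaluates unchanged, is the better fit when the condition is a single TRUE or FALSE per column.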

The other issue is how apply handles its input. From the apply documentation:

 If ‘X’ is not an array but an object of a class with a non-null
 ‘dim’ value (such as a data frame), ‘apply’ attempts to coerce it
 to an array via ‘as.matrix’ if it is two-dimensional (e.g., a data
 frame) or via ‘as.array’.

You added a non-numeric column to your data.frame. apply first converts the data.frame to a matrix. A matrix can hold only one type, and character wins over numeric, so every column, including the ones you thought were numbers, becomes character. is.numeric(x) is then FALSE for every column, so your if/else takes the else branch and returns the values unchanged.
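The coercion can be checked directly. This is a small sketch that rebuilds the data frame from the question; num_mat, chr_mat, and col_is_num are illustrative names:

```r
parab <- data.frame(x = -10:10, y = (-10:10)^2)

# With only numeric columns, the matrix stays numeric.
num_mat <- as.matrix(parab)
is.numeric(num_mat)

# Adding a character column turns the whole matrix into character.
parab$z <- "test"
chr_mat <- as.matrix(parab)
typeof(chr_mat)
is.numeric(chr_mat[, "x"])

# The columns of the data frame itself keep their own types.
col_is_num <- sapply(parab, is.numeric)
col_is_num
```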

To do what you want you can use:

parab[] <- lapply(parab, function(x) if(is.numeric(x)) cut(x, 3) else x)

(Thanks to @PierreLafortune for this version)
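As a quick check of the line above, here is a sketch that rebuilds the three-column data frame from the question and inspects the result. lapply walks over the columns as a list, so each column keeps its own type when the function sees it:

```r
parab <- data.frame(x = -10:10, y = (-10:10)^2, z = "test",
                    stringsAsFactors = FALSE)

# parab[] keeps the data frame structure and replaces only the columns.
parab[] <- lapply(parab, function(col) if (is.numeric(col)) cut(col, 3) else col)

sapply(parab, class)   # x and y are now factors, z is still character
levels(parab$x)
```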

Dason
  • Thank you, this is helpful - how can I make it work? – vonjd Feb 26 '16 at 18:05
  • Use `parab[] <- lapply(parab, function(x) if(is.numeric(x)) cut(x, 3) else x)` – Pierre L Feb 26 '16 at 18:24
  • `as.data.frame(lapply(parab, function(x) if(is.numeric(x)) cut(x, 3) else x))` does also work - Thank you all! – vonjd Feb 26 '16 at 18:27
  • vonjd one issue with that is it converts the characters to factors by default (which is why I previously specified stringsAsFactors=FALSE in my call to as.data.frame). @PierreLafortune's version is nicer. I always forget about dataframe[] <- lapply(input, func). – Dason Feb 26 '16 at 18:28
  • Thank you, I think I will never forget it again because this is a very useful function to apply a function columnwise to a dataframe – vonjd Feb 26 '16 at 18:36
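
The two variants discussed in the comments can be compared side by side. This is only a sketch: bin_numeric, res1, and res2 are names made up here. Passing stringsAsFactors = FALSE explicitly keeps the behaviour the same across R versions (in R 4.0.0 and later, FALSE is already the default).

```r
parab <- data.frame(x = -10:10, y = (-10:10)^2, z = "test",
                    stringsAsFactors = FALSE)

# Hypothetical helper: bin numeric columns, leave everything else alone.
bin_numeric <- function(col) if (is.numeric(col)) cut(col, 3) else col

# Variant 1: build a new data frame from the list that lapply returns.
res1 <- as.data.frame(lapply(parab, bin_numeric), stringsAsFactors = FALSE)

# Variant 2: assign the columns back into a copy of the data frame.
res2 <- parab
res2[] <- lapply(parab, bin_numeric)

sapply(res1, class)
sapply(res2, class)
```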