
I am having trouble converting a numerical variable into a categorical one. I want to divide my column "Price" into 20 bins (so that I can then fit a classification tree).

I tried the function cut(), and it worked, but my intervals are expressed in scientific notation.

Here is a sample of my data:

Mydata <- data.frame(
Price = c(13500,13750,13950,14950,13750,12950)
)

Here is my code:

Mydata[,2] = cut(Mydata$Price, 3, include.lowest=TRUE)

Then my second column contains values like (3.11e+04,3.25e+04]. I also tried the argument labels = FALSE, but that is not what I'm looking for: the bins are then expressed as integer codes (1, 2, 3, ..., 20). I want them expressed as intervals, such as [0, 1000], [1000, 2000], etc.

Thanks in advance for your help

Jay
  • You should provide reproducible example data as described [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – tobiasegli_te Nov 08 '17 at 17:08
  • If your goal is a classification tree, why not let the tree fitting algorithm determine optimal cutpoints by giving it a numeric variable? – Gregor Thomas Nov 08 '17 at 17:08
  • Sorry - I've edited my question. This is for an exercise at school, and they ask us to create a new variable that categorizes the price into 20 bins ... I'm not sure that the tree fitting code in R can divide my data into 20 bins, as the algorithm stops at some point – Jay Nov 08 '17 at 17:27

2 Answers


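I found a solution: the dig.lab argument of cut(), which sets the number of digits used when formatting the break points in the labels (the default is 3).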

Mydata <- data.frame(
Price = c(13500,13750,13950,14950,13750,12950)
)

Here is the updated code:

Mydata[,2] = cut(Mydata$Price, 3, include.lowest=TRUE, dig.lab = 5)
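As a rough sketch of what dig.lab changes, the snippet below compares the labels with and without it on the sample prices. The variable names (price, default_bins, wide_bins, fixed_bins) and the round break points 12000 to 15000 are only assumptions for illustration; they are not part of the original exercise.

```r
# Sample prices taken from the question
price <- c(13500, 13750, 13950, 14950, 13750, 12950)

# Default dig.lab = 3: break points are formatted with 3 significant digits,
# so five-digit prices end up in scientific notation
default_bins <- cut(price, 3, include.lowest = TRUE)

# dig.lab = 5 allows 5 digits, which is enough for these prices
wide_bins <- cut(price, 3, include.lowest = TRUE, dig.lab = 5)

# Assumption: hand-picked round break points instead of a bin count,
# which gives intervals with "nice" limits
fixed_bins <- cut(price,
                  breaks = seq(12000, 15000, by = 1000),
                  include.lowest = TRUE,
                  dig.lab = 5)

levels(default_bins)  # labels in scientific notation
levels(wide_bins)     # labels with plain digits
levels(fixed_bins)    # labels with round limits
```

If the real prices have more than five digits, dig.lab would need to be raised accordingly.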

Thanks anyway for your tips :)

Jay

If your object Mydata has class matrix, then I have an idea of what may be going on:

The cut() function returns a factor that looks like this:

> x <- runif(10, 0, 2)
> cut(x, 2)
 [1] (1.01,1.95]  (1.01,1.95]  (0.069,1.01] (1.01,1.95]  (1.01,1.95]  (1.01,1.95]  (1.01,1.95]  (1.01,1.95]  (1.01,1.95] 
[10] (1.01,1.95] 
Levels: (0.069,1.01] (1.01,1.95]

The cut() function returns a set of intervals (as you requested), and the output is a factor. This is important. Now watch what happens when I force the output to be numeric:

> as.numeric(cut(x, 2))
[1] 2 2 1 2 2 2 2 2 2 2

That's a numeric vector. Why does this matter? Because objects of class matrix in R can only have one type. Any new values added to a matrix will be coerced to the type of the rest of the entries in the matrix. Watch:

> X_mat <- matrix(1:10L, nrow = 10, ncol = 2)
> X_mat[, 2] <- cut(x, 2)
> X_mat
      [,1] [,2]
 [1,]    1    2
 [2,]    2    2
 [3,]    3    1
 [4,]    4    2
 [5,]    5    2
 [6,]    6    2
 [7,]    7    2
 [8,]    8    2
 [9,]    9    2
[10,]   10    2

The intervals returned by cut() are gone: the factor was reduced to its underlying integer codes to match the rest of the matrix X_mat. What if we use a data frame instead?

> X_df <- data.frame(x1 = 1:10L)
> X_df[, 2] <- cut(x, 2)
> X_df
  x1           V2
1   1  (1.01,1.95]
2   2  (1.01,1.95]
3   3 (0.069,1.01]
4   4  (1.01,1.95]
5   5  (1.01,1.95]
6   6  (1.01,1.95]
7   7  (1.01,1.95]
8   8  (1.01,1.95]
9   9  (1.01,1.95]
10 10  (1.01,1.95] 

Basically, if you want to preserve the structure of the output from cut(), your data needs to be in a data.frame instead of a matrix. Hope this helps!
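A quick way to check which case you are in is to test the column's class. This is only a sketch: the seed and the names bins, m and df are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
x <- runif(10, 0, 2)
bins <- cut(x, 2)

# Matrix: only the factor's integer codes are stored
m <- matrix(1:20, nrow = 10, ncol = 2)
m[, 2] <- bins

# Data frame: the factor is stored as-is
df <- data.frame(x1 = 1:10)
df$bin <- bins

is.factor(m[, 2])  # FALSE, the interval labels were dropped
is.factor(df$bin)  # TRUE, the interval labels are kept

# If the original factor is still around, the labels can be looked up
# again from the integer codes
recovered <- levels(bins)[m[, 2]]
```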

Gabriel J. Odom
  • Thanks, I've just edited my question to give a sample of my data as an example, and it is a data.frame – Jay Nov 08 '17 at 17:34