I am using the R programming language. On a larger data set, I ran the following code to fit a decision tree:

#load library
library(rpart)

#generate data
a = rnorm(100, 7000000, 10)

b = rnorm(100, 5000000, 5)

c = rnorm(100, 400000, 10)

group <- sample( LETTERS[1:2], 100, replace=TRUE, prob=c(0.5,0.5) )

group_1 <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25) )


d = data.frame(a,b,c, group, group_1)
d$group = as.factor(d$group)
d$group_1 = as.factor(d$group_1)

#fit model
tree <- rpart(group ~ ., d)

#visualize results
plot(tree)

text(tree, use.n=TRUE, minlength = 0, xpd=TRUE, cex=.8)

In the visual output, the numbers are displayed in scientific notation (e.g. 4.21e+06). Is there a way to disable this?

I consulted this previous answer on Stack Overflow: How to disable scientific notation?

I then tried the following command: options(scipen=999)

But this did not seem to fix the problem.

Can someone please tell me what I am doing wrong?

Thanks

stats_noob

1 Answer

I think the labels.rpart function has scientific notation hard-coded in: it uses a private function called formatg, which does the formatting using sprintf() with a %g format, and so it ignores options(scipen). You can override this by replacing formatg with a different function. Here's a dangerous way to do that:

oldformatg <- rpart:::formatg
assignInNamespace("formatg", format, "rpart")

which replaces formatg with the standard format function. This will definitely have dangerous side effects, so afterwards you should change it back using

assignInNamespace("formatg", oldformatg, "rpart")
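If you do go the replacement route, one way to limit the damage is to make the swap temporary and let R undo it automatically. The following is only a sketch: with_plain_labels and plain are hypothetical names of my own, not part of rpart, and the code still modifies rpart's namespace while the expression runs, so the warning above still applies. It assumes formatg is called as formatg(x, digits), as in the rpart source I looked at.

```r
library(rpart)

# Hypothetical helper (not part of rpart): replace rpart's private formatg
# only while `expr` is evaluated, then restore the original on exit,
# even if `expr` throws an error.
with_plain_labels <- function(expr) {
  old_formatg <- get("formatg", envir = asNamespace("rpart"))
  on.exit(assignInNamespace("formatg", old_formatg, ns = "rpart"), add = TRUE)

  # Stand-in formatter: same (x, digits) interface, but never scientific.
  # Each value is formatted on its own so that one cut point does not
  # change the number of decimals shown for another.
  plain <- function(x, digits = getOption("digits"), ...) {
    out <- vapply(as.vector(x), format, character(1),
                  digits = digits, scientific = FALSE, USE.NAMES = FALSE)
    if (is.matrix(x)) matrix(out, nrow = nrow(x)) else out
  }

  assignInNamespace("formatg", plain, ns = "rpart")
  force(expr)  # the lazily evaluated argument runs here, after the swap
}

# Toy data (an assumption for illustration): one large-valued predictor
toy <- data.frame(x = 7000000 + 1:40,
                  y = factor(rep(c("A", "B"), each = 20)))
fit <- rpart(y ~ x, data = toy)

# Split labels produced while the replacement is active
plain_labels <- with_plain_labels(labels(fit))
print(plain_labels)
```

For the plot in the question, the same helper would wrap the labelling call, e.g. with_plain_labels(text(tree, use.n=TRUE, minlength = 0, xpd=TRUE, cex=.8)).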

A better solution would be to rescale your data. rpart switches to scientific notation only for big numbers (with a %g format, roughly when the integer part has more digits than the requested precision), so you could divide the large values by something like 1000 or 1000000, and describe them as being in different units. For your example, this works for me:

library(rpart)

#generate data
set.seed(123)
a = rnorm(100, 7000000, 10)/1000

b = rnorm(100, 5000000, 5)/1000

c = rnorm(100, 400000, 10)/1000

group <- sample( LETTERS[1:2], 100, replace=TRUE, prob=c(0.5,0.5) )

group_1 <- sample( LETTERS[1:4], 100, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25) )


d = data.frame(a,b,c, group, group_1)
d$group = as.factor(d$group)
d$group_1 = as.factor(d$group_1)

#fit model
tree <- rpart(group ~ ., d)

#visualize results
plot(tree)

text(tree, use.n=TRUE, minlength = 0, xpd=TRUE, cex=.8)

Created on 2021-01-27 by the reprex package (v0.3.0)
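To see why rescaling helps and why options(scipen) does not, here is a small base R sketch of the %g behaviour described above. The precision of 4 is an assumption on my part, chosen to match what I believe is the default number of digits used for the labels; the values are just the magnitudes from the question.

```r
# Set scipen as in the question, remembering the old options to restore them
old_opts <- options(scipen = 999)

big   <- 7000000   # same magnitude as column `a` in the question
small <- 7000      # the same value after dividing by 1000

g_big    <- sprintf("%.4g", big)     # exponent 6 >= precision 4: exponential form
g_small  <- sprintf("%.4g", small)   # exponent 3 <  precision 4: fixed form
g_digits <- sprintf("%.7g", big)     # exponent 6 <  precision 7: fixed form
f_big    <- format(big, digits = 4)  # format() does respect scipen

options(old_opts)

print(c(g_big, g_small, g_digits, f_big))
# "7e+06" "7000" "7000000" "7000000"
```

The third line suggests another thing that may be worth trying: text() for rpart objects has a digits argument, so something like text(tree, digits = 7, use.n=TRUE) might keep the labels in fixed notation without touching the package or the data. I have not verified this against the data in the question.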

user2554330
  • Thank you for your reply! I am reading and trying to understand your answer. Why did you describe it as a "dangerous way"? – stats_noob Jan 27 '21 at 16:25
  • I would prefer not to divide the numbers. Is there a non-dangerous way to suppress the scientific notation? Thank you for your help – stats_noob Jan 27 '21 at 16:26
  • It is dangerous to modify code that doesn't belong to you, because it may be used in ways you aren't taking into account. As far as I can see, only the maintainer of the `rpart` package could safely make the change you want. `rpart` source is here: https://github.com/bethatkinson/rpart ; you could leave a message asking them to respect the `scipen` setting. – user2554330 Jan 27 '21 at 16:59