0

I have a set of transaction values whose range are 0-15000 USD. I've plotted a histogram specifying breaks of $250 bin values, which is helpful. What I would like to do is go back into the dataframe and create my own bin values within the data frame. The bins would specify the range that the transactions fall into, such as: 0-250, 251-499, 500-749, 750...by 250 all the way up to 15,000.

I looked at this nifty post Generate bins from a data frame regarding 'cut' and 'findInterval' but they aren't really meeting my expectations. It's either nasty factors that looks okay for low bin ranges but once I get above $x,000 I get e-values (1.27e+04, 1.3e04).

What I'd like is:

Tran ID   Amount     Bin
135       $249.22    0-250
138      $1,022.01   1000-1249
155      $10,350.11  10,249-10,500

Is this possible with 'cut' or 'findInterval' or is there a better implementation?

Community
  • 1
  • 1
Zach
  • 1,316
  • 2
  • 14
  • 21

1 Answers1

2

cut is the way to go for this problem. If you do not like the output with the brackets, you can use some data manipulation to get it to look the way you'd like.

bins <- seq(0, 15000, by=250)
Amount2 <- as.numeric(gsub("\\$|,", "", df$Amount))
labels <- gsub("(?<!^)(\\d{3})$", ",\\1", bins, perl=T)
rangelabels <- paste(head(labels,-1), tail(labels,-1), sep="-")
df$Bin <- cut(Amount2, bins, rangelabels)

We first create a sequence from 0 to 15,000 by 250. Next we format the Amount column by eliminating the dollar signs and commas and save to the variable Amount2. We then format the output labels by inserting commas after the first three digits. We will use that variable in the final Bin column.

The variable rangelabels combines the bin break-points with a hyphen. The main function is next, cut(Amount2, bins, rangelabels). The first argument, Amount2 is the data frame vector being cut. The second argument, bins supplies the breaks for the intervals. The last argument, rangelabels is the vector of names for the output resulting in:

df
  TranID     Amount           Bin
1    135    $249.22         0-250
2    138  $1,022.01   1,000-1,250
3    155 $10,350.11 10,250-10,500
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • 1
    The string manipulation was the way to go on specifying bin values as a stand alone vector! Independently, I was able to get the bins into the value ranges I wanted using {df.tn.amount$cuts <- cut(df.tn.amount$amount, seq(from = 0, to = 15000, by = 250))} -- brilliant. Another regex triumph! – Zach Oct 06 '15 at 18:47