Define and apply custom bins on a dataframe

Question

Using python I have created following data frame which contains similarity values:

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture    jaccard
1       0.770     0.489        0.388  0.57500000 0.5845137    0.3920000 0.00000000
2       0.067     0.496        0.912  0.13865546 0.6147309    0.6984127 0.00000000
3       0.514     0.426        0.692  0.36440678 0.4787535    0.5198413 0.05882353
4       0.102     0.430        0.739  0.11297071 0.5288008    0.5436508 0.00000000
5       0.560     0.735        0.554  0.48148148 0.8168083    0.4603175 0.00000000
6       0.029     0.302        0.558  0.08547009 0.3928234    0.4603175 0.00000000

I am trying to write a R script to generate another data frame that reflects the bins, but my condition of binning applies if the value is above 0.5 such that

Pseudocode:

if (cosinFcolor > 0.5 & cosinFcolor <= 0.6)
   bin = 1
if (cosinFcolor > 0.6 & cosinFcolor <= 0.7)
   bin = 2
if (cosinFcolor > 0.7 & cosinFcolor =< 0.8)
   bin = 3
if (cosinFcolor > 0.8 & cosinFcolor <=0.9)
   bin = 4
if (cosinFcolor > 0.9 & cosinFcolor <= 1.0)
   bin = 5
else
   bin = 0

Based on above logic, I want to build a data frame

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture    jaccard
1       3         0         0            1           1        0               0

How can I start this as a script, or should I do this in python? I am trying to get familiar with R after finding out how powerful it is/number of machine learning packages it has. My goal is to build a classifier but first I need be familiar with R :)

For those open to a `data.table` approach, I wrote a flexible [bin_data()](https://github.com/ben519/mltools/blob/master/R/bin_data.R) method which I described in [this answer](http://stackoverflow.com/a/39560604/2146894). — Ben, Sep 18 '16 at 17:32
Looks like you want to apply the exact same bins to all 7 columns, not just `cosinFcolor` — smci, Sep 16 '18 at 11:30

score 60 · Accepted Answer · edited Dec 25 '22 at 01:42

Another cut answer that takes into account extrema:

dat <- read.table("clipboard", header=TRUE)

cuts <- apply(dat, 2, cut, c(-Inf,seq(0.5, 1, 0.1), Inf), labels=0:6)
cuts[cuts=="6"] <- "0"
cuts <- as.data.frame(cuts)

  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1           3         0            0           1         1            0       0
2           0         0            5           0         2            2       0
3           1         0            2           0         0            1       0
4           0         0            3           0         1            1       0
5           1         3            1           0         4            0       0
6           0         0            1           0         0            0       0

Explanation

The cut function splits into bins depending on the cuts you specify. So let's take 1:10 and split it at 3, 5 and 7.

cut(1:10, c(3, 5, 7))
 [1] <NA>  <NA>  <NA>  (3,5] (3,5] (5,7] (5,7] <NA>  <NA>  <NA> 
Levels: (3,5] (5,7]

You can see how it has made a factor where the levels are those in between the breaks. Also notice it doesn't include 3 (there's an include.lowest argument which will include it). But these are terrible names for groups, let's call them group 1 and 2.

cut(1:10, c(3, 5, 7), labels=1:2)
 [1] <NA> <NA> <NA> 1    1    2    2    <NA> <NA> <NA>

Better, but what's with the NAs? They are outside our boundaries and not counted. To count them, in my solution, I added -infinity and infinity, so all points would be included. Notice that as we have more breaks, we'll need more labels:

x <- cut(1:10, c(-Inf, 3, 5, 7, Inf), labels=1:4)
 [1] 1 1 1 2 2 3 3 4 4 4
Levels: 1 2 3 4

Ok, but we didn't want 4 (as per your problem). We wanted all the 4s to be in group 1. So let's get rid of the entries which are labelled '4'.

x[x=="4"] <- "1"
 [1] 1 1 1 2 2 3 3 1 1 1
Levels: 1 2 3 4

This is slightly different to what I did before, notice I took away all the last labels at the end before, but I've done it this way here so you can better see how cut works.

Ok, the apply function. So far, we've been using cut on a single vector. But you want it used on a collection of vectors: each column of your data frame. That's what the second argument of apply does. 1 applies the function to all rows, 2 applies to all columns. Apply the cut function to each column of your data frame. Everything after cut in the apply function are just arguments to cut, which we discussed above.

Any chance you could explain the things that you are doing, I would love to get the logic around and really learn it rather than just coping it. — add-semi-colons, Aug 15 '12 at 19:10
What if I the buckets do not follow a precise sequence? What if these are custom buckets within another dataframe? — BlackHat, May 19 '15 at 19:27
@user3116753 The sequence was just for example. In my explanation, you'll see I've used custom splits. — sebastian-c, Aug 10 '15 at 05:25

score 26 · Answer 2 · edited Aug 15 '12 at 06:02

26

You can also use findInterval:

findInterval(seq(0, 1, l=20), seq(0.5, 1, by=0.1))

## [1] 0 0 0 0 0 0 0 0 0 1 1 2 2 3 4 4 5 5

edited Aug 15 '12 at 06:02

sebastian-c

15,057
3
47
93

answered Aug 15 '12 at 03:14

mnel

113,303
27
265
254

2

Yes. A very useful function. Lets you avoid creating messy factors with cut(). – IRTFM Aug 30 '12 at 21:15
2

You don't have to have a messy factor with cut. You can set labels = False to get integer codes rather than factors, but without sacrificing the flexibility that cut() affords you. – dsh Apr 27 '17 at 23:58

score 16 · Answer 3 · answered Aug 15 '12 at 03:09

With cut it's easy as pie

dtf <- read.table(
textConnection(
"cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1 0.770 0.489 0.388 0.57500000 0.5845137 0.3920000 0.00000000
2 0.067 0.496 0.912 0.13865546 0.6147309 0.6984127 0.00000000
3 0.514 0.426 0.692 0.36440678 0.4787535 0.5198413 0.05882353
4 0.102 0.430 0.739 0.11297071 0.5288008 0.5436508 0.00000000
5 0.560 0.735 0.554 0.48148148 0.8168083 0.4603175 0.00000000
6 0.029 0.302 0.558 0.08547009 0.3928234 0.4603175 0.00000000"), sep = " ", 
           header = TRUE)

dtf$bin <- cut(dtf$cosinFcolor, breaks = c(0, seq(0.5, 1, by = .1)), labels = 0:5)
dtf
  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture    jaccard bin
1       0.770     0.489        0.388  0.57500000 0.5845137    0.3920000 0.00000000   3
2       0.067     0.496        0.912  0.13865546 0.6147309    0.6984127 0.00000000   0
3       0.514     0.426        0.692  0.36440678 0.4787535    0.5198413 0.05882353   1
4       0.102     0.430        0.739  0.11297071 0.5288008    0.5436508 0.00000000   0
5       0.560     0.735        0.554  0.48148148 0.8168083    0.4603175 0.00000000   1
6       0.029     0.302        0.558  0.08547009 0.3928234    0.4603175 0.00000000   0

score 2 · Answer 4 · answered Jul 13 '17 at 03:51

Here's another solution using the bin_data() function from the mltools package.

Binning one vector

library(mltools)

cosinFcolor <- c(0.77, 0.067, 0.514, 0.102, 0.56, 0.029)
binned <- bin_data(cosinFcolor, bins=c(0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0), boundaryType = "[lorc")

binned
[1] (0.7, 0.8] [0, 0.5]   (0.5, 0.6] [0, 0.5]   (0.5, 0.6] [0, 0.5]  
Levels: [0, 0.5] < (0.5, 0.6] < (0.6, 0.7] < (0.7, 0.8] < (0.8, 0.9] < (0.9, 1]

# Convert to numbers 0, 1, ...
as.integer(binned) - 1L

Binning each column in the data.frame

df <- read.table(textConnection(
  "cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
0.770 0.489 0.388 0.57500000 0.5845137 0.3920000 0.00000000
0.067 0.496 0.912 0.13865546 0.6147309 0.6984127 0.00000000
0.514 0.426 0.692 0.36440678 0.4787535 0.5198413 0.05882353
0.102 0.430 0.739 0.11297071 0.5288008 0.5436508 0.00000000
0.560 0.735 0.554 0.48148148 0.8168083 0.4603175 0.00000000
0.029 0.302 0.558 0.08547009 0.3928234 0.4603175 0.00000000"
), sep = " ", header = TRUE)

for(col in colnames(df)) df[[col]] <- as.integer(bin_data(df[[col]], bins=c(0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0), boundaryType = "[lorc")) - 1L

df
  cosinFcolor cosinEdge cosinTexture histoFcolor histoEdge histoTexture jaccard
1           3         0            0           1         1            0       0
2           0         0            5           0         2            2       0
3           1         0            2           0         0            1       0
4           0         0            3           0         1            1       0
5           1         3            1           0         4            0       0
6           0         0            1           0         0            0       0

"lorc" stands for "left-open right-closed" indicating the boundary type of each bin. The "[" on the far left means "make the left most bin left-closed". See `?bin_data` for some examples. — Ben, Jan 28 '18 at 04:01
thx. is there a way to just say for a given dataframe: bin every numerical valued column into K bins? (maybe I should ask this as a standalone question..) — WestCoastProjects, Jan 28 '18 at 04:13
Do you mean like this? `df <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]; bin_data(unlist(df), bins = 5)` — Ben, Jan 28 '18 at 04:18

Define and apply custom bins on a dataframe

4 Answers4

Explanation

Binning one vector

Binning each column in the data.frame

Linked

Related