Binning different lengths in R

Question

input1

dput(a1  100 200 +
a1  250 270 +
a1  333 340 -
a2  450 460 +)

input2

dput(a1  101 106 +
a1  112 117 +
a1  258 259 +
a1  258 259 +
a1  258 259 +
a1  258 259 +
a1  258 259 +
a1  258 259 +
a1  258 259 +
a1  258 259 +
a1  258 259 +
a1  260 262 +
a1  260 262 + 
a1  260 262 + 
a1  260 262 + 
a1  260 262 + 
a1  332 333 -
a1  332 333 -
a1  332 333 -
a1  332 333 -
a1  332 333 -
a1  332 333 -
a1  332 333 -
a1  331 333 -
a1  331 333 -
a1  331 333 -
a1  331 333 -
a1  331 333 -
a1  331 333 -)

output

c   s   e   st  1   2   3   4   5   6   7   8   9   10
a1  100 200 +   1   2   0   0   0   0   0   0   0   0
a1  250 270 +   0   0   0   9   5   0   0   0   0   0
a1  330 340 -   0   0   0   0   0   0   0   6   7   0
a2  450 460 +   0   0   0   0   0   0   0   0   0   0

I want to count the density of points (input2) within the ranges given in input1. For example, how many points does a1-100-200 have in the range 100 to 200? (i.e. 3.) I want to do the same for every range in input1 and then compare the ranges with each other. The problem is that the ranges have different lengths (200-100=100 or 270-250=20). To compare them I need to put them on a common scale, so I came up with a window of 10 bins per range (see output) and count the input2 points in each input1 bin. Finally, I need to plot the bins on the x-axis and the counts on the y-axis, something like xyplot(x(bins),y1(a1:100:200:+)+y2(a1:250:270:+y3...+y4)

"+" means we take 100 as the start point and 200 as the end point when calculating bins (100-110 is the 1st bin, ...). "-" means exactly the opposite (190-200 is the first bin).

1-10 means bins 1 to 10.

You need to use columns 1 and 2, based on the column 1 key, for the bins. We remove the values that are not in range.

c = character, s = start, e = end, st = strand, 1-10 are the bins of input1. Yes, you are right about the binning. For example, the bins for 250-270 should be 2 apart, because 270-250=20, so for 10 bins it is 20/10=2.
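
The binning rule described above can be sketched for a single range as follows. This is only a minimal illustration, not code from the original post: the function name `count_bins` and its arguments are hypothetical, and it assumes right-closed bins (the default of `cut`) with the lowest edge included.

```r
# Count positions in n_bins equal-width bins spanning [start, end].
# Assumption: bins are right-closed, e.g. (250,252], (252,254], ...
# strand "-" reverses the bin order so that bin 1 lies next to `end`.
count_bins <- function(pos, start, end, strand = "+", n_bins = 10) {
  breaks <- seq(start, end, length.out = n_bins + 1)
  # cut() gives NA for positions outside [start, end]; table() drops them
  counts <- as.vector(table(cut(pos, breaks, include.lowest = TRUE)))
  if (strand == "-") counts <- rev(counts)
  names(counts) <- seq_len(n_bins)
  counts
}

# Range 250-270 has bins of width 20/10 = 2.
# Bins 4 and 5 hold 9 and 5 points, as in the a1 250-270 row of the output.
count_bins(c(rep(258, 9), rep(260, 5)), 250, 270, "+")
```

Under a left-closed convention the same points would land one bin later, so the choice of which edge is closed matters for values that sit exactly on a bin boundary.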

What does the column with `+` and `-` mean? It doesn't seem related to the question. Also, making your data reproducible with `dput` would be helpful. — Richie Cotton, Aug 04 '11 at 12:37
It is not clear to me what the columns labeled 1-10 are supposed to represent in your output. For that matter, what do c,s,e,st represent as well? Why do you end up with counts in the columns labeled 1-10? — Chase, Aug 04 '11 at 12:39
Two more things: Why does `input2` have more than one column? What do you want to do with values in `input2` that don't fall into the ranges specified in `input1`, e.g., what to do with a value of 225? — Richie Cotton, Aug 04 '11 at 12:42
Further, which columns do you want to use for your bins for input2, the first or the second? What do you do with data points that are out of range on either the first or the second...i.e. the rows with `331 333`. 331 does not fall within one of the ranges in input1, while 333 does. — Chase, Aug 04 '11 at 12:43
@ Cotton: "+" means we take 100 as the start point and 200 as the end point when calculating bins (100-110 is the 1st bin, ...); "-" means exactly the opposite (190-200 is the first bin). — repinementer, Aug 04 '11 at 12:43
@chase: You need to use columns 1 and 2, based on the column 1 key, for the bins. We remove the values that are not in range. — repinementer, Aug 04 '11 at 12:47
Still ambiguous. 1-10 bins within what? The categories outlined in input1? So does that mean the binwidth for `100 200` should be `100 - 110, 111 - 120, 121 - 130, ...` and the bins for `250 270` should be `250 - 251, 252 - 253, 254 - 255, ...`? Details on this information at the front end will make giving you reasonable answers infinitely easier. Also, please heed Richie's suggestions of adding the results of `dput(input1)` and `dput(input2)` to your question. — Chase, Aug 04 '11 at 12:50
c = character, s = start, e = end, st = strand, 1-10 are the bins of input1. Yes, you are right about the binning, but the bins for 250-270 should be 2 apart, because 270-250=20, so for 10 bins it is 20/10=2. — repinementer, Aug 04 '11 at 12:50
Can we take a big step back here? Can you edit your question to include the relevant details of these 10 comments? No reasonable question requires someone to read through 10 comments to gain a semblance of understanding about what the problem is. I think there is a good question hiding in here, but it's not coming through in the current presentation. Help the SO community help you. This post on SO offers brilliant insight into asking good questions: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Chase, Aug 04 '11 at 12:54
I'm quite new here. Did I repost the question correctly? :( — repinementer, Aug 04 '11 at 13:06
@repinemeter: Um, no. In R, type `dput(input1)`. The output should be a structure of some sort. In your question, type `input1 <- `. (Then CTRL+K to format it as code.) — Richie Cotton, Aug 04 '11 at 13:39
@repinemeter: Also, let me know if I was anywhere close with my answer. You still need to try and rewrite the question to make it clearer what you want. Adding samples of code that you've tried can be helpful. Don't worry if you're finding it hard to get your message across. Writing a good question takes lots of practise. — Richie Cotton, Aug 04 '11 at 13:42
The simplest way of describing the question may be this: input1 has random numbers from 1 to 100, and input2 has a range, 10-20. I want to count how many of the random numbers fall in the 10-20 range. In my question the input has ranges of different lengths, like 1-10, 20-25 and 80-100, so counting the random numbers that fall in these different ranges is tricky. So I thought of normalizing all ranges to a similar scale. For this I can take the average length of all ranges (good), or I can form uniform bins and count the random numbers in those bins (the perfect way). — repinementer, Aug 07 '11 at 01:56
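
The simplified version in the last comment can be sketched as below. The data are made-up toy values, the names `x` and `ranges` are hypothetical, and dividing by the range width is only one possible way to put ranges of different length on a common scale. It is not necessarily the normalization the asker had in mind.

```r
set.seed(1)                              # reproducible toy data (hypothetical)
x <- sample(1:100, 50, replace = TRUE)   # random numbers between 1 and 100

# Ranges of different length, as in the comment
ranges <- data.frame(lo = c(1, 20, 80), hi = c(10, 25, 100))

# Raw count of values inside each closed range [lo, hi]
ranges$n <- mapply(function(lo, hi) sum(x >= lo & x <= hi),
                   ranges$lo, ranges$hi)

# Count per unit length, so that ranges of different width can be compared
ranges$density <- ranges$n / (ranges$hi - ranges$lo)

ranges
```

A single number per range loses the position of the points inside the range, which is what the fixed number of bins per range is meant to keep.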

score 1 · Answer 1 · answered Aug 04 '11 at 12:58

The question is still not very well formed so I'm not sure I've quite understood what you want, but you probably want to use a combination of table and cut.

Your sample data

# Ranges from the question; columns are named start/end to match the code below
input1 <- data.frame(
  type  = paste("a", rep(1:2, times = c(3, 1)), sep = ""),
  start = c(100, 250, 333, 450),
  end   = c(200, 270, 340, 460)
)

# Points from the question: 29 rows, all of type "a1"
input2 <- data.frame(
  type  = rep.int("a1", 29),
  start = rep(c(101, 112, 258, 260, 332, 331), times = c(1, 1, 9, 5, 7, 6)),
  end   = rep(c(106, 117, 259, 262, 333), times = c(1, 1, 9, 5, 13))
)

First you define bins based upon the values in input1.

cut_points <- with(input1, sort(c(start, end)))

Then split input2$start by type, cut it up by bins and find the count in each.

with(input2, tapply(start, type, function(x) table(cut(x, cut_points))))

Possibly repeat with the end column.

with(input2, tapply(end, type, function(x) table(cut(x, cut_points))))
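
The calls above bin against one set of cut points shared by all ranges. A possible extension towards the 10-bins-per-range table in the question is sketched below. It is an assumption about what is wanted, not a confirmed solution: the `strand` column, the use of the `start` column only, and the right-closed bins are all choices made for illustration.

```r
# Sample data from the question; the strand column is added here
input1 <- data.frame(
  type   = c("a1", "a1", "a1", "a2"),
  start  = c(100, 250, 333, 450),
  end    = c(200, 270, 340, 460),
  strand = c("+", "+", "-", "+"),
  stringsAsFactors = FALSE
)
input2 <- data.frame(
  type  = "a1",
  start = rep(c(101, 112, 258, 260, 332, 331), times = c(1, 1, 9, 5, 7, 6)),
  stringsAsFactors = FALSE
)

n_bins <- 10

# One row of bin counts per range in input1
counts <- t(mapply(function(type, start, end, strand) {
  pos    <- input2$start[input2$type == type]          # points with the same key
  breaks <- seq(start, end, length.out = n_bins + 1)   # 10 equal-width bins
  n      <- as.vector(table(cut(pos, breaks, include.lowest = TRUE)))
  if (strand == "-") rev(n) else n                     # "-" counts from the end
}, input1$type, input1$start, input1$end, input1$strand))

dimnames(counts) <- list(NULL, as.character(seq_len(n_bins)))
result <- cbind(input1, counts)
result
```

With the range 333-340 as given in input1, the start positions 331 and 332 fall outside the range and are dropped, so that row is all zeros. For a plot of counts against bin number, something like `matplot(t(counts), type = "l")` would draw one line per range.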

The simplest way of describing the question may be this: input1 has random numbers from 1 to 100, and input2 has a range, 10-20. I want to count how many of the random numbers fall in the 10-20 range. In my question the input has ranges of different lengths, like 1-10, 20-25 and 80-100, so counting the random numbers that fall in these different ranges is tricky. So I thought of normalizing all ranges to a similar scale. For this I can take the average length of all ranges (good), or I can form uniform bins and count the random numbers in those bins (the perfect way). — repinementer, Aug 07 '11 at 01:54
