-7

I have a large vector of percentages (0-100) and I am trying to count how many of them fall into specific 20% buckets (<20, 20-40, 40-60, 60-80, 80-100). The vector has length 129605 and there are no NA values. Here's my code:

x<-c(0,0,0,0,0)
for(i in 1: length(mail_return))
{
    if (mail_return[i]<=20)
    {
        x[1] = x[1] + 1
    }
    if (mail_return[i]>20 && mail_return[i]<=40)
    {
        x[2] = x[2] + 1
    }
    if (mail_return[i]>40 && mail_return[i]<=60)
    {
        x[3] = x[3] + 1
    }
    if (mail_return[i]>60 && mail_return[i]<=80)
    {
        x[4] = x[4] + 1
    }
    else
    {   
        x[5] = x[5] + 1
    }
}

But `sum(x)` is giving me 133171. Shouldn't it equal the length of the vector, 129605? What's wrong?

Concerned_Citizen
  • Very shortly, you're going to want to marry the functions `cut` and `table`. – joran Nov 07 '12 at 22:36
  • Also, take a look at `ifelse`. –  Nov 07 '12 at 22:37
  • Why can't you all just elaborate? – Concerned_Citizen Nov 07 '12 at 22:40
  • @GTyler - Because this sort of question has been answered several times before: http://stackoverflow.com/questions/5570293/r-adding-column-which-contains-bin-value-of-another-column http://stackoverflow.com/questions/5746544/r-cut-by-defined-interval S.O. is not a replacement for research. – thelatemail Nov 07 '12 at 22:47
  • Why are there negative votes? I didn't know it was the intervals that went wrong. I thought it was something else. – Concerned_Citizen Nov 07 '12 at 23:13
  • Speaking only for myself: when provided two helpful pointers to some very useful functions, you responded with a rude and demanding comment, rather than simply reading the documentation. – joran Nov 07 '12 at 23:31
  • I think `GTyler` has a point. `joran`, the documentation is good to refer him to, but I think an explanation is in order here, because he went to good lengths to try to solve this himself, and his APPROACH is clearly off (using `for` &`if` for simple tasks) - so he deserves a nod in the right direction. – Señor O Nov 07 '12 at 23:35
  • @user1717913 Perhaps I didn't have time to write a full answer and was trying to be as helpful as I could? Rather than a "Thanks!" I get a whiny comment that I didn't do more. I don't think it's unreasonable to expect better behavior from question askers. – joran Nov 07 '12 at 23:39
  • I see your point. I do understand though, that seeing nothing but short, nonspecific responses would be frustrating, because it would appear as though people felt they had answered you. – Señor O Nov 07 '12 at 23:42
  • @joran You'll get a "Thanks!" when you give a comprehensive response. As for being demanding, I want more than just pointers which may lead to more hair-pulling for me (and I don't have much hair!). By the time you typed up that comment, you can probably come up with a decent response. Thank you and good day to you. – Concerned_Citizen Nov 08 '12 at 00:26

2 Answers

10

I like findInterval for these sorts of tasks:

x <- c(1,2,3,20,21,22,40,41,42,60,61,62,80,81,82)
table(findInterval(x,c(0,20,40,60,80)))


1 2 3 4 5 
3 3 3 3 3 
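
For comparison, here is a minimal sketch of the `cut` plus `table` combination mentioned in the comments. One difference worth noting: with its default arguments, `findInterval` treats each interval as closed on the left, so 20 lands in bucket 2, whereas the loop in the question puts 20 in bucket 1. `cut` with its default `right = TRUE` and with `include.lowest = TRUE` should reproduce the question's `<=` boundaries. The sample values (the ones above plus 0 and 100) are only illustrative.

```r
# Same sample values as above, plus 0 and 100 to exercise both endpoints
x <- c(0, 1, 2, 3, 20, 21, 22, 40, 41, 42, 60, 61, 62, 80, 81, 82, 100)

# right = TRUE (the default) makes each bin open on the left and closed on
# the right; include.lowest = TRUE also closes the first bin at 0
bins <- cut(x, breaks = c(0, 20, 40, 60, 80, 100), include.lowest = TRUE)
counts <- table(bins)
print(counts)

# Every value lands in exactly one bin, so the counts add up to length(x)
stopifnot(sum(counts) == length(x))
```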
thelatemail
  • I would've used `table(cut(x, breaks=c(0,20,40,60,80,100)))`, but I like the cleaner output of `findInterval` - thanks `latemail`! As a side note, `GTyler`, although you don't need the `&` operator here, `&&` is not the same in R as it is in other languages - it takes only the FIRST object in a vector - probably the reason for your error. I've never encountered a situation where `&` is not preferred. – Señor O Nov 07 '12 at 23:38
  • @user1717913: `&&` is almost always preferred in `if` statements. From `?"&&"`: "The longer form is appropriate for programming control-flow and typically preferred in ‘if’ clauses." – Joshua Ulrich Nov 07 '12 at 23:42
  • Thanks for pointing that out - I must have misunderstood when I first learned this. Is `cond1 && cond2` identical to `all(cond1 & cond2)`? (in result and/or speed)? – Señor O Nov 07 '12 at 23:47
  • @user1717913: No, the second one evaluates more than the first element of `cond1` and `cond2`, whereas `&&` only evaluates the first element of each. – Joshua Ulrich Nov 08 '12 at 03:26
2

The count is off because the final `else` belongs only to the last `if`, not to the whole chain. As a result, `x[5]` is incremented for every element that fails the condition `mail_return[i]>60 && mail_return[i]<=80`: that covers items that are > 80 (as you would expect), but also items that are <= 60, which have already been counted in `x[1]` to `x[3]` (ouch, that's the bug!).

You can replace...

if (mail_return[i]>60 && mail_return[i]<=80)
{
    x[4] = x[4] + 1
}
else
{   
    x[5] = x[5] + 1
}

with...

if (mail_return[i]>60 && mail_return[i]<=80)
{
    x[4] = x[4] + 1
}

if (mail_return[i] >80)
{   
    x[5] = x[5] + 1
}

...to fix things.

But as hinted in the other answer and the comments, there are better idioms for getting these counts, such as `table(findInterval(...))`, which avoid this longhand code and are more efficient.
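
If the loop is kept, one way to make this class of bug impossible is an `else if` chain, in which exactly one branch runs per element. The sketch below uses a small made-up `mail_return` vector (an assumption, standing in for the real data) and cross-checks the result against `cut`. Incidentally, the double counting also accounts for the numbers in the question: every value <= 60 is counted twice, so the surplus 133171 - 129605 = 3566 should be the number of such values.

```r
# Hypothetical sample data standing in for the real mail_return vector
mail_return <- c(5, 20, 35, 50, 60, 75, 80, 95, 100)

x <- c(0, 0, 0, 0, 0)
for (i in seq_along(mail_return)) {
  v <- mail_return[i]
  # Exactly one branch runs per element, so nothing is counted twice
  if (v <= 20) {
    x[1] <- x[1] + 1
  } else if (v <= 40) {
    x[2] <- x[2] + 1
  } else if (v <= 60) {
    x[3] <- x[3] + 1
  } else if (v <= 80) {
    x[4] <- x[4] + 1
  } else {
    x[5] <- x[5] + 1
  }
}

# Cross-check against the vectorized idiom with the same boundaries
vectorized <- as.vector(table(cut(mail_return,
                                  breaks = c(0, 20, 40, 60, 80, 100),
                                  include.lowest = TRUE)))
stopifnot(sum(x) == length(mail_return))
stopifnot(all(x == vectorized))
```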

mjv