
I'm new to programming and R, so I apologize if this question isn't clear enough. I think my problem is twofold, so I'll first give some context. First, I have a data frame within my data frame:

'data.frame':   27609 obs. of  2 variables:
 $ Diff : num  2557 2038 0 30 0 ...
 $ freq.:'data.frame':  27609 obs. of  1 variable:
  ..$ freq: int  85 68 1 31 1 35 1 1 34 42 ...

> head(d.f.)
  Diff freq
1 2557   85
2 2038   68
3    0    1
4   30   31
5    0    1
6 1034   35

I think this is causing my subsequent problem with mapply() below. I'd like to apply a function that, for each row, divides the value in one column by the value in another column, then outputs 1, 2, 3, or 4 depending on which range the quotient falls in.

myFunction = function(a,b) {
    interval = (a/b)
    ifelse(interval==0, 1 ,
           ifelse(interval<1, 2 , 
                  ifelse(interval<31, 3 , 4)))}

Test = mapply(myFunction, d.f.$Diff, d.f.$freq)
> Test
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    3    3    1    2    1    3
[2,]    4    3    1    2    1    3
[3,]    4    4    1    3    1    4
[4,]    4    4    1    2    1    4
[5,]    4    4    1    3    1    4
[6,]    4    4    1    2    1    3

The output above comes from running Test on only the first 6 rows. On the entire d.f., Test takes forever to run and, for some reason, outputs a matrix in which the only values I'm interested in are in the first row. I'd really appreciate any help understanding what I'm doing wrong. Thanks in advance!

Adam QC
  • How did you end up with a `data.frame` within your `data.frame` in the first place? – MichaelChirico Sep 24 '15 at 20:18
  • My goal was to add the variable freq as a count of the number of occurrences of each unique value in a column of my original data.frame. So it looked something like this: `d.f.$freq = with(D.F. , count(variable1))`. For some context, the original d.f. is a list of transactions, where variable1 is Subscription.ID. I wanted to count the number of transactions made by each Subscription.ID, then divide the length of the subscription (Max - Min transaction date) by the number of transactions to find the daily, monthly, or annual subscription type by Subscription.ID – Adam QC Sep 24 '15 at 20:38
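
The counting step described in the comment above can also be written so that the new column is a plain vector rather than a nested data frame. A minimal sketch in base R, assuming a hypothetical table `tx` with made-up `Subscription.ID` values (neither the name nor the data comes from the question):

```r
# Hypothetical transaction table; the IDs below are made up for illustration.
tx <- data.frame(Subscription.ID = c("a", "b", "a", "c", "a", "b"))

# ave() applies length() within each ID group and returns one value per row,
# so the new column is an ordinary vector, not a data frame.
tx$freq <- ave(seq_len(nrow(tx)), tx$Subscription.ID, FUN = length)

str(tx)
```

Because `ave()` returns a vector as long as its input, the result can be assigned straight into the column and later used in arithmetic such as `Diff / freq`.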

3 Answers


Your function is already vectorized, so you can call it on whole columns without mapply:

myFunction(df$Diff, df$freq)
[1] 3 3 1 2 1 3

You can create a new column directly.

df$newcol <- myFunction(df$Diff, df$freq)
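
As a quick check of the claim above, the sketch below compares the direct call with an element-by-element `mapply()` call. It assumes both inputs are plain vectors (not a nested data frame) and reuses the six sample rows from the question; the names `direct` and `looped` are just for illustration.

```r
# Function from the question, unchanged apart from using `<-`.
myFunction <- function(a, b) {
  interval <- a / b
  ifelse(interval == 0, 1,
         ifelse(interval < 1, 2,
                ifelse(interval < 31, 3, 4)))
}

# The six sample rows shown in the question, as plain vectors.
Diff <- c(2557, 2038, 0, 30, 0, 1034)
freq <- c(85L, 68L, 1L, 31L, 1L, 35L)

direct <- myFunction(Diff, freq)          # one vectorized call
looped <- mapply(myFunction, Diff, freq)  # one call per element

identical(direct, looped)
```

When both arguments are plain vectors of the same length, the two approaches agree, and the vectorized call avoids the per-element function-call overhead.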
Pierre L

You are reinventing the wheel here. Although ifelse is vectorized, it is known to be slow, and nesting several calls is usually a bad idea. R already provides the very efficient cut and findInterval functions, which are designed for exactly this kind of task. Here's an example:

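myFunc2 <- function(a, b) {
  # Tiny offset so that an exact 0 lands in its own bin
  tol <- .Machine$double.eps
  # Bins: [0, tol) -> 1, [tol, 1) -> 2, [1, 31 - tol) -> 3, [31 - tol, Inf) -> 4
  findInterval(a/b, c(0, 0 + tol, 1, 31 - tol, Inf))
}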

And here's a speed comparison (roughly a 20x speedup in user time, about 8x in elapsed time):

set.seed(123)
df <- data.frame(Diff = sample(1e3, 1e8, replace = TRUE),
                 freq = sample(1e2, 1e8, replace = TRUE))


system.time(res <- with(df, myFunction(Diff, freq)))
# user  system elapsed 
# 40.36   18.63  611.18 
system.time(res2 <- with(df, myFunc2(Diff, freq)))
# user  system elapsed 
# 1.89    0.83   76.64 
identical(as.integer(res), res2)
# [1] TRUE
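
The same binning can be sketched with cut, which is mentioned above but not shown. This is only an illustration under the same assumption of non-negative, finite quotients; `myFunc3` is a hypothetical name, and the check uses the six sample rows from the question rather than the large benchmark data.

```r
# Hypothetical cut()-based variant of the binning above.
myFunc3 <- function(a, b) {
  tol <- .Machine$double.eps
  # right = FALSE makes every bin left-closed: [0, tol), [tol, 1), [1, 31), [31, Inf)
  # labels = FALSE returns the integer bin codes instead of a factor
  cut(a / b, breaks = c(0, tol, 1, 31, Inf), right = FALSE, labels = FALSE)
}

# The six sample rows shown in the question.
Diff <- c(2557, 2038, 0, 30, 0, 1034)
freq <- c(85L, 68L, 1L, 31L, 1L, 35L)

res3 <- myFunc3(Diff, freq)
res3
```

One difference to keep in mind: with these breaks, a quotient of exactly Inf (for example when freq is 0) falls outside the last half-open bin, so cut returns NA for it.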
David Arenburg
  • Very neat function and thanks for the quick response, although I'm running into trouble implementing it. I receive this error: `Error in findInterval(a/b, c(0, 0 + tol, 1, 31 - tol, Inf)) : (list) object cannot be coerced to type 'double' ` – Adam QC Sep 28 '15 at 00:22
  • It's probably because you have a strange `data.frame` within a `data.frame` object. You need to make sure that both `a` and `b` are vectors. – David Arenburg Sep 28 '15 at 17:45
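
One way to act on the comment above is to replace the nested column with the vector it contains. The sketch below is a minimal reconstruction, not the asker's actual data: the data frame `d` is hypothetical and is built to mimic the structure printed in the question.

```r
# Function from the question, unchanged apart from using `<-`.
myFunction <- function(a, b) {
  interval <- a / b
  ifelse(interval == 0, 1,
         ifelse(interval < 1, 2,
                ifelse(interval < 31, 3, 4)))
}

# Hypothetical data frame with a data frame stored as a column.
d <- data.frame(Diff = c(2557, 2038, 0, 30, 0, 1034))
d$freq <- data.frame(freq = c(85L, 68L, 1L, 31L, 1L, 35L))
nested_before <- is.data.frame(d$freq)  # TRUE: the column is itself a data frame

# Flatten: overwrite the nested column with the vector inside it.
d$freq <- d$freq$freq
nested_after <- is.data.frame(d$freq)   # FALSE: now a plain integer vector

# With plain vectors, the vectorized call works directly.
d$newcol <- myFunction(d$Diff, d$freq)
d
```

After flattening, both arguments are ordinary vectors, which is what findInterval and the arithmetic `a/b` expect.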

If you turn it into a data.table (and call the variable dt):

> dt[, interval := Diff / freq]
> dt[, ifelse(interval == 0, 1, ifelse(interval < 1, 2, ifelse(interval < 31, 3, 4)))]
[1] 3 3 1 2 1 3
Chris Watson