
I am trying to bucket certain features into groups. The data.frame below (grouped) is my "key" (think Excel vlookup):

          Original  Grouped
1         Features Constant
2     PhoneService Constant
3    PhoneServices Constant
4       Surcharges Constant
5     CallingPlans Constant
6            Taxes Constant
7          LDUsage    Noise
8    RegionalUsage    Noise
9       LocalUsage    Noise
10       Late fees    Noise
11 SpecialServices    Noise
12         TFUsage    Noise
13       VoipUsage    Noise
14         CCUsage    Noise
15         Credits  Credits
16         OneTime  OneTime

I then reference my database, which has a column (BillSection) whose values come from grouped$Original, and I want to map each one to the corresponding grouped$Grouped. I am using sapply to perform this lookup, then cbind the resulting output to my original data.frame.

grouper <- as.character(sapply(as.character(bill.data$BillSection[1:100]),  # first 100 records of bill.data
       function(x) grouped[grouped$Original == x, 2]))  # second column (Grouped) for the matching Original value
cbind(bill.data[1:100, ], as.data.frame(grouper))

The above code works as expected, but it is slow when I apply it to my whole database, which exceeds 10,000,000 unique records. Is there a faster alternative? I know I can use plyr, but I believe it is even slower than sapply. I tried to figure it out with data.table but had no luck. Any suggestions would be helpful. I am also open to coding this in Python, which I am new to but have heard is much faster than R, since I deal with large datasets very often; I wanted to know whether R can do this fast enough to be useful.

Thanks!

ilanman
    I can't tell what you're trying to accomplish here (maybe describe it in words rather than code?), but if you're `cbind`ing repeatedly for every group of 100 that could [easily kill performance](http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r/8474941#8474941). – Ari B. Friedman Feb 11 '13 at 17:11
  • I'm only `cbind`ing once, at the end. I am trying to "lookup" a value in bill.data$BillSection (which will be "Features", "OneTime", "PhoneServices", etc... and group it into either "Constant", "Noise", "OneTime", or "Credits"). This will be used for plotting later on. – ilanman Feb 11 '13 at 17:23

1 Answer


I'm not sure I understand your question, but can you use merge()? i.e. something like...

merge(big.df, group.names.df, by.x='original.column.in.big.df',
      by.y='original', all.x=T)
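Applied to the objects in the question, that would be something like the following sketch (assuming `bill.data` and `grouped` as defined above):

```r
# Left join: keep every row of bill.data and attach the Grouped label.
# Note that merge() does not preserve the original row order.
bill.data.grouped <- merge(bill.data, grouped,
                           by.x = "BillSection", by.y = "Original",
                           all.x = TRUE)
```

`all.x = TRUE` keeps rows of `bill.data` whose `BillSection` has no match in the key (they get `NA` in the Grouped column), mirroring an unmatched vlookup.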

NB: plyr has a parallel option (`.parallel = TRUE`)...
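If row order matters (`merge()` reorders rows), another fast base-R option is a named lookup vector indexed by the column itself — a sketch under the same assumptions about `bill.data` and `grouped`:

```r
# Build the lookup table once as a named character vector...
key <- setNames(as.character(grouped$Grouped), as.character(grouped$Original))
# ...then one vectorized subscript replaces the per-element sapply().
bill.data$grouper <- key[as.character(bill.data$BillSection)]
```

This preserves row order and avoids the per-row function calls of `sapply()`.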

Sean
    This is perfect. Just an FYI, here are some `system.time` comparisons on 100K records. My original method (not including the `cbind`): `user 22.05, system 0.00, elapsed 22.07`. Sean's suggestion: `user 1.10, system 0.03, elapsed 1.12`. – ilanman Feb 11 '13 at 17:32
    Good news! If you like the answer, you can click up and accept! :) – Sean Feb 11 '13 at 17:39