1

I'm trying to figure out the five fastest-growing topics for each quarter. I have a dataframe (will call it df) in R with three columns - a quarter number (df$QNum), the topic (df$Topic) and the number of records of that topic that quarter (df$Total_Hits).

Here's an example of how my dataframe df looks:

    Total_Hits          Topic                 QNum
        10              Technology            1
        86              Video Conferencing    1
        14              Video Conferencing    2
        10              Technology            3
         1              Video Conferencing    1
        12              Technology            21

I want to create a new column in df, df$QonQGrowth that, for each record, calculates the growth of Hits on that Topic over the previous quarter. I don't mind how it looks for df$QNum=1 but for the third record in this example, it would calculate: (Total_Hits/(Total_Hits, where Topic="Video Conferencing" and QNum=1)-1)

I think it would look something like the following but can't quite figure it out:

df$QonQGrowth <- (df$Total_Hits / ([a lookup of Total_Hits for df$Topic and (df$Qnum-1)?]))-1

The data set is pretty large so it's possible that there won't be a record for every topic every quarter.

Similar questions here and here but they're not doing exactly what I need.

Edit: This question also seems like it might be useful, using ddply or aggregate.

Thanks so much in advance!

Community
  • 1
  • 1
jegeragh
  • 85
  • 1
  • 7

1 Answers1

4

I'm sure there is a cleaner way to do this, but here is an ugly plyr solution:

> #make junk data
> set.seed(3)
> dat <- data.frame(total_hits=round(runif(25,0,50)),topic=sample(c("A","B","C","D"),25,T),qnum=round(runif(25,1,6)))
> dat <- dat[row.names(unique(dat[,2:3])),]
> dat
   total_hits topic qnum
1           8     D    2
2          40     C    2
3          19     D    5
4          16     C    6
6          30     B    6
7           6     B    3
8          15     A    2
9          29     B    2
11         26     B    5
12         25     D    1
13         27     A    5
18         35     A    4
19         45     C    3
20         14     B    1
23          6     A    6
25         12     D    4
> 
> #get you qonqgrowth variable
> library(plyr)
> ddply(dat,.(topic,qnum),summarize,qonqgrowth=ifelse(any(dat$qnum==qnum-1 & dat$topic == topic),total_hits/(dat$total_hits[dat$qnum == qnum-1 & dat$topic == topic]),NA))
   topic qnum qonqgrowth
1      A    2         NA
2      A    4         NA
3      A    5  0.7714286
4      A    6  0.2222222
5      B    1         NA
6      B    2  2.0714286
7      B    3  0.2068966
8      B    5         NA
9      B    6  1.1538462
10     C    2         NA
11     C    3  1.1250000
12     C    6         NA
13     D    1         NA
14     D    2  0.3200000
15     D    4         NA
16     D    5  1.5833333
David
  • 9,284
  • 3
  • 41
  • 40
  • Maybe complicated, but not all that ugly :-) . – Carl Witthoft Sep 06 '13 at 11:45
  • Exactly what I wanted, thanks so much! Unfortunately, it doesn't work on my dataset as it's too large :( But it's good to see how ddply can be used. – jegeragh Sep 07 '13 at 20:06
  • What do you mean it doesn't work? Are you getting error messages? Is it just too slow? You can easily run any ddply in parallel if you have multiple cores and enough memory but a data.table solution may a better option. – David Sep 07 '13 at 20:18
  • @David It looks like it's just too slow - I've let it run for a few hours with no results. Thanks for the suggestion! A little off topic - but I've added .parallel=TRUE to the end of the ddply() and I'm getting "Error: foreach package required for parallel plyr operation" – jegeragh Sep 07 '13 at 23:37
  • 1
    That's telling you that you need to install the `foreach` package in order to run `ddply()` in parallel, run: `install.packages('foreach')` – David Sep 08 '13 at 14:39
  • Thank you @David!! I got it by splitting my dataset into smaller chunks and it works perfectly - appreciate your help :) – jegeragh Sep 08 '13 at 19:44