
I am using ddply inside subset() to calculate some metrics and roll up a table. Some of the metrics I want to calculate need to use the summarized columns created by the ddply operation.

Here is the function with the simple calculated columns:

subset_by_market <- function (q, marketname, dp) {
  subset(ddply(df, .(quarter, R.DMA.NAMES, daypart, station), summarise, 
               spot.count = length(spot.id), 
               station.investment = sum(rate),
               nullspots.male = sum(nullspot.male),
               nullspots.allpersons = sum(nullspot.allpersons),
               total.male.imp = sum(male.imp),
               total.allpersons.imp = sum(allpersons.imp),
               spotvalue.male = sum(spotvalue.male),
               spotvalue.allpersons = sum(spotvalue.allpersons)),
         quarter == q & R.DMA.NAMES == marketname & daypart == dp)
}

I call subset_by_market("Q32013", "Columbus.OH", "primetime") to summarize and create a subset. My resulting table looks like:

  quarter R.DMA.NAMES   daypart           station spot.count station.investment nullspots.male     nullspots.allpersons
10186  Q32013 Columbus.OH primetime ADSM COLUMBUS, OH        103               5150             67                   61
10187  Q32013 Columbus.OH primetime              ESYX         49                  0             49                   49
10188  Q32013 Columbus.OH primetime  MTV COLUMBUS, OH         61               4500              7                    1
10189  Q32013 Columbus.OH primetime     WCMH-Retro TV         94                564             93                   93
10190  Q32013 Columbus.OH primetime              WTTE          1                  0              0                    0
10191  Q32013 Columbus.OH primetime              WWHO          9                  0              2                    2
  total.male.imp total.allpersons.imp spotvalue.male spotvalue.allpersons
10186           47.2                127.7       4830.409            4775.1068
10187            0.0                  0.0            NaN                  NaN
10188          157.9                371.1       4649.746            4505.2608
10189            0.3                  0.3       3162.000            3162.0000
10190            3.5                 10.3        570.166             591.0231
10191            3.9                 15.8       7155.000            4356.4162

Question 1: I would like to add percentage columns to the same data frame, e.g.: (i) percent.spot.count = spot.count / sum(spot.count), (ii) percent.nullspots.male = nullspots.male / sum(nullspots.male).

However, when I add these to the ddply arguments, I get 1 (100%) in the resulting column: within each group, sum(spot.count) is just that group's own spot.count, so the value divides by itself instead of by the column total.
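For reference, one way around this (a sketch, not from the original thread; column names follow the question's data) is to compute the percentages in a second pass over the summarised table, where sum() can see the whole column:

```r
library(plyr)

# First pass: summarise per group as before.
summarized <- ddply(df, .(quarter, R.DMA.NAMES, daypart, station), summarise,
                    spot.count     = length(spot.id),
                    nullspots.male = sum(nullspot.male))

# Second pass: sum() now runs over all rows of the summary,
# so the division uses the column totals, not a per-group scalar.
summarized <- transform(summarized,
                        percent.spot.count     = spot.count / sum(spot.count),
                        percent.nullspots.male = nullspots.male / sum(nullspots.male))
```

If the percentages should instead be within each quarter/market/daypart, re-group before dividing, e.g. `ddply(summarized, .(quarter, R.DMA.NAMES, daypart), transform, percent.spot.count = spot.count / sum(spot.count))`.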

Question 2: This is slow, and I humbly accept the coding may not be optimal. I am on an i5 2.6 GHz PC with 16 GB DDR3 RAM and a 64-bit OS. The dataset is 1M rows.

system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
   user  system elapsed 
 228.13  176.84  416.12

The intention is to visualize all calculated metrics on an online dashboard and let the user select subset_by_market(q, marketname, dp) via drop-down menus. How can I make it faster?

ADDING SAMPLE DATA:

`> structure(list(market = c("Local", "Local", "Local", "Local", 
"Local", "Local", "Local", "NATIONAL CABLE", "Local", "Local"
), spot.id = c(11248955L, 11262196L, 11946349L, 11625265L, 12929889L, 
11259758L, 11517638L, 11599834L, 12527365L, 12930259L), date = structure(c(1375675200, 
1376625600, 1390280400, 1383627600, 1401249600, 1375848000, 1380772800, 
1383019200, 1397102400, 1401163200), class = c("POSIXct", "POSIXt"
), tzone = ""), hour = c(15, 17, 11, 18, 19, 1, 13, 14, 16, 22
), time = structure(c(0.642361111111111, 0.749305555555556, 0.481944444444444, 
0.770138888888889, 0.830555555555556, 0.0597222222222222, 0.582638888888889, 
0.597222222222222, 0.675694444444444, 0.930555555555556), format = "h:m:s", class = "times"), 
    local.date = structure(c(1375675200, 1376625600, 1390280400, 
    1383627600, 1401249600, 1375848000, 1380772800, 1383019200, 
    1397102400, 1401163200), class = c("POSIXct", "POSIXt"), tzone = ""), 
    local.hour = c(15, 17, 11, 18, 18, 0, 13, 14, 15, 22), local.time = structure(c(0.642361111111111, 
    0.749305555555556, 0.481944444444444, 0.770138888888889, 
    0.788888888888889, 0.0180555555555556, 0.582638888888889, 
    0.597222222222222, 0.634027777777778, 0.930555555555556), format = "h:m:s", class = "times"), 
    vendor = c("Time Warner - Myrtle Beach", "WMYD", "WSBK", 
    "WDCA", "Comcast - Memphis", "Charter Media - Birmingham", 
    "WBNA", "G4", "Comcast - Houston", "Comcast - Youngstown"
    ), station = c("VH-1 MYRTLE BEACH", "WMYD", "WSBK", "WDCA", 
    "COM MEMPHIS", "FX BIRMINGHAM", "WBNA", "G4", "SPK HOUSTON", 
    "COM YOUNGSTOWN CC"), male.imp = c(0, 2, 0, 0, 0.6, 0.4, 
    0, 0, 3.9, 0), women.imp = c(0, 2.5, 0, 2.5, 0.2, 0.6, 0, 
    0, 4.6, 0.6), allpersons.imp = c(0, 3.5, 0, 2.5, 0.8, 0.8, 
    0, 0, 7.8, 0.6), hh.imp = c(0, 8.5, 8, 64.5, 1.3, 2.9, 1.3, 
    15, 13.7, 1), isci = c("IT6140MB", "ITCD78DT", "IT6192BS", 
    "IT6170WD", "IT6173ME", "IT6162BI", "IT6155LO", "ITES13410", 
    "IT3917", "IT3921"), creative = c("Eugene Elbert (Bach. Tcom Eng. Tech) :60", 
    "The Problem Solvers (revised) - IET :60", "Murtech/Kinetic/Integra :60", 
    "Kevin Bumper/NTSG/Lifetime :60", "NCR/Schlumberger/Sprint (revised) (Bach) :60", 
    "Skills Gap (revised) /Kevin :60", "Rising Costs60 (Opportunity Scholar - No Nursing)", 
    "Irina Lund (Bach. ISS) :60", "Augustine Lopez (A. CEET) :30 (no loc)", 
    "John Ryan Ellis (B. PM/A. CDD) :30 (no loc)"), program = c(NA, 
    "TYLER PERRY'S MEET THE BROWNS", "THE PEOPLE'S COURT", "Judge Judy", 
    NA, NA, "Meet the Browns/Are We There Yet/News/Wendy Willia", 
    "HEROES", "Spike EF Rotator", NA), rate = c(5, 230, 100, 
    625, 40, 0, 15, 40, 110, 7), R.DMA.NAMES = c("Myrtle.Beach.Florence", 
    "Detroit", "Boston.Manchester.", "Washington.DC.Hagrstwn.", 
    "Memphis", "Birmingham.Ann.and.Tusc.", "Louisville", "NATIONAL CABLE", 
    "Houston", "Youngstown"), date.time = c("2013-08-05 15:25:00", 
    "2013-08-16 17:59:00", "2014-01-21 11:34:00", "2013-11-05 18:29:00", 
    "2014-05-28 19:56:00", "2013-08-07 01:26:00", "2013-10-03 13:59:00", 
    "2013-10-29 14:20:00", "2014-04-10 16:13:00", "2014-05-27 22:20:00"
    ), daypart = c("afternoon", "evening", "morning", "evening", 
    "evening", "late fringe", "afternoon", "afternoon", "afternoon", 
    "primetime"), quarter = structure(c(4L, 4L, 1L, 6L, 3L, 4L, 
    6L, 6L, 3L, 3L), .Label = c("Q12014", "Q22013", "Q22014", 
    "Q32013", "Q32014", "Q42013"), class = "factor"), cpi.allpersons = c(96.2179487179487, 
    79.0114068441065, 35.1219512195122, 82.3322348711803, 30, 
    0, 138.721804511278, 28.3135215453195, 28.2384088854449, 
    86.6666666666667), cpi.male = c(750.5, 188.882673751923, 
    115.959004392387, 144.492639327024, 38.9847715736041, 0, 
    595.161290322581, 34.7402005469462, 62.010777084515, 156.712328767123
    ), spotvalue.allpersons = c(0, 276.539923954373, 0, 205.830587177951, 
    24, 0, 0, 0, 220.25958930647, 52), spotvalue.male = c(0, 
    377.765347503846, 0, 0, 23.3908629441624, 0, 0, 0, 241.842030629609, 
    0), nullspot.allpersons = c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0), 
    nullspot.male = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1)), .Names = c("market", 
"spot.id", "date", "hour", "time", "local.date", "local.hour", 
"local.time", "vendor", "station", "male.imp", "women.imp", "allpersons.imp", 
"hh.imp", "isci", "creative", "program", "rate", "R.DMA.NAMES", 
"date.time", "daypart", "quarter", "cpi.allpersons", "cpi.male", 
"spotvalue.allpersons", "spotvalue.male", "nullspot.allpersons", 
"nullspot.male"), row.names = c(561147L, 261262L, 89888L, 941010L, 
500366L, 65954L, 484053L, 598996L, 380976L, 968615L), class = "data.frame")`

Apologies for the ugly dput.

vagabond
    You could significantly speed up those calculations by using `dplyr`, a newer version of `plyr` for data.frame like objects. dplyr might also save you some typing because of functions like `summarise_each` where you can apply one or more functions to a number of columns at the same time (like `sum` in your example) without writing for each column `total_x = sum(x)`. – talat Nov 09 '14 at 21:37
  • thanks, I'll try and re-adapt to `dplyr` – vagabond Nov 09 '14 at 21:39
  • If you could include some sample data in your question, it would be easy to show you how it could be done in dplyr. – talat Nov 09 '14 at 21:40
  • @beginneR, don't forget to ping me when they do, so I'll add a superior `data.table` solution too :) – David Arenburg Nov 09 '14 at 21:43
  • sure .. depending the definition of "superior". ;) (@DavidArenburg) – talat Nov 09 '14 at 21:44
  • @beginneR, it will become superior after Arun will completely rewrite it of course :) – David Arenburg Nov 09 '14 at 21:44
  • ok , can I add a `dput` version of the sample? – vagabond Nov 09 '14 at 21:45
  • @beginneR done, sample data added. – vagabond Nov 09 '14 at 22:07
  • i tried `df %.% group_by(quarter, R.DMA.NAMES, daypart, station) %.% summarize (spot.count = length(spot.id))` but I keep getting error message `Error in eval(expr, envir, enclos) : column 'time' has unsupported type`. The class of "time" is 'times' from package `(chron)`. I tried converting it to `POSIXct` but it didn't work. I tried to `unclass` it, but that did not work either. As a last resort I deleted the columns which were of class `times`. But I don't want to lose that information really. The code did become a lot faster though. – vagabond Nov 11 '14 at 02:16
  • I'm curious - why is no one engaging on this thread anymore? Is there some SO specific direction I have missed? – vagabond Nov 11 '14 at 14:42

1 Answer


This answers only my second question, about making the function faster. Based on @beginneR's tip, I converted the function to dplyr.

subset_by_market <- function(q, marketname, dp) {
  subset(df %>% group_by(quarter, R.DMA.NAMES, daypart, station) %>%
           summarize(spot.count = length(spot.id),
                     station.investment = sum(rate),
                     nullspots.male = sum(nullspot.male),
                     nullspots.allpersons = sum(nullspot.allpersons),
                     total.male.imp = sum(male.imp),
                     total.allpersons.imp = sum(allpersons.imp),
                     spotvalue.male = sum(spotvalue.male),
                     spotvalue.allpersons = sum(spotvalue.allpersons),
                     male.imp.per.spot = total.male.imp / spot.count,
                     allpersons.imp.per.spot = total.allpersons.imp / spot.count,
                     cost.per.spot = station.investment / spot.count,
                     male.value.per.spot = spotvalue.male / spot.count,
                     allpersons.value.per.spot = spotvalue.allpersons / spot.count),
         quarter == q & R.DMA.NAMES == marketname & daypart == dp)
}

This reduced the time drastically to :

> system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
   user  system elapsed 
   1.06    0.00    1.09 
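Since the grouped summary does not depend on q, marketname, or dp, a further gain is possible by summarising once at startup and having each dashboard request filter the small summary table instead of re-aggregating 1M rows (a sketch, assuming df fits in memory and `market_summary` is a hypothetical name):

```r
library(dplyr)

# Build the full summary once; it is reused by every request.
market_summary <- df %>%
  group_by(quarter, R.DMA.NAMES, daypart, station) %>%
  summarize(spot.count = n(),
            station.investment = sum(rate))

# Each call now only filters a few thousand summary rows.
subset_by_market <- function(q, marketname, dp) {
  filter(market_summary,
         quarter == q, R.DMA.NAMES == marketname, daypart == dp)
}
```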

The glitch I faced in using dplyr was a column called "time" in my data, of class times from package chron. I kept receiving the error `Error: column 'local.time' has unsupported type`. I couldn't figure out an exact workaround, so I simply converted the column to POSIXct with `df$time <- as.POSIXct(as.character(df$time), format = "%H:%M:%S")`. This was not optimal, because the reason I converted it to times with chron was to maintain the time chronology without needing a date or time zone (more on that here: work around to specifying date range in ifelse statement in R). However, it solves the immediate problem at hand.
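An alternative workaround (a sketch, not from the thread) exploits the fact that chron `times` values are stored as fractions of a day, so converting to plain numeric preserves the chronological ordering without a date or time zone:

```r
library(chron)

# 'times' values are fractions of a day (0 = midnight, 0.5 = noon);
# multiplying by 86400 yields seconds since midnight as plain numeric,
# which dplyr handles and which still sorts chronologically.
df$time <- as.numeric(df$time) * 86400
```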

vagabond