Subsetting a data frame by ddply, then applying a function with adply on the subset R

Question

I am having some trouble with formulating a logical piece of code using plyr. My problem involves two big dataframes of different lengths, with sample as below:

dfSample <-
 structure(list(Type = structure(c(8L, 100L, 86L, 86L, 86L, 86L, 
 33L, 8L, 105L, 44L, 36L, 107L, 107L, 78L, 33L, 105L, 99L, 10L, 
 16L, 75L), .Label = c("Alumni Services", "Anti-Virus and Malware", 
 "Application Integration", "Application Monitoring", "Application Testing", 
 "Audio Visual Support", "Audio Visual Support - CLS", "Audio Visual Support - Non-CLS", 
 "Backup Services", "Banner", "Bus and Law", "Business Analysis", 
 "Careers", "Common Learning Spaces", "Communication and Marketing", 
 "Computer Aided Assessment", "Conference Accounts", "Content Management", 
 "Database Services", "Datacentre", "Desktop Monitoring", "Desktop Software", 
 "Document Management", "Email", "Email Programs", "Encryption", 
 "Eng and the Enviro", "Equipment Disposal", "Estates and Facilities", 
 "Examination Papers", "Faculty Engagement", "Filestore Support Services", 
 "Finance Services", "General Admin Services", "General InfoSec Advice", 
 "Generic Accounts", "Grid Accounts (HPC)", "Health Sciences", 
 "High Performance Computing (HPC)", "Hosted webspace (LAMP/IIS)", 
 "HR and Payroll Services", "HR General", "HR Recruitment", "HR Systems", 
 "Hub Rooms", "Humanities", "ICT Facilities", "ID Card Services", 
 "Identity Management (User accounts)", "Identity Services", "Information Policy Breaches", 
 "Information Risk Analysis", "iSolutions Admin Services", "iSolutions Administration", 
 "IT Training and Development", "Large File Transfer", "Lecture Capture", 
 "Lecture Capture - CLS", "Lecture Capture - Non-CLS", "Legacy Corporate Systems", 
 "Library Services", "Licence Management", "Managed Print Service", 
 "Management Servers", "Media Asset Management", "Media Support", 
 "Medicine", "Meet and Greet", "Misuse and Security Incidents", 
 "Misuse Of Systems", "Mobile Apps", "Mobile Devices", "Natural and Enviro Sci", 
 "Network Access Services", "Network Services", "OS Builds", "Other Learning Systems", 
 "Personal Filestore", "Personal web pages", "Phys and Applied", 
 "Printing (Managed)", "Printing (Not MPS)", "Project Management and Resourcing", 
 "Repair", "Reporting Services", "Request for Software", "Research Filestore", 
 "Research Governance", "Research Management", "Research Output", 
  "Resource Filestore", "Risk Analysis and Assessment", "Security", 
 "Self Service Help", "Server Monitoring", "Service Hosting", 
 "ServiceLine", "Soc and Human Sci", "Software Configuration Management", 
 "Software Licensing and Management", "Software Services", "SportRec", 
 "Staff Accounts", "Staff Desktop Deployment", "Staff Desktop Services", 
 "Staff Desktop Services (Not UoS Build)", "Student Accounts", 
 "Student Admin Services", "Student Personal Workstations", "SUSSED", 
 "Switchboard", "Switchboard Infrastructure", "System Access Request", 
 "Telephony", "University Admin Services", "Unmanaged Printing", 
 "Videoconferencing", "Videoconferencing - CLS", "Videoconferencing - Non-CLS", 
 "Virtual Learning Environment (VLE)", "Visitor Accounts", "Web Statistics", 
 "Windows Core Environment"), class = "factor"), Tkt.Category = structure(c(19L, 
 17L, 17L, 17L, 17L, 17L, 2L, 19L, 5L, 2L, 9L, 9L, 9L, 4L, 2L, 
 5L, 20L, 2L, 19L, 20L), .Label = c("Communication and Collaboration", 
 "Corporate Services", "Data Centre", "Data Storage Services", 
 "Desktop IT", "Faculty IT", "Help Services", "HR", "Identity Management (User accounts)", 
 "Information Security", "Logistics", "Programmes and Projects", 
 "Quality and Testing", "Research Services", "Security", "SLO Corporate Services", 
 "Software", "Standard", "Teaching Services", "Underpinning Services", 
 "Web Services"), class = "factor"), `CreateDateTime` = structure(c(1370087940, 
 1370156160, 1370162340, 1370178840, 1370190000, 1370240400, 1370242920, 
 1370243040, 1370243040, 1370243280, 1370243280, 1370243520, 1370243580, 
 1370243880, 1370243880, 1370244000, 1370244120, 1370244240, 1370244300, 
 1370244360), class = c("POSIXct", "POSIXt")), `ClosingDateTime` = structure(c(1374501300, 
 1372068300, 1379062020, 1390487100, 1379062080, 1375090560, 1373984760, 
 1370856420, 1370440140, 1370508240, 1370338080, 1370243820, 1370243700, 
 1370255520, 1370341440, 1370248680, 1370353560, 1370338800, 1370257140, 
 1374222600), class = c("POSIXct", "POSIXt"))), .Names = c("Type", 
 "Tkt.Category", "CreateDateTime", "ClosingDateTime"
 ), row.names = c(NA, 20L), class = "data.frame")

And

DF2<-
 structure(list(DateTime = structure(c(1370041200, 1370052000, 
 1370062800, 1370073600, 1370084400, 1370095200, 1370106000, 1370116800, 
 1370127600, 1370138400, 1370149200, 1370160000, 1370170800, 1370181600, 
 1370192400, 1370203200, 1370214000, 1370224800, 1370235600, 1370246400
 ), class = c("POSIXct", "POSIXt"))), .Names = "DateTime", row.names = c(NA, 
 20L), class = "data.frame")

I am trying to get the length of the of a subset of dfSample based on some conditions which involve data from DF2 for each Tkt.Category as below:

QCalc <- function(m) {
  adply(DF2, 1, transform, q=as.character(
                               nrow(subset(m, CreateDateTime <= DateTime & 
                                              ClosingDateTime >= DateTime))))
}

ServiceQueue <- ddply(dfSample, .(Tkt.Category), QCalc)

This does not seem to work, so I guessed there must be problem with the way I formulated the function for the ddply part since the piece of code below works when I am using all my data(not grouping by Tkt.Category):

Q <- adply(DF2, 1, transform, q=as.character(
                                   nrow(subset(dfSample, CreateDateTime<= DateTime &
                                                         `ClosingDateTime>= DateTime))))

When using ddply, the error message I get is that the object 'm' cannot be found. Could someone point me to the right direction to fix this problem?

score 0 · Answer 1 · edited May 23 '17 at 12:05

If we can restate your problem, I think we can see an easier way to solve it. You want to count, for every type of ticket category and every timestamp in a list, how many tickets start before, end after, and have that category. In SQL we would write something like:

SELECT Tkt.Category, DateTime, count(*)
FROM dfSample join DF2 on
CreateDateTime<= DateTime 
and ClosingDateTime>= DateTime
GROUP BY Tkt.Category, DateTime

But this isn't SQL (although maybe it should be, are you pulling this data from a relational db?), its R - and base R doesn't allow us to merge using inequalities. So instead we can do a little trick with merge and avoid plyr all together:

dfSample$id <- rownames(dfSample)
DFc <- merge(dfSample,DF2)
DFlimited <- DFc[DFc$CreateDateTime <= DFc$DateTime & DFc$ClosingDateTime >= DFc$DateTime,]
DFagg <- aggregate(id ~ Tkt.Category + DateTime, data = DFlimited, length)

This might be pretty slow, depending on the size of your tables, since its essentially doing a full outer join and then filtering. If you find that is the case, look into the Data.Table package - you can check out this Stack Overflow question for more.

I am having a problem while merging the two dataframes, ass they are a=of different lengths (one has 70,816 rows and the other has 2921 rows). I have tried using all=TRUE, but it keeps freezing my computer, is there any other way to do it? — NarT, Aug 28 '14 at 14:45
I wanted to use plyr because further down the line i will have to group the counts by Type and Tkt.Category later on. — NarT, Aug 28 '14 at 14:47

Subsetting a data frame by ddply, then applying a function with adply on the subset R

1 Answers1

Linked