
I have some data about specific jobs; the important parts are the start time and the end time of each job. I would like to plot the count of simultaneously running jobs, with time on the x-axis and the number of jobs running at that point in time on the y-axis.

Since I'm new to R, I started with some preprocessing steps, like merging the date and time columns, converting them to POSIXlt, calculating time differences (difftime) and so on. Now I'm a bit stuck. I don't need code, but I would appreciate any hint on how to approach this.

Specifically, I don't really know how to treat a job as an interval spanning its processing time instead of just using its starting point.

This is my data frame:

'data.frame':   10000 obs. of  7 variables:
 $ Process_name         : Factor 
 $ Process_start        : POSIXlt, format: "2009-12-23 03:44:38" 
 $ Process_end          : POSIXlt, format: "2009-12-23 03:44:42" 
 $ Process_duration(s)  : Class 'difftime'  atomic [1:10000] 4 75 1 2 1 
 $ ProcessIncludedInJob : Factor

I want to know how many jobs are running simultaneously at a specific point in time. A job is a process that runs for some time. During its run, another job could start and run simultaneously. I want to calculate and plot this for further analysis. My first approach was to put the date on the x-axis and use, for example, either the start date or the end date for the y-axis. But since every job is an interval and not just a point in time (start or end), I am not able to see how many jobs are running simultaneously. So I guess I must somehow use the job start column and the job duration column.

  • 2
    You are getting a bunch of down votes because you haven't included any way for people to provide an answer to your question. Use `dput()` to output your data in a format we can easily read it in as a bare minimum. Ideally you work through a subset of your data by hand and show us an example of what your inputs are and what outputs you want. [This post](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) can help you with other ways to make your question better. – Barker Nov 03 '16 at 23:22
  • Yes, please edit the output from `dput(head(df,10))` into your question. Then we can get going... – smci Nov 19 '16 at 13:33

1 Answer


I'll sketch an outline here, but we really need you to post reproducible data (please!):

  • at any time t, num_running_processes = (number of processes started) - (number of processes ended). (This is always going to be an integer between 0 and n, where n is the total number of processes.)
  • which translates into `df$num_running_processes <- sum(Process_start <= t) - sum(Process_end < t)`. Note `Process_end < t`, not `<=`.

  • Now you don't really need to sample the time interval at fixed timesteps (of e.g. 1 min, or 5 sec or whatever), since you know that num_running_processes only ever changes value at one of the times in either Process_start or Process_end.

  • so your time-axis can be the set union: `df$t <- union(Process_start, Process_end)`. You have a non-uniform time-axis and that's ok. Note that it's also out-of-order, i.e. a new process could start before a previous one has ended. (We'll reorder things by ordering the dataframe by time-axis)
  • also compute another column df$num_running_processes as above
  • before you plot, sort or order your dataframe by time-axis df$t (dplyr library is nice for doing these manipulations)
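
The outline above could be sketched roughly as follows. This is only a minimal illustration under some assumptions: the `jobs` data frame is made-up toy data (not the asker's), the timestamps are stored as POSIXct rather than POSIXlt (you could convert with `as.POSIXct()`), and `t`, `num_running` and `counts` are names chosen here for illustration. The counts go into their own data frame because the union of start and end times generally has a different length than the original one.

```r
# Toy data, made up for illustration only
jobs <- data.frame(
  Process_start = as.POSIXct(c("2009-12-23 03:44:38", "2009-12-23 03:44:40",
                               "2009-12-23 03:44:41", "2009-12-23 03:44:50"),
                             tz = "UTC"),
  Process_end   = as.POSIXct(c("2009-12-23 03:44:42", "2009-12-23 03:44:45",
                               "2009-12-23 03:44:43", "2009-12-23 03:44:52"),
                             tz = "UTC")
)

# The count can only change at a start or an end time, so use the
# sorted set union of both columns as the time axis
t <- sort(unique(c(jobs$Process_start, jobs$Process_end)))

# At each time point: processes started so far minus processes already ended.
# The strict '<' means a process still counts as running at its own end time.
num_running <- vapply(
  seq_along(t),
  function(i) sum(jobs$Process_start <= t[i]) - sum(jobs$Process_end < t[i]),
  integer(1)
)

counts <- data.frame(t = t, num_running = num_running)
```

With that in place, something like `plot(counts$t, counts$num_running, type = "s")` should draw the count as a step function over the non-uniform time axis. Because of the strict `<`, the count at the very last end time is still 1 rather than 0.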
  • specifically, make sure you assign -1 to ending times and +1 to starting times, sort the data set into increasing time order, then just use `cumsum()` ... – Ben Bolker Nov 19 '16 at 13:50
  • 1
    Yeah I first thought of using `cumsum()`, but then I realized the time-axis is out-of-order so we need to compute all of `sum(Process_start <= t) - sum(Process_end <= t)` at each start- or end-timepoint. So yeah the alternative is to convert the dataframe into a dataframe with a `Start_Stop` and `time` columns with `+1` (start) events and start-times, and `-1` (end) events and end-times, reorder that dataframe by time, then compute `df$num_running_processes` up to that time point directly from cumsum on the Start_Stop (+1/-1) column. But that works if the dataframe was reordered by time axis. – smci Nov 19 '16 at 13:53
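
The `+1`/`-1` approach from these comments might look something like the sketch below. It is a hedged illustration, not a definitive implementation: `jobs` is made-up toy data, timestamps are assumed to be POSIXct, and `events`, `time` and `delta` are hypothetical names. Placing starts before ends when timestamps tie is also an assumption made here; pick whichever tie rule fits your definition of "running".

```r
# Toy data, made up for illustration only
jobs <- data.frame(
  Process_start = as.POSIXct(c("2009-12-23 03:44:38", "2009-12-23 03:44:40",
                               "2009-12-23 03:44:41", "2009-12-23 03:44:50"),
                             tz = "UTC"),
  Process_end   = as.POSIXct(c("2009-12-23 03:44:42", "2009-12-23 03:44:45",
                               "2009-12-23 03:44:43", "2009-12-23 03:44:52"),
                             tz = "UTC")
)

# One row per event: +1 for every start, -1 for every end
events <- data.frame(
  time  = c(jobs$Process_start, jobs$Process_end),
  delta = rep(c(1L, -1L), each = nrow(jobs))
)

# Reorder by time; on ties, starts (+1) come before ends (-1) (assumed rule)
events <- events[order(events$time, -events$delta), ]

# Running total of the +1/-1 column = number of jobs running after each event
events$num_running <- cumsum(events$delta)
```

Here `events$num_running` is the count just after each event, so a job no longer counts at its own end time (the `<=` convention on `Process_end`). A step plot such as `plot(events$time, events$num_running, type = "s")` should then show the load over time.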