0

I have a data.frame representing the time sheet for several staff over a period of several months spanning 2 years. The data looks like:

Name         Month      1     2     3   ...  31    Total  Job     ... [more columns]

John Smith   Aug 2017   1:20                       1:20   Typing
Mary Jones   Sep 2017                                     Prooing
John Smith   Oct 2017         0:15           1:10  1:25   Typing
...
Jim Miles    Feb 2018         1:30  2:10           3:40   Admin

There are 31 columns, each representing a date in the corresponding month. There will be multiple rows with the same Name.

So looking at the first entry, John Smith did 1 hour and 20 minutes of work on 1 August 2017.

What I want to do is to analyse these data in a granular way, e.g.

  • How many hours did John Smith spend on Typing in Sept 2017?
  • How much Proofing was done in Jan-Feb 2018?

I am a bit stuck on how to proceed in order to have the data to analyse. Suggestions appreciated.

Added for clarification:

Having read three very helpful replies and looked at tidyr, I have clarified my thoughts and think that I need to modify the data so there is one row for each entry, so the example table will become:

Name         Date          Duration  Job     ... [more columns]

John Smith   01 Aug 2017   1:20      Typing
John Smith   02 Oct 2017   0:15      Typing
John Smith   31 Oct 2017   0:15      Typing
...
Jim Miles    02 Feb 2018   1:30      Admin
Jim Miles    03 Feb 2018   2:10      Admin

Date will need to be formatted correctly but that is not major. The problem is matching the day of month to the relevant Month and year to produce the composite date. Any ideas welcome.

Kaveh1000
  • 153
  • 1
  • 11
  • You can start [here](https://stackoverflow.com/questions/2185252/reshaping-data-frame-from-wide-to-long-format) to get tidy data, then use `dplyr::group_by` to get the different statistics required for each Employee or Task. – A. Suliman Aug 31 '18 at 11:37
  • I'm a fan of the [tidyverse](https://www.tidyverse.org) packages, specifically [dplyr](https://dplyr.tidyverse.org/) and [tidyr](https://tidyr.tidyverse.org/) for these kind of tasks. You can use `tidyr::gather()` to get [long data](https://stackoverflow.com/questions/34590173/long-and-wide-data-when-to-use-what) and then`dplyr::summarise()` to get aggregation statistics. You may want to try some of these approaches and then modify your question for a more specific and detailed problem. – Khlick Aug 31 '18 at 11:49
  • Thank you @Khlick. I have added a clarification with a more specific question. – Kaveh1000 Aug 31 '18 at 21:35
  • No worries! Thanks for clarifying the problem. I'll see if I can work up a solution if a good one doesn't show up in the meantime. Definitely look at the R [cheatsheets](https://www.rstudio.com/resources/cheatsheets/) at rstudio.com, specifically the [data wrangling](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) and [data transformation](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) cheatsheets.. – Khlick Sep 01 '18 at 00:54
  • 1
    Hi @Khlick. Please see my solution. Thanks for your help. I can't vote you it seems. – Kaveh1000 Sep 01 '18 at 11:00
  • Great @Kaveh1000 ! It seems like data wrangling, for some reason, is always the hardest part of any analysis! – Khlick Sep 01 '18 at 11:40

2 Answers2

0

I would approach this by converting total time spent to numeric. Depending on the structure of the data, you could split this string by a colon and convert minutes to hours and sum to get decimal hours.

Something along the lines of this:

x <- c("1:20", "1:25", "3:40")
x <- strsplit(x, ":")

sapply(x, FUN = function(m) {
  m <- as.numeric(m)
  sum(m[1], m[2]/60)
  })

[1] 1.333333 1.416667 3.666667

Then, you could use aggregate to sum by month-year and name.

aggregate(Total ~ Name + Month + Job, data = xy, FUN = sum)

If you need to report by month only, you would have to extract month name in one way or another, but nothing hard.

Roman Luštrik
  • 69,533
  • 24
  • 154
  • 197
0

After following up on the suggestions of @Khlick, I succeeded in using gather():

mydata <- mydata %>% gather(new_date, time_spent, "1":"31")

This produced two new columns, new_date and time_spent, then created a new row for each data point of columns 1–31.

So now I had, for each data point, the month, e.g. Aug 2017 in one column, the day the work was done, e.g. 12 in another. I changed the month to a date in the original spreadsheet, so it became 2017-08-01 (all dates now have 01). Then in R I used substr() and paste() to replace the day to the correct one, i.e. 2017-08-12.

Finally, I was left with a large number of rows with no value in time_spent. I removed those rows.

I now have:

Name Date Duration Job ... [more columns]

John Smith   2017-08-01   1:20      Typing
John Smith   2017-10-02   0:15      Typing
John Smith   2017-10-31   0:15      Typing
...
Jim Miles    2018-02-02   1:30      Admin
Jim Miles    2018-02-03   2:10      Admin

I did a few spot checks and seems data was transformed correctly. Thanks to all, especially to @Khlick.

Kaveh1000
  • 153
  • 1
  • 11