Combine two data sets using date & time ( Sales data with employee attendance)

Question

> sales.csv
        CustomerName    InvoiceDate_Time InvoiceNo InvoiceValue
1    Hendricks, Eric  30-09-2015 1:00 PM        10         5000
2        Baker, Mark  30-09-2015 3:00 PM        11        12000
3   Catalano, Robert 01-10-2015 10:00 AM        12        25000
4     Eaton, Jeffrey  01-10-2015 4:00 PM        13         4000
5    Watanuki, Cathy  02-10-2015 9:00 AM        14        80000
6      Fier, Marilyn  02-10-2015 3:30 PM        15        18000
7     O'Brien, Donna  03-10-2015 1:30 PM        16        25000
8      Perez, Barney  03-10-2015 4:10 PM        17        20000
9 Fitzgerald, Jackie 04-10-2015 11:10 AM        18         6000


> StaffAttendance.csv
       EmployeeName Designation AttendanceIn.DateTime AttendanceOut.DateTime
1        Page, Lisa   Sales Rep    30-09-2015 6:50 AM     30-09-2015 2:00 PM
2    Taylor, Hector     Manager    30-09-2015 7:00 AM     30-09-2015 5:00 PM
3  Dawson, Jonathan   Sales Rep    30-09-2015 1:55 PM     30-09-2015 7:00 PM
4      Duran, Brian   Sales Rep    01-10-2015 6:50 AM     01-10-2015 7:00 PM
5       Pratt, Erik     Manager    01-10-2015 7:20 AM     01-10-2015 5:10 PM
6        Page, Lisa   Sales Rep    02-10-2015 6:55 AM     02-10-2015 6:45 PM
7    Taylor, Hector     Manager    02-10-2015 7:10 AM     02-10-2015 5:20 AM
8      Weber, Larry   Sales Rep    03-10-2015 6:50 AM     03-10-2015 6:55 PM
9       Pratt, Erik     Manager    04-10-2015 7:20 AM     04-10-2015 5:10 PM
10     Duran, Brian   Sales Rep    04-10-2015 7:10 AM     04-10-2015 7:00 PM

As above I have two data tables(CSV files), which I want to combine using date & time. How can I combine using date & time, to find which employees worked for each sales made to customers?

How can I get the resultant table save as a CSV file?

Pls. state the R commands to be used in step by step. Also can I do this in tableau. What are the steps?

Can you provide a minimum working example (see http://jaredknowles.com/journal/2013/5/27/writing-a-minimal-working-example-mwe-in-r) and `dput` snippets of your data so we can work with it? Also, check http://stackoverflow.com/questions/8185201/merge-records-over-time-interval and http://stackoverflow.com/questions/21560500/data-table-merge-based-on-date-ranges which with similar issues as you have. — Felix, Nov 09 '15 at 17:48
You want to combine the two data frames using date and time column as variable to match, but as I see there is no match on date and time between the two. Could you provide an example of what you want to achieve? — Vasile, Nov 09 '15 at 19:49
I want to get the Employee name records from table 2, by matching who are the employee work for the a given sale. So it may replicate the sales record to show when there is multiple employees are worked on particular time. — Klllmmm, Nov 10 '15 at 04:34

Felix · Accepted Answer · 2015-11-16T13:45:59.363

0

Ok, so here is a potential dplyr / data.table / tidyr solution. The general idea is to use the list variable feature of dplyr since version 0.4.0. For each customer we select the employees who were present at the time of his visit (using data.table's between() function) and store them in a list for each customer. We then unnest() the list variable (which copies the customer entry for each unique employee) and merge back the employee information. This results in a data frame of unique customer-staff combinations.

library(dplyr)
library(readr)
library(tidyr)
library(data.table)

#########
# For reproducibility: you can also download the .csv 
# from these Dropbox links using the 'repmis' pkg 
#
# customer <- repmis::source_DropboxData("customer.csv",
#                            "q0sf4uj13hpjz9v",
#                            sep = ",",
#                            header = TRUE)
# 
# staff <- repmis::source_DropboxData("staff.csv",
#                                     "q8p16hchsx8dzoa",
#                                     sep = ",",
#                                     header = TRUE)
##########    

# One problem with the original .csv is the formatting of the time: the
# hour is given with a single digit; not in the format 0+digit. We therefore 
# use '%k' in as.POSIXct() to parse the time instead of %H:

customer <- read_csv("https://www.dropbox.com/s/q8p16hchsx8dzoa/staff.csv?dl=1") %>% 
  mutate(date = as.POSIXct(date, "%d-%m-%Y %k:%M", tz = "Europe/Berlin"))

staff <- read_csv("staff.csv")  %>% 
  mutate(start = as.POSIXct(start, "%d-%m-%Y %k:%M", tz = "Europe/Berlin"),
         end = as.POSIXct(end, "%d-%m-%Y %k:%M", tz = "Europe/Berlin"))

# Now we group by customer and copy for each customer 
# the list of employee names who were present at the date of the customer interaction:

staff_customer <- customer %>% 
  group_by(c.name) %>% # for each customer....
  mutate(employee = list(staff[data.table::between(date, staff$start, staff$end), c("employee", "Record ID")])) %>% # ... select all employees which were present during the customer's visit and store them in a list
  unnest() %>% # unnest this list using tidyr
  left_join(., staff) # copy the staff information back (if necessary)

This results in the following table (only first 10 rows shown):

Source: local data frame [6 x 7]

  RecordID                  c.name                date                  employee Record ID               start                 end
1      826 NYLLYNYBCSNCYCKC/BL MRS 2014-10-07 06:35:00   Miss. A A D R ATHAPATTU      2612 2014-10-06 18:05:00 2014-10-07 08:27:00
2      826 NYLLYNYBCSNCYCKC/BL MRS 2014-10-07 06:35:00   Mr. K K R C CHATHURAPLA      2650 2014-10-06 18:05:00 2014-10-07 08:37:00
3      826 NYLLYNYBCSNCYCKC/BL MRS 2014-10-07 06:35:00    Mr. R P C P RAJAPAKRHA      2596 2014-10-06 18:05:00 2014-10-07 08:03:00
4      826 NYLLYNYBCSNCYCKC/BL MRS 2014-10-07 06:35:00        Mr. E A M LUDDHIKA      2699 2014-10-06 18:26:00 2014-10-07 08:31:00
5      826 NYLLYNYBCSNCYCKC/BL MRS 2014-10-07 06:35:00         Mr. R W P L RILVA      2673 2014-10-06 18:27:00 2014-10-07 08:26:00
6      826 NYLLYNYBCSNCYCKC/BL MRS 2014-10-07 06:35:00 Mrs. D.A.R.R. KARUPARATPE      2565 2014-10-06 18:31:00 2014-10-07 08:28:00

Unfortunately I don't know how this would work in Tableau.

edited Nov 16 '15 at 13:45

answered Nov 10 '15 at 13:09

Felix

1,611
13
22

Mr. Felix, Its exactly what I wanted. Thank you very much. – Klllmmm Nov 10 '15 at 15:57
How can I apply this command to a big database where 1900 customer records with 4 variables (256KB) & 48600 staff attendance records with 3 variables (2.9MB) are in? I'm using R-studio for windows 32bit. Is there 64bit r-studio for windows? How to overcome this error (Cot. next comment) – Klllmmm Nov 10 '15 at 16:16
'Error: cannot allocate vector of size 354.0 Mb In addition: Warning messages: 1: In NextMethod("[") : Reached total allocation of 3996Mb: see help(memory.size) 2: In NextMethod("[") : Reached total allocation of 3996Mb: see help(memory.size) 3: In NextMethod("[") : Reached total allocation of 3996Mb: see help(memory.size) 4: In NextMethod("[") : Reached total allocation of 3996Mb: see help(memory.size)' Thanks in advance... – Klllmmm Nov 10 '15 at 16:16
Ok, I was worried about the size of your original files. The method I'm proposing definitely does not scale b/c it generates a combination of all customer-staff combinations which is in your case 92 million. There is a 64bit version of R, but for the number of observations you're dealing with my proposed solution is definitely not suitable. – Felix Nov 10 '15 at 17:07
@Klllmmm I edited my answer and use a dplyr / tidyr approach now. Pls check and see if that works. – Felix Nov 10 '15 at 17:24
Tried, but its not successful, Error message comes as "Assertion failed"! Programme : C:\programe files\RStudio\bin\x64\rsession.exe File: D:/recompile/CRAnpkg/lib/3.2/BH/include/boost/smart_ptr/scoped_ptr.hpp,Line99 Expression: px !=0 Also message pop up "R Session aborted, R encountered a fatal error. The session was terminated." – Klllmmm Nov 10 '15 at 18:14
Can you be a bit more specific what exactly failed? The old code, or the one I edited? My new solution should work with less memory than the one before. So no need to update R to 64bit (unless, you're on a 64bit machine). Also, all of this would be much easier if you provided a reproducible example or posted some of your data. Without this, it is difficult to help with the exact problems that you are encountering. – Felix Nov 10 '15 at 18:38
Sorry abt it, I worked out with new code given by u, It perfectly work with the data set you are using. But when I apply to large volume of data that error occurs. – Klllmmm Nov 11 '15 at 00:18
I would probably need a subset of your original data to work with, to further investigate the issue. I suggest you read up on how to produce a reproducible example (see [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)) and how to generate a sample of your data for us to use (e.g. with `dput()`). – Felix Nov 11 '15 at 08:42
is there any other way to give my data set. I reduce the data set arbitrarily. ( for a selected customers & staff for a particular month), But the resulting data set is not produces as intended. (Employee , start, end are all state as NA) Can i email the data set if possible? – Klllmmm Nov 15 '15 at 14:58
I added two CSV files. customer.csv & staff.csv . customer.csv has columns such as RecordId , c.name , date (only 35 records). And staff.csv has column such as Record ID , employee , start , end ( only 160 records). – Klllmmm Nov 15 '15 at 15:16
Where did you add the .csv files? To save data frames as csv do `write.csv(name_of_your_dataframe, "customer.csv")`. If you would like to provide the entire datasets upload your csv files here: http://expirebox.com/ and provide a link. – Felix Nov 15 '15 at 16:23
Yes, Staff attendance detail data http://www.expirebox.com/download/0b1f1bf2d5e0790ba1143e9f535a57e1.html customer data http://www.expirebox.com/download/0b75e9e0377ccec7a2fd9c015c859159.html resultant data from the code http://www.expirebox.com/download/092f268a5a2d4529197f2982d1f7752e.html thank you very much for your time & effort – Klllmmm Nov 15 '15 at 17:07
Codes I used were library(dplyr) library(tidyr) library(lubridate) library(data.table) customer <- read.csv("C:\\upload\\customer.csv", header = T) staff <- read.csv("C:\\upload\\Staff.csv", header = T) – Klllmmm Nov 16 '15 at 00:40
staff_customer <- customer %>% group_by(c.name) %>% # for each customer.... mutate(employee = list(staff[data.table::between(date, staff$start, staff$end), "employee"])) %>% # ... select all employees which were present during the customer's visit and store them in a list unnest() %>% # unnest this list using tidyr left_join(., staff) # copy the staff information back (if necessary) write.csv(staff_customer, "C:\\upload\\staffcustomer123.csv") – Klllmmm Nov 16 '15 at 00:40
@Klllmmm I've edited my answer to reflect the structure of your .csvs. From what I can tell, there was an issue with reading the time due to an uncommon time format (single digits for hours instead of double digits). The rest of the solution works as intended. I updated the answer with the correct code, check it out. Also, I've added to links to the csv on a more permanent basis, so that others can reproduce the solution. If you would like me to take that down let me know. – Felix Nov 16 '15 at 13:49
@ Mr.Felix, I can change the time format to double digit.(02/10/2014 06:50) in both CSV files. Then how we can amend the code change? – Klllmmm Nov 19 '15 at 17:22
If you have double digits for the time, change `%k` in the `as.POSIXct()` command to `%H`. Then it should work accordingly; no other changes needed. – Felix Nov 20 '15 at 12:58
Thanks Mr. Felix, I tried lot but failed to change time format in my CSV files. So i gave up. Now I want to select all records from "staff_customer" table, that satisfied below criteria . 1. "employee" name is exactly similar "Mrs. D.A.R.R. KARUPARATPE" 2. "employee" name includes " KARUPAR". Like wise i have several criteria, & the such satisfied records need to take into a separate data tables. How do i code for these two requirment? Thank you very much for the support. – Klllmmm Nov 22 '15 at 04:15
I am sorry that you weren't able to change the time format. `R`'s `as.POSIXct()` is quite flexible in reading different time formats. I don't understand how and why the .csv you have differs in time format that you gave me in your example data. Also, it is much easier for us here on SO to help you if you provide some context to your question; what kind of data is it? Where does it come from? How do you generate your .csv? Is the example data different from the data you actually need to process? – Felix Nov 23 '15 at 11:05
I don't mean to sound harsh, but it is very difficult (and frustrating for us) to try to help if we don't have enough information. – Felix Nov 23 '15 at 11:08
Your follow-up question constitutes a new question which is not directly related to the question you asked here, but has been answered on Stack Overflow before. Before you open a new question, make sure you've researched all the answers. Read [here](http://meta.stackoverflow.com/questions/260648/stack-overflow-question-checklist) and [here](http://stackoverflow.com/help/how-to-ask) how to ask a good SO question. If you follow this advice, the resulting answers are much more likely to be satisfactory. – Felix Nov 23 '15 at 11:11
Though the code work perfectly all these times, today this error pops up Error in eval(expr, envir, enclos) : could not find function "read_csv" My code is exactly same – Klllmmm Dec 28 '15 at 00:57
library(dplyr) library(readr) library(tidyr) library(data.table) customer1 <- read_csv("C:\\Users\\Customer.csv") %>% mutate(date = as.POSIXct(date, "%d-%m-%Y %k:%M", tz = "Europe/Berlin")) Pls. help me on this? – Klllmmm Dec 28 '15 at 00:57

Combine two data sets using date & time ( Sales data with employee attendance)

1 Answers1