0

I am trying to join data from two completely different sources. One source contains an employee's schedule information, and the other tracks what they actually worked (like what time they actually took lunch or break). The problem is, the schedule program gives times as BREAK1, BREAK2, BREAK3, and LUNCH, while the tracking program simply lists them as Lunch and Break. I can join the data and get the lunches just fine, but the breaks are throwing me off. If I convert BREAK1, BREAK2, and BREAK3 to just "Break", I end up with too many segments because it is matching every instance with every other instance. Is there a way that anyone can think of to join these two pieces of information? Thank You.

EDIT At your request, here some sample data:

This is the Scheduled Times:

EMP_ID  NOM_DATE    SEG_CODE    START_MOMENT    STOP_MOMENT
626009  26-Sep-13   BREAK2          9/26/13 5:00 PM 9/26/13 5:15 PM
625650  26-Sep-13   BREAK2          9/26/13 4:30 PM 9/26/13 4:45 PM
638815  26-Sep-13   BREAK2          9/26/13 4:00 PM 9/26/13 4:15 PM
621649  26-Sep-13   BREAK2          9/26/13 3:30 PM 9/26/13 3:45 PM
567005  26-Sep-13   BREAK2          9/26/13 3:30 PM 9/26/13 3:45 PM
626009  26-Sep-13   LUNCH           9/26/13 2:30 PM 9/26/13 3:30 PM
625650  26-Sep-13   LUNCH           9/26/13 1:30 PM 9/26/13 2:30 PM
638815  26-Sep-13   LUNCH           9/26/13 1:30 PM 9/26/13 2:30 PM
621649  26-Sep-13   LUNCH          9/26/13 12:30 PM 9/26/13 1:30 PM
567005  26-Sep-13   LUNCH          9/26/13 12:30 PM 9/26/13 1:30 PM
626009  26-Sep-13   BREAK1         9/26/13 11:45 AM 9/26/13 12:00 PM
625650  26-Sep-13   BREAK1         9/26/13 11:30 AM 9/26/13 11:45 AM
638815  26-Sep-13   BREAK1         9/26/13 11:45 AM 9/26/13 12:00 PM
621649  26-Sep-13   BREAK1          9/26/13 9:30 AM 9/26/13 9:45 AM
567005  26-Sep-13   BREAK1  9/26/13 9:30 AM 9/26/13 9:45 AM

This is the Actual Times

EMP_ID  Seg_Code    Start_Time  Stop_Time
625650  Break           9/26/2013 17:54 9/26/2013 17:55
567005  Break           9/26/2013 14:56 9/26/2013 14:59
567005  Break           9/26/2013 15:32 9/26/2013 15:44
638815  Break           9/26/2013 16:34 9/26/2013 16:47
567005  Break           9/26/2013 10:08 9/26/2013 10:21
626009  Break           9/26/2013 17:01 9/26/2013 17:15
625650  Break           9/26/2013 11:31 9/26/2013 11:45
626009  Break           9/26/2013 11:52 9/26/2013 12:07
621649  Break           9/26/2013 9:34  9/26/2013 9:48
621649  Break           9/26/2013 15:31 9/26/2013 15:45
638815  Break           9/26/2013 11:46 9/26/2013 12:02
625650  Break           9/26/2013 16:35 9/26/2013 16:51
567005  Lunch           9/26/2013 12:31 9/26/2013 13:29
625650  Lunch           9/26/2013 13:31 9/26/2013 14:30
626009  Lunch           9/26/2013 14:31 9/26/2013 15:30
638815  Lunch           9/26/2013 13:31 9/26/2013 14:30
621649  Lunch           9/26/2013 12:31 9/26/2013 13:30

I am trying to get the difference (in minutes) between when they are scheduled, and when they are actually taking breaks. A correct example is:

Badge   Seg_Code    Scheduled Start     Scheduled Stop      Actual Start           Actual Stop      Difference      Seg_Duration
192329  Lunch       9/26/13 8:15 AM     9/26/13 9:15 AM     9/26/2013 8:18:27 AM    9/26/2013 9:17:59 AM        3       0:00:59:32

Thank you again

2 Answers2

0

While generally not a good idea in queries that run frequently, you can use string functions in join conditions. For example in MySQL it would look like this:

CREATE TABLE test1 (
  worktype VARCHAR(20)
);
INSERT INTO test1 VALUES ('BREAK1');
INSERT INTO test1 VALUES ('BREAK2');

CREATE TABLE test2 (
  worktype VARCHAR(20)
);
INSERT INTO test2 VALUES ('Break');

SELECT t1.worktype 't1', t2.worktype 't2'
FROM test1 t1 
JOIN test2 t2 ON LEFT(t1.worktype, LENGTH(t2.worktype)) = LOWER(t2.worktype);

Depending on your application this could be justified - for example if you run that once a week in batch job to copy over data from a source over which you have no control.

As for your second question: If the tables contain no other data which you can use in the join, there is no way to get rid of the problem where "Break" is joined with "BREAK1", and "BREAK2", etc. - in other words the entropy, or information density, is just to low. You have to come up with a "tie breaker" yourself to decide which of the rows ("BREAK1", "BREAK2") you want to show up. For example you could use this rule: "always use BREAK1 over BREAK2". There is not enough information in your post to come up with a rule for you.

Tilo
  • 3,255
  • 26
  • 31
  • Sorry for no making that clear. The EMP_ID field is also used to make sure that the correct employees are matched from each source. Thank you. – Dantalion88 Oct 01 '13 at 20:43
0

Assuming SQL Server:

;WITH Actual_Ranked AS
(
    SELECT
        ROWNUM = CASE Seg_Code
                WHEN 'Break' THEN CAST(ROW_NUMBER() OVER (PARTITION BY EMP_ID, Seg_Code, CAST(Start_Time AS DATE) ORDER BY Start_Time) AS VARCHAR(1))
                ELSE ''
            END,
        EMP_ID,
        Seg_Code,
        Start_Time,
        Stop_Time
    FROM
        #Actual
)
SELECT
    ISNULL(sched.EMP_ID, act.EMP_ID) AS Badge,
    ISNULL(sched.SEG_CODE, (act.SEG_CODE + ROWNUM)) AS Seg_Code,
    CONVERT(VARCHAR, sched.START_MOMENT, 22) AS [Scheduled Start],
    CONVERT(VARCHAR, sched.STOP_MOMENT, 22) AS [Scheduled Stop],
    CONVERT(VARCHAR, act.Start_Time, 22) AS [Actual Start],
    CONVERT(VARCHAR, act.Stop_Time, 22) AS [Actual Stop],
    DATEDIFF(minute, sched.START_MOMENT, act.Start_Time) AS [Difference]
FROM
    #Scheduled sched
    FULL JOIN Actual_Ranked act ON sched.EMP_ID = act.EMP_ID
         AND sched.SEG_CODE = (act.SEG_CODE + ROWNUM);

Replace table names as needed.

The key is that I'm getting the row number of the break and appending that to make the join work.

I didn't include Seg_Duration to keep the answer simple, but you can show the duration in your preferred format by looking at the answers to this question.

SQL Fiddle here.

Community
  • 1
  • 1
zimdanen
  • 5,508
  • 7
  • 44
  • 89
  • Thank you for the reply. When I run your code and change the tables, there are NULLs in the either the "Scheduled Start" and "Scheduled Stop" Columns, or the "Actual Start" and "Actual Stop" fields. I Think it is because for some reason it is converting the "Actual" fields into an 1899 date. I have tried to fix it, but can't see exactly where that is happening. I think it may be the CAST. Thank you again. – Dantalion88 Oct 01 '13 at 20:54
  • @Dantalion88: Using your sample data, there are two rows that will have `NULL`s, but they'll be only in the `Scheduled Start`, `Scheduled Stop`, and `Difference` columns. I'll put together a SqlFiddle to demonstrate the code. If you have data that's not working, I'll need to see it to help you fix it. – zimdanen Oct 01 '13 at 20:56
  • @Dantalion88: Added the SQL Fiddle. Feel free to change to include the data you're having an issue with so I can see it. – zimdanen Oct 01 '13 at 20:59
  • Thank you for your help, I figured it out. I gave some sample data that was actually already "fixed" a little. I added subqueries to modify the data to where it needs to be in your code, and it worked. Thanks Again! – Dantalion88 Oct 01 '13 at 22:07