0

Telcos collect phone data based on impulses, be it cell phone towers or a change of the net for example. Thereby, unwanted duplicates are built and the task is to get them out, if the only point of difference might be a slight time difference or the presence of a cell tower. An example is given here:

This is an example of the cellphone data collected, date, Phone 1, and Phone2 are he same. What is different is sometimes the time and sometimes the presence of 'from_Tower' data.

SELECT '4/04/2020'  as [Date],  '7:03:46'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '1' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:03:46'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:03:48'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '5' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:03:48'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, ''as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:03:48'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '.'as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:34:33'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:34:34'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:37:34'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:37:34'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2 ,'-' as From_Tower 

How can I de-dup the data, that only 4 lines remain like:

SELECT '4/04/2020'  as [Date],  '7:03:46'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '1' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:03:48'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '5' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:34:34'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:37:34'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2 ,'-' as From_Tower 

I could use a max(time)... where from_Tower is not null; How do I use the max function when there are hundreds of Phone1 numbers with duplicates. Would it be useful to use arrays for the time, and how could that be done? Is there any other, better way to do this in a clean way?

There are inspirational thoughts in the section below, which made me think, the best way to solve this might be a LAG function for the time, and whenever time is more than 2 seconds, the data is skipped. So I created this code:

       [CTE_dedup] as(

 SELECT [Date],[Time], [Phone1],  [Phone2], [From_Tower], 
  LAG([To_A])
      OVER (PARTITION BY [Time] ORDER BY [Date]) AS Previous_count, 
      [Time] - LAG([Time])
      OVER (PARTITION BY [Time]ORDER BY [Date] ) AS Timedifference
   
    FROM [CTE_set])


    select * from [CTE_dedup]
    WHERE Timedifference >2

But I get this error:

Operand data type time is invalid for subtract operator.

time is imported as:

time (time(7), null

why can I not subtract the time fields from each other?

Anna
  • 444
  • 1
  • 5
  • 23
  • You talk about de-duping, then you mention arrays. I'm confused about what you actually want. Please be clear about your expected results and the logic behind them – Charlieface Sep 09 '21 at 10:03
  • 1
    Does this answer your question>? [Get top 1 row of each group](https://stackoverflow.com/questions/6841605/get-top-1-row-of-each-group)? – Thom A Sep 09 '21 at 10:03
  • Could you please clarify again how two rows are considered duplicates of each other? Would a simple `SELECT MAX(tower) ... GROUP BY date, time, phone1, phone2` work? – Salman A Sep 09 '21 at 10:09
  • 1
    In this context is "time" the "time of day" or "call duration?" What are your rules that consider rows to be "nearly duplicate"? Can you explain why the row with Time `7:34:33` got removed? Can you explain why the row with Tower `.` got removed? – AlwaysLearning Sep 09 '21 at 10:47
  • @Anna . . . Your data has no `NULL` values in the `from_tower` column, so the question is rather confusing. Also, why is the date/time stored in separate columns when SQL Server supports multiple `datetime` types. – Gordon Linoff Sep 09 '21 at 11:19

1 Answers1

0

here is the answer: How do I convert time into an integer in SQL Server

      WITH [CTE_set_sceen] AS  
    (
   SELECT [Date],[Time], [Phone1],  [Phone2], [From_Tower] 
    ,CONVERT(INT, REPLACE(CONVERT(VARCHAR(8),  [Time_n] , 108), ':','')) AS [IntVersion]
    FROM [data] 
   ),

   [CTE_dedup] as(

 SELECT[Date],[Time], [Phone1], [Phone2], [From_Tower], 
     [IntVersion] - LAG([IntVersion])
      OVER (PARTITION BY [Date] ORDER BY [Time] ) AS Timedifference

    FROM [CTE_set_sceen]
)


    select * from [CTE_dedup]
    WHERE Timedifference >2
Anna
  • 444
  • 1
  • 5
  • 23