Calculating the longest uninterrupted speech duration

Question

I want to calculate the longest duration of uninterrupted speech for both speakers. Data is stored as XML in a MS SQL db.

So far, I've converted the xml into table (in MSSQL) which outputs the result like this.. the generated table is chronologically ordered (time in ms - stored as int).

speaker  duration    time
 1      480           0
 2      100           0
 2      260         100
 2      200         360
 1      2640        480
 2      280         560
 ..     ..          ..

Next, I want to iterate through the result with the following logic:

If the speaker id is the same, keep adding duration
if the speaker changes, compare the current speaker overall time with some global variable (update the variable if new speak time is bigger)
apply the same logic for the other speaker.

I need to do this in SQL - i'm new to writing conditional SQL. I don't know how to go about doing it.. :(

How is the table ordered, is there a datetime field? Note that there's no _inherent order_ of a table. — Tim Schmelter, Jan 10 '13 at 11:52
You need another column. Tables have no inherent order, so we can't tell which row comes first, second, etc. — Damien_The_Unbeliever, Jan 10 '13 at 11:53
The new table I generate (as shown above) is chronologically ordered. — user1964834, Jan 10 '13 at 11:53
No, we need a column (or set of columns) by which we (and SQL Server) can say which row came first. I thought you'd added that, but now see that you have two rows with `time` 0, so that's not what we need. — Damien_The_Unbeliever, Jan 10 '13 at 12:03

Damien_The_Unbeliever · Accepted Answer · 2013-01-10T13:50:12.837

You get SQL Server to do the required accumulation, once you have some way of ordering the rows (here, I've added ord):

declare @t table (speaker int not null,duration int not null,ord int not null)
insert into @t (speaker,duration,ord) values
 (1,      480,1),
 (2,      100,2),
 (2,      260,3),
 (2,      200,4),
 (1,      2640,5),
 (2,      280,6)

 ;with Merged as (
    select speaker,duration,ord,ord as last
    from @t t1
    where not exists(
          select * from @t t2
          where t1.speaker = t2.speaker and t1.ord = t2.ord + 1)
    union all
    select m.speaker,m.duration+t.duration,m.ord,t.ord
    from Merged m
    inner join @t t on m.speaker = t.speaker and m.last = t.ord - 1
), Final as (
    select speaker,duration,ord,last,
        ROW_NUMBER() OVER (PARTITION BY ord ORDER by last desc) as rn
    from Merged
)
select * from Final where rn = 1 order by duration desc

Result:

speaker     duration    ord         last        rn
----------- ----------- ----------- ----------- --------------------
1           2640        5           5           1
2           560         2           4           1
1           480         1           1           1
2           280         6           6           1

So speaker 1 had the longest single duration at 2640, speaker 2 came second with 560, etc.

The above query uses two Common Table Expressions (CTE). In the first (Merged), we define one recursively. The first part of the query finds rows for which there isn't an immediately preceding row with the same speaker (so, logically, these are the first rows of each unbroken section of speech for the speaker).

In the recursive portion, we add the next row if it belongs to the same speaker, and we keep track (in last) of which row we added last. This recursive part runs as many times as needed to accumulate the unbroken sections.

Unfortunately, the set that Merged produces also includes all of the intermediate steps we took in building up the uninterrupted speech. So in Final, assign a ROW_NUMBER() which allows us to easily find the last output for each row that was part of the initial set produced by Merged. And so the final query just selects those rows.

If you don't have a convenient column like ord, as I do above (monotonically increasing), you can simply generate such a column using another CTE, and whatever columns you do have that uniquely order the rows(*). So if you can uniquely identify rows by a column called time, you could place this CTE as the first one:

;WITH StrictOrdered as (
    SELECT speaker,duration,
       ROW_NUMBER() OVER (ORDER BY time) as ord
    FROM YourTable
)

And then replace all uses of @t in the remainder of my query with StrictOrdered.

(* Your updated example's time doesn't fit this requirement)

To get just the highest value for each speaker, replace:

select * from Final where rn = 1 order by duration desc

with:

, Highest as (
   select *,ROW_NUMBER() OVER (PARTITION BY Speaker ORDER BY duration desc) as rnDuration
   from Final where rn = 1
)
select * from Highest where rnDuration = 1

Thanks a million for your help. greatly appreciated. I however only need the top row for both speaker, not all the values it currently returns. How do I do that? I tried using Distinct, but that won't do it. Also, it seems that the table final is temporary, so i can't run the select statement twice. — user1964834, Jan 10 '13 at 13:24
@user1964834 - I've added it at the bottom of my answer - we introduce another CTE that takes the result from `Final` and runs another `ROW_NUMBER()` over it, to find the top result for each speaker. (Although if you think it through, we could *replace* `Final` with this query, since the longest result for each speaker must, logically, be one of the final results that `Final` was locating for us) — Damien_The_Unbeliever, Jan 10 '13 at 13:41
I'm reading through the answer. I didn't know something like CTE exists - that thing is amazing :) — user1964834, Jan 10 '13 at 13:49
@user1964834 - I've added a link to the MSDN documentation for them, if you want to read more about them. — Damien_The_Unbeliever, Jan 10 '13 at 13:50
I took your suggestion to add a unique row number. The script works as you would expect. However, in some cases, the script crashes due to recursion exhaustion - Msg 530, Level 16, State 1, Line 19 The statement terminated. The maximum recursion 100 has been exhausted before statement completion. — user1964834, Jan 10 '13 at 15:00
@user1964834 - you can add [`OPTION (MAXRECURSION 0)`](http://msdn.microsoft.com/en-gb/library/ms181714.aspx) to the end of the last `SELECT` and it won't limit recursion. — Damien_The_Unbeliever, Jan 10 '13 at 15:04

Gordon Linoff · Answer 2 · 2013-01-11T15:17:17.843

This is another approach to this problem.

As with Damien's solution, you need to add a sequence number to the data, because SQL tables are inherently unordered sets so you need a column to define any ordering. I'll call this ord (although I usually just use id). A typical way to generate this is to have statements such as:

create table as (. . .
    ord int identity(1,1),
    . . .
)

create view v_table as
    select <everything but ord>
    from table

Then you can insert or bulk insert into the view, and the ord column is created automatically.

For each speaker, I want to group together the consecutive rows, by assigning them a value. The value I'll chose is the "ord" value of the row following them:

1   480  1   2
2   100  2   5
2   260  3   5
2   200  4   5
1   2640 5   . . .

On this result, I group by the last column, take the sum of the durations and choose the maximum duration.

The challenge is getting that fourth column. For this, I'll use a correlated subquery. The following puts this all together:

select top 1 speaker, sum(duration)
from (select t.*,
             (select min(ord) from t t2 where t2.speaker <> t.speaker and t2.ord > t.ord
             ) as GroupingValue
      from t
     ) t
group by GroupingValue, speaker
order by sum(duration) desc

To get the top duration for each speaker, you can simply use another window function, row_number(). This requires just one more level of subqueries, though, for which I'm using a CTE:

with SpeakerDur as (
     select speaker, sum(duration) as GroupedDuration
     from (select t.*,
                  (select min(ord) from t t2 where t2.speaker <> t.speaker and t2.ord > t.ord
                  ) as GroupingValue
           from t
          ) t
     group by GroupingValue, speaker
    )
select *
from (select sd.*,
             row_number() over (partition by speaker order by GroupedDuration desc) as seqnum
      from SpeakerDur sd
     ) sd
where seqnum = 1

The row_number() assigns the sequential numbers 1, 2, 3, etc. to the duration for each speaker (partition by speaker) starting with the longest duration (order by GroupedDuration desc). It then chooses the top value. If you want the top five, you can change the where clause to seqnum <= 5, for instance.

I don't have a huge SQL background - but by the looks of this query, it will execute faster than damien's solution, but I might be completely wrong. In your solution, how do I get the top duration for both speakers? — user1964834, Jan 11 '13 at 10:58

Calculating the longest uninterrupted speech duration

2 Answers2