3

I know there are many topics regarding this question but none actually helped me solve my problem. I am still sort of new when it comes to databases and I came across this problem. I have a table named tests which contains two columns: id and date. I want to calculate the average difference of days between a couple values. Say select date from tests where id=1 which will provide me with a list of dates. I want to calculate the avg difference between those days.

Table "tests"

1|2018-03-13
1|2018-03-01
2|2018-03-13
2|2018-03-01
3|2018-03-13
3|2018-03-01
1|2018-03-17
2|2018-03-17
3|2018-03-17

Select date from tests where id=1

2018-03-13
2018-03-01
2018-03-17

Now I am looking to calculate the average difference in days between those three dates. Can really use some help, thank you!

Edit: Sorry for being unclear, I'll clarify my question. So student one had a test on the 01/03, then on the 13/03 and then on the 17/03. What I want to calculate is the avg difference in days between test to test, so: Diff between first to second is 12 days. Diff between second to third is 4 days. 12+6 divided by two since we have two gaps is 8 eight.

Tomer
  • 79
  • 11
  • 1
    Please define what "average difference some dates" means. I would understand "average difference to a given date". For three dates there exist three differences, for four dates there are six differences, for five there are ten differences. – Yunnosch Mar 17 '18 at 18:20
  • 1
    Please show the desired output for some more sample inputs. – Yunnosch Mar 17 '18 at 18:20
  • For the example (13,1,17), the differences are 12,16,4. Correct? The average would be almost 11. – Yunnosch Mar 17 '18 at 18:22
  • Sorry, I edited the question. Thanks! – Tomer Mar 17 '18 at 18:46

1 Answers1

1
  • I am looking to calculate the average difference in days between those three dates.*

And by average difference we mean "take the average of the absolute value of the difference between all dates". That's 12 + 16 + 4 / 3 or 10.6667.

We need all combinations of dates. For this we need a self-join with no repeats. That's accomplished by picking a field and using on with a < or >.

select t1.date, t2.date
from tests as t1
join tests as t2 on t1.id = t2.id and t1.date < t2.date
where t1.id = 1;

2018-03-01|2018-03-13
2018-03-01|2018-03-17
2018-03-13|2018-03-17

Now that we have all combinations, we can take the difference. But not by simply subtracting the dates, SQLite doesn't support that. First, convert them to Julian Days.

sqlite> select julianday(t1.date), julianday(t2.date) from tests as t1 join tests as t2 on t1.id = t2.id and t1.date < t2.date where t1.id = 1;
2458178.5|2458190.5
2458178.5|2458194.5
2458190.5|2458194.5

Now that we have numbers we can take the absolute value of the difference and do an average.

select avg(abs(julianday(t1.date) - julianday(t2.date)))
from tests as t1
join tests as t2 on t1.id = t2.id and t1.date < t2.date
where t1.id = 1;

UPDATE

What I want to calculate is the avg difference in days between test to test, so: Diff between first to second is 12 days. Diff between second to third is 4 days. Then (12+4)/2=8 which should be the result.

For this twist on the problem you want to compare each row with the next one. You want a table like this:

2018-03-01|2018-03-13
2018-03-13|2018-03-17

Other databases have features like window or lag to accomplish this. SQLite doesn't have that. Again, we'll use a self-join, but we have to do it per row. This is a correlated subquery.

select t1.date as date, (
    select t2.date
    from tests t2
    where t1.id = t2.id and t2.date > t1.date
    order by t2.date
    limit 1
) as next
from tests t1
where id = 1
  and next is not null

The subquery-as-column finds the next date for each row.

This is a bit unwieldy, so let's turn it into a view. Then we can use it as a table. Just take out the where id = 1 so it's generally useful.

create view test_and_next as
    select t1.id, t1.date as date, (
        select t2.date
        from tests t2
        where t1.id = t2.id and t2.date > t1.date
        order by t2.date
        limit 1
    ) as next
    from tests t1
    where next is not null

Now we can treat test_and_next as a table with the columns id, date, and next. Then it's the same as before: turn them into Julian Days, subtract, and take the average.

select avg(julianday(next) - julianday(date))
from test_and_next
where id = 1;

Note that this will go sideways when you have two rows with the same date: there's no way for SQL to know which is the "next" one. For example, if there were two tests for ID 1 on "2018-03-13" they'll both choose "2018-03-17" as the "next" one.

2018-03-01|2018-03-13
2018-03-13|2018-03-17
2018-03-13|2018-03-17

I'm not sure how to fix this.

Schwern
  • 153,029
  • 25
  • 195
  • 336
  • Hey, thanks for the response. What I am looking to get is the diff in days between the dates, not the average date of the 3. – Tomer Mar 17 '18 at 18:25
  • @Tomer The difference between each date and each other date? (Six results) Or the difference between each and the average? (Three results) An example result would help clarify. – Schwern Mar 17 '18 at 18:26
  • Sorry, I'll clarify myself. So, if we sort the dates by date we get the following: 01/03,13/03,17/03. The diff between the first and second date is 12 days and the diff between the second and third is 4. Then (12+4)/2=8 which should be the result. – Tomer Mar 17 '18 at 18:29
  • @tomer I think you missed the diff between the first and third, which is 16. (12 + 4 + 16) / 3 is 10.667? – Schwern Mar 17 '18 at 18:42
  • I just edited the question to make it more clear. "Sorry for being unclear, I'll clarify my question. So student one had a test on the 01/03, then on the 13/03 and then on the 17/03. What I want to calculate is the avg difference in days between test to test, so: Diff between first to second is 12 days. Diff between second to third is 4 days. 12+6 divided by two since we have two gaps is 8 eight." – Tomer Mar 17 '18 at 18:47
  • @Tomer I still think either your math is wrong by failing to compare the first and third, or you have some special case in mind. Either way, I've updated my answer. You should be able to take it from there. – Schwern Mar 17 '18 at 18:56
  • I really do appreciate the help but I can not think of a proper solution for my problem with this, thank you though! – Tomer Mar 17 '18 at 19:41
  • @Tomer I've updated it with a solution that does what you need. – Schwern Mar 17 '18 at 22:22
  • Thank you so much for your help. Considering Sqlite3 can not completely provide me with what I need I perform the calculations on my server-side program. My question is, will calculating it through Python for example be less efficient then calculating it in the DB? – Tomer Mar 18 '18 at 16:11
  • @Tomer It depends on how much data there is and, to a lesser extent, what the calculation is. If there's a lot then the overhead of retrieving it all and storing it in memory will dominate, plus having to iterate through it all. If there's not much then the retrieval and iteration cost will be minimal. You'll have to weigh the convenience of doing it in Python vs the potential size of the data. If it's the data for just the tests one student has taken, that sounds like very little data. – Schwern Mar 18 '18 at 23:47