another one to the #greatest-n-per-group party!
My previous code:
select count(*)
from revisions join files on rev_file = file_id
where rev_parent_id like 0
and rev_timestamp between '20011231230000' and '20191231225959'
and file_namespace like 0
and file_is_redirect like 0
The problem is, that for some files there are multiple entries with rev_parent_id=0. I would like to count only those with earliest rev_timestamp, but my attempts using the answers in SQL select only rows with max value on a column and Select Earliest Date and Time from List of Distinct User Sessions give me cca 9 000 and 11 000 000. The correct number should be cca 422 000. Perhaps I fail in joining three tables correctly, here is one of my attempts (that one with 9 000 results):
select count(r1.rev_file)
from revisions r1
left outer join revisions r2 on (r1.rev_file = r2.rev_file
and r1.rev_timestamp < r2.rev_timestamp)
join files on r1.rev_file = file_id
where r2.rev_file is NULL
and r1.rev_parent_id like 0
and r1.rev_timestamp between '20011231230000' and '20191231225959'
and file_namespace like 0
and file_is_redirect like 0
The table structures:
files
file_id, file_namespace, file_is_redirect
1234, 0, 0
1235, 3, 1
1236, 3, 0
revisions
rev_file, rev_id, rev_parent_id, rev_timestamp
1234, 19, 16, 20170302061522
1234, 16, 0, 20170302061428
1234, 14, 12, 20170302061422
1234, 12, 0, 20170302061237
1235, 21, 18, 20170302061815
1235, 18, 13, 20170302061501
1235, 13, 8, 20170302061355
1235, 8, 3, 20170302061213
1235, 3, 0, 20170302061002
1236, 6, 0, 20170302061014
file_id = rev_file = id of the file. file_namespace = mimetype of the file, 0 is plaintext. rev_id = id of the revision. rev_parent_id = id of the parent revision. rev_timestamp = timestamp of the revision
The only valid file is 1234 and it has been removed and recreated, so it has got two rev_parent_id=0 entries. I want to count the file only if the older rev_parent_id=0 revision was between selected times.