1

another one to the #greatest-n-per-group party!

My previous code:

select count(*)
  from revisions join files on rev_file = file_id
 where rev_parent_id like 0
   and rev_timestamp between '20011231230000' and '20191231225959'
   and file_namespace like 0
   and file_is_redirect like 0

The problem is, that for some files there are multiple entries with rev_parent_id=0. I would like to count only those with earliest rev_timestamp, but my attempts using the answers in SQL select only rows with max value on a column and Select Earliest Date and Time from List of Distinct User Sessions give me cca 9 000 and 11 000 000. The correct number should be cca 422 000. Perhaps I fail in joining three tables correctly, here is one of my attempts (that one with 9 000 results):

select count(r1.rev_file) 
  from revisions r1
  left outer join revisions r2 on (r1.rev_file = r2.rev_file
                              and r1.rev_timestamp < r2.rev_timestamp) 
  join files on r1.rev_file = file_id 
 where r2.rev_file is NULL
   and r1.rev_parent_id like 0 
   and r1.rev_timestamp between '20011231230000' and '20191231225959' 
   and file_namespace like 0
   and file_is_redirect like 0

The table structures:

files
file_id, file_namespace, file_is_redirect
1234, 0, 0
1235, 3, 1
1236, 3, 0

revisions
rev_file, rev_id, rev_parent_id, rev_timestamp
1234, 19, 16, 20170302061522
1234, 16, 0, 20170302061428
1234, 14, 12, 20170302061422
1234, 12, 0, 20170302061237
1235, 21, 18, 20170302061815
1235, 18, 13, 20170302061501
1235, 13, 8, 20170302061355
1235, 8, 3, 20170302061213
1235, 3, 0, 20170302061002
1236, 6, 0, 20170302061014

file_id = rev_file = id of the file. file_namespace = mimetype of the file, 0 is plaintext. rev_id = id of the revision. rev_parent_id = id of the parent revision. rev_timestamp = timestamp of the revision

The only valid file is 1234 and it has been removed and recreated, so it has got two rev_parent_id=0 entries. I want to count the file only if the older rev_parent_id=0 revision was between selected times.

aleskva
  • 1,644
  • 2
  • 21
  • 40
  • update your question add that the related tables schema a proper data sample and the expected result – ScaisEdge Feb 18 '19 at 13:40
  • I thought the names of fields make it obvious, but OK, I edited the question. Is there any option to add tables? There is nothing in https://stackoverflow.com/editing-help – aleskva Feb 18 '19 at 13:53

3 Answers3

1

First, let's use a subquery to locate the earliest timestamps in revisions for each rev_file, meeting your criteria.

          SELECT MIN(rev_timestamp) rev_timestamp, rev_file
            FROM revisons
           WHERE rev_parent_id like 0 
             AND rev_timestamp between '20011231230000' and '20191231225959' 
           GROUP BY rev_file

This gives you a virtual table with the earliest timestamp for each file matching your criteria.

Next, join that table to your other tables like this

SELECT COUNT(*) count
  FROM revisions r1
  JOIN (
          SELECT MIN(rev_timestamp) rev_timestamp, rev_file
            FROM revisons
           WHERE rev_parent_id like 0 
             AND rev_timestamp between '20011231230000' and '20191231225959' 
           GROUP BY rev_file
       ) rmin ON r1.rev_timstamp = rmin.rev_timestamp
             AND r1.rev_file = rmin.rev_file
  JOIN files f ON r1.rev_file = file_id
   and f.file_namespace like 0
   and f.file_is_redirect like 0            

Pro tip: Formatting your queries so they are readable is always worth the trouble.

Pro tip: Use COUNT(*) rather than COUNT(col) where possible. It's faster. And, unless the col you mention potentially contains NULL values, it yields the same result. That's not the case for queries in the question.

Pro tip: always qualify your columns in JOIN operations (f.file_is_redirect rather than file_is_redirect). Again, readability of queries is the motivation. If you're fortunate enough to have your code maintained by someone else one day, that person will be happy to see this. That's a vital part of "professional and enthusiast" programming.

Pro tip: numeric_col LIKE 0 kills performance. It is for matching text (column LIKE '%verflo' matches Stack Overflow). When you use LIKE on a numeric column it coerces the data type of each column to a string, and then runs the LIKE operator on it, defeating the use of any index you have on the numeric column.

O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • Pro tip 2: I know, I used count(r1.rev_file) because count(r1.*) does not work Pro tip 3: files and revisions have no similarly named field by design, that's why I don't need to bother here Pro tip 4: I understand, but numeric_column like 0 is better readable in sentence-like languages like SQL and it works the same as = most of the time (not in join on) – aleskva Feb 18 '19 at 15:07
  • Please see my edits for enlarged reasons for my pro tips. – O. Jones Feb 18 '19 at 15:29
1

You should join the subquery for min rev_timestamp for rev_file

    select count(*) 
    from revisions 
    join files on rev_file = file_id 
    join  (

        select rev_file, min(rev_timestamp) min_time
        from revisions
        where rev_parent_id = 0 
        group  by rev_file

    ) t on t.min_time  = revisions.rev_timestamp 
            and t.rev_file = revisions.rev_file
    where rev_parent_id like 0 
    and rev_timestamp between '20011231230000' and '20191231225959' 
    and file_namespace like 0 
    and file_is_redirect like 0
aleskva
  • 1,644
  • 2
  • 21
  • 40
ScaisEdge
  • 131,976
  • 10
  • 91
  • 107
0

Thank you guys @scaisedge and @o-jones, finally I used the core of both of your answers and removed redundant code and this is what worked for me in the end:

select count(*)
  from (select rev_file, min(rev_timestamp) rev_timestamp from revision where rev_parent_id like 0 group by rev_file) revision
  join file on rev_file = file_id
 where rev_timestamp between '20011231230000' and '20191231225959'
   and file_namespace like 0
   and not file_is_redirect;

Perhaps I could also spare some runtime by moving file_namespace and file_is_redirect conditions into another subquery in join, but maybe not, I'm not sure.

The scaisedge answer is more brief and better readable, therefore I understood it immediately and prefer it. scaisedge just had some mistakes in the code (fixed by me). The o-jones answer is more cluttered with unnecessary stuff, but it is more detailed in case any reader would need explanations and thanks to tips to improve I learned some timing issues with my code.

aleskva
  • 1,644
  • 2
  • 21
  • 40