Get list of months from one table and counts for each from another

Question

I'm trying to pull this through in Postgres 11.8:

SELECT count(distinct e.id) counter_employees,
      (SELECT count(distinct id) FROM employees 
       WHERE date_trunc('month',date_hired) = period AND company = 11
      ) hires,
FROM employees e
WHERE period IN (SELECT DISTINCT make_date(...) FROM amounts)

I cant figure out how to declare that the period the subquery should check is outside the subquery. Also, the period is not from a table but generated, so there is not a column in amounts to relate to the employees inside the subquery.

employee table:

id     date_hired    company
431    2020-01-03    11
422    2020-01-02    11
323    2020-02-03    11

amounts table:

payment_period   amount   company
202001           999      11
202002           999      11

For every payment period in amounts I want to get some data such as employee count and hires of that period:

period    count  hires
202001    5      1
202002    6      ...

What are you actually trying to do? Please provide sample data and desired results. You query can probably be optimized. — GMB, May 16 '20 at 00:03
Postgres version, table definition, sample data, desired result. And a plain English explanation of your objective. — Erwin Brandstetter, May 16 '20 at 00:09
Employees never get fired? Or are those just deleted from the table? — Erwin Brandstetter, May 16 '20 at 00:22

GMB · Answer 1 · 2020-05-16T00:16:53.140

1

One option uses aggregation and window functions. If you have hires for each month, then you can get the information directly from employees, like so:

select
    date_trunc('month', date_hired) month_hired,
    sum(count(*)) over(order by date_trunc('month', date_hired)) no_employees,
    count(*) hires
from employees
group by date_trunc('month', date_hired)

On the other hand, if there are months without hires, then you could use generate_series() to create the list of months, then bring the employees with a left join, and aggregate:

select
    d.month_hired,
    sum(count(e.id)) over(order by d.month_hired) no_employees,
    count(e.id) hires
from (
    select generate_series(
        date_trunc('month', min(date_hired)), 
        date_trunc('month', max(date_hired)), 
        interval '1' month
    ) month_hired
    from employees
) d
left join employees e
    on  e.date_hired >= d.month_hired
    and e.date_hired <  d.month_hired + interval '1' month
group by d.month_hired

edited May 16 '20 at 00:16

answered May 16 '20 at 00:02

GMB

216,147
25
84
135

I like this solution because have experience working with series to fill data holes. However, the number of employees is given by the fact an employee either: has the end_date null or the end_date is later than the payment_period. Perhaps It's late here, been working 14 hours on this, but can't find a way to put this on your code. I sincerely apologize for any inconvenience and appreciate your help. – Gabriel May 16 '20 at 02:06
@GabrielA.Zorrilla: ah, that's a different question then, since your original data had no `end date`... Maybe you want to add this condition on the `left join`: `and (e.date_ended is null or e.date_ended >= d.month_hired)`. Also maybe you need to generate the date series from the paymet table rather than from the `employees` table. If that's not enough for you to work it out, then you might need to ask a new question, providing representative sample data and desired results... – GMB May 16 '20 at 02:13

Erwin Brandstetter · Answer 2 · 2020-05-16T01:21:30.293

We could run another count for every period distilled from amounts, but that's expensive - unless there are only very few?

For more than a few, compute counts per period for the whole employees table, plus a running total. Then LEFT JOIN to it, should be pretty efficient:

SELECT mon AS period, e.mon_hired AS count, e.all_hired AS hires
FROM  (
   SELECT to_date(payment_period, 'YYYYMM') AS mon
   FROM  (SELECT DISTINCT payment_period FROM  amounts) a0
   )  a
LEFT JOIN (
   SELECT date_trunc('month', date_hired) AS mon
        , count(*) AS mon_hired
        , sum(count(*)) OVER (ORDER BY date_trunc('month', date_hired)) AS all_hired
   FROM   employees e
   GROUP  BY 1
   ) e USING (mon)
ORDER  BY 1;

This assumes we can just count all employees hired so far to get the total number of hires. (Nobody ever gets fired.)

Works just fine as long as there are rows for every period. Else we need to fill in for the gaps. We can compute a complete grid, or default to the latest row in case of a missing month like this:

WITH e AS (
   SELECT date_trunc('month', date_hired) AS mon
        , count(*) AS mon_hired
        , sum(count(*)) OVER (ORDER BY date_trunc('month', date_hired)) AS all_hired
   FROM   employees e
   GROUP  BY 1
   )
SELECT mon AS period, ae.*
FROM  (
   SELECT to_date(payment_period, 'YYYYMM') AS mon
   FROM  (SELECT DISTINCT payment_period FROM  amounts) a0
   )  a
LEFT JOIN LATERAL (
   SELECT CASE WHEN e.mon = a.mon THEN e.mon_hired ELSE 0 END AS count  -- ①
        , e.all_hired AS hires
   FROM   e
   WHERE  e.mon <= a.mon
   ORDER  BY e.mon DESC
   LIMIT  1
   ) ae USING (mon)
ORDER  BY 1;

① If nothing changed for the month, we need to fall back to the last month with data. Take the total count from there, but the monthly count is 0.

We can run a window function over an aggregate on the same query level. See:

Group and count events per time intervals, plus running total

PostgreSQL: running count of rows for a query 'by minute'

Aside: don't omit the AS keyword for a column alias. See:

Date column arithmetic in PostgreSQL query

Get list of months from one table and counts for each from another

2 Answers2