1

I have long format data on businesses, with a row for each occurrence of a move to a different location, keyed on business id -- there can be several move events for any one business establishment.

I wish to reshape to a wide format, which is typically cross-tab territory per the tablefunc module.

+-------------+-----------+---------+---------+
| business_id | year_move |  long   |   lat   |
+-------------+-----------+---------+---------+
|   001013580 |      1991 | 71.0557 | 42.3588 |
|   001015924 |      1993 | 71.0728 | 42.3504 |
|   001015924 |      1996 | -122.28 | 37.654  |
|   001020684 |      1992 | 84.3381 | 33.5775 |
+-------------+-----------+---------+---------+

Then I transform like so:

SELECT longbyyear.*
FROM crosstab($$
    SELECT 
    business_id, 
    year_move, 
    max(longitude::float)
    from business_moves
    where year_move::int between 1991 and 2010 
    group by business_id, year_move
    order by business_id, year_move;
    $$
) 
AS longbyyear(biz_id character varying, "long91" float,"long92" float,"long93" float,"long94" float,"long95" float,"long96" float,"long97" float, "long98" float, "long99" float,"long00" float,"long01" float,
"long02" float,"long03" float,"long04" float,"long05" float, 
"long06" float, "long07" float, "long08" float, "long09" float, "long10" float);

And it --mostly-- gets me to the desired output.

+---------+----------+----------+----------+--------+---+--------+--------+--------+
| biz_id  |  long91  |  long92  |  long93  | long94 | … | long08 | long09 | long10 |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| 1000223 | 121.3784 | 121.3063 | 121.3549 | 82.821 | … |        |        |        |
| 1000678 | 118.224  |          |          |        | … |        |        |        |
| 1002158 | 121.98   |          |          |        | … |        |        |        |
| 1004092 | 71.2384  |          |          |        | … |        |        |        |
| 1007801 | 118.0312 |          |          |        | … |        |        |        |
| 1007855 | 71.1769  |          |          |        | … |        |        |        |
| 1008697 | 71.0394  | 71.0358  |          |        | … |        |        |        |
| 1008986 | 71.1013  |          |          |        | … |        |        |        |
| 1009617 | 119.9965 |          |          |        | … |        |        |        |
+---------+----------+----------+----------+--------+---+--------+--------+--------+

The only snag is that I would ideally have populated values for each year and not just have values in move years. Thus all fields would be populated, with a value for each year, with the most recent address carrying over to the next year. I could hack this with manual updates if each is blank, use the previous column, I just wondered if there was a clever way to do it either with the crosstab() function, or some other way, possibly coupled with a custom function.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
ako
  • 3,569
  • 4
  • 27
  • 38
  • 1
    There is no automatic way to do do this afaik because you want information that is not stored in the database. Effectively, you want to create _state_ information (where is my business located in a certain year?) from _event_ data (which year did my business move to a new longitude?). That can be done, but it will be ugly. Can you elaborate on what exactly you want to achieve? And, if you want to get a real workable answer, please post the table structure. – Patrick May 08 '14 at 04:33
  • @Patrick, the table structure is shown up top, the first three rows anyway--although with slightly different column names. But I am essentially after a lookup table for each business so that given a year, I can fetch the corresponding location (lat, long pair, either as 2d array, or in separate tables) from the table. – ako May 08 '14 at 04:54
  • So am I correct in assuming then that you want to answer a question like "where were my businesses located in year x?"? – Patrick May 08 '14 at 05:18
  • indeed, that sums it up--I realize storing the data for each year has tons of redundancy since most businesses don't move that much, yet the entire database is a set of wide format flat files keyed on business id providing various variables by year for each business (employment, industry code, etc). – ako May 08 '14 at 05:20
  • I assume you have a date for each `business_move` and not just the year? I also assume, you do not actually want the maximum `latitude` per year? And I finally assume you do not just want the `latitude`, but also the `longitude` to go with it? – Erwin Brandstetter May 11 '14 at 02:40
  • And can we further assume there is a `business` table with a complete list of unique businesses? – Erwin Brandstetter May 11 '14 at 02:57

2 Answers2

2

In order to get the current location for each business_id for any given year you need two things:

  1. A parameterized query to select the year, implemented as a SQL language function.
  2. A dirty trick to aggregate on year, group by the business_id, and leave the coordinates untouched. That is done by a sub-query in a CTE.

The function then looks like this:

CREATE FUNCTION business_location_in_year_x (int) RETURNS SETOF business_moves AS $$
  WITH last_move AS (
    SELECT business_id, MAX(year_move) AS yr
    FROM business_moves
    WHERE year_move <= $1
    GROUP BY business_id)
  SELECT lm.business_id, $1::int AS yr, longitude, latitude
  FROM business_moves bm, last_move lm
  WHERE bm.business_id = lm.business_id
  AND bm.year_move = lm.yr;
$$ LANGUAGE sql;

The sub-query selects only the most recent moves for every business location. The main query then adds the longitude and latitude columns and put the requested year in the returned table, rather than the year in which the most recent move took place. One caveat: you need to have a record in this table that gives the establishment and initial location of each business_id or it will not show up until after it has moved somewhere else.

Call this function with the usual SELECT * FROM business_location_in_year_x(1997). See also the SQL fiddle.

If you really need a crosstab then you can tweak this code around to give you the business location for a range of years and then feed that into the crosstab() function.

Patrick
  • 29,357
  • 6
  • 62
  • 90
  • This is perfect. It may be a big performance hit to run this query on millions of establishments (hence the initial storing the data for each year) yet this is far more parsimonious. – ako May 08 '14 at 18:59
2

I assume you have actual dates for each business move, so we can make meaningful picks per year:

CREATE TEMP TABLE business_moves (
  business_id int,  -- why would you use inefficient varchar here?
  move_date date,
  longitude float,
  latitude float);

Building on this, a more meaningful test case:

INSERT INTO business_moves VALUES 
  (001013580, '1991-1-1', 71.0557, 42.3588),
  (001015924, '1993-1-1', 71.0728, 42.3504),
  (001015924, '1993-3-3', 73.0728, 43.3504),  -- 2nd move this year
  (001015924, '1996-1-1', -122.28, 37.654),
  (001020684, '1992-1-1', 84.3381, 33.5775);

Complete, very fast solution

SELECT *
FROM crosstab($$
   SELECT business_id, year
        , first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year) AS x
   FROM  (
      SELECT *
           , count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
      FROM  (SELECT DISTINCT business_id FROM business_moves) b
      CROSS  JOIN generate_series(1991, 2010) year
      LEFT   JOIN (
         SELECT DISTINCT ON (1,2)
                business_id
              , EXTRACT('year' FROM move_date)::int AS year
              , point(longitude, latitude) AS x
         FROM   business_moves
         WHERE  move_date >= '1991-1-1'
         AND    move_date <  '2011-1-1'
         ORDER  BY 1,2, move_date DESC
         ) bm USING (business_id, year)
      ) sub
   $$
   ,'VALUES
    (1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
   ,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
    ) AS t(biz_id int
         , x91 point, x92 point, x93 point, x94 point, x95 point
         , x96 point, x97 point, x98 point, x99 point, x00 point
         , x01 point, x02 point, x03 point, x04 point, x05 point
         , x06 point, x07 point, x08 point, x09 point, x10 point);

Result:

 biz_id  |        x91        |        x92        |        x93        |        x94        |        x95        |        x96        |        x97        ...
---------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------
 1013580 | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) | (71.0557,42.3588) ...
 1015924 |                   |                   | (73.0728,43.3504) | (73.0728,43.3504) | (73.0728,43.3504) | (-122.28,37.654)  | (-122.28,37.654)  ...
 1020684 |                   | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) | (84.3381,33.5775) ...

Step-by-step

Step 1

Repair what you had:

SELECT *
FROM crosstab($$
   SELECT DISTINCT ON (1,2)
          business_id
        , EXTRACT('year' FROM move_date) AS year
        , point(longitude, latitude) AS long_lat
   FROM   business_moves
   WHERE  move_date >= '1991-1-1'
   AND    move_date <  '2011-1-1'
   ORDER  BY 1,2, move_date DESC
   $$
   ,'VALUES
    (1991),(1992),(1993),(1994),(1995),(1996),(1997),(1998),(1999),(2000)
   ,(2001),(2002),(2003),(2004),(2005),(2006),(2007),(2008),(2009),(2010)'
   ) AS t(biz_id int
        , x91 point, x92 point, x93 point, x94 point, x95 point
        , x96 point, x97 point, x98 point, x99 point, x00 point
        , x01 point, x02 point, x03 point, x04 point, x05 point
        , x06 point, x07 point, x08 point, x09 point, x10 point);
  • You want lat & lon to make it meaningful, so form a point from both. Alternatively, you could just concatenate a text representation.

  • You may want even more data. Use DISTINCT ON instead of max() to get the latest (complete) row per year. Details here:
    Select first row in each GROUP BY group?

  • As long as there can be missing values for the whole grid, you must use the crosstab() variant with two parameters. Detailed explanation here:
    PostgreSQL Crosstab Query

  • Adapted the function to work with move_date date instead of year_move.

Step 2

To address your request:

I would ideally have populated values for each year

Build a full grid of values (one cell per business and year) with a CROSS JOIN of businesses and years:

SELECT *
FROM  (SELECT DISTINCT business_id FROM business_moves) b
CROSS  JOIN generate_series(1991, 2010) year
LEFT   JOIN (
   SELECT DISTINCT ON (1,2)
          business_id
        , EXTRACT('year' FROM move_date)::int AS year
        , point(longitude, latitude) AS x
   FROM   business_moves
   WHERE  move_date >= '1991-1-1'
   AND    move_date <  '2011-1-1'
   ORDER  BY 1,2, move_date DESC
   ) bm USING (business_id, year)
  • The set of years comes from a generate_series() call.

  • Distinct businesses from a separate SELECT. You might have a table of businesses, you could use instead (and cheaper)? This would also account for businesses that never moved.

  • LEFT JOIN to actual business moves per year to arrive at a full grid of values.

Step 3

Fill in defaults:

with the most recent address carrying over to the next year.

SELECT business_id, year
     , COALESCE(first_value(x) OVER (PARTITION BY business_id, grp ORDER BY year)
               ,'(0,0)') AS x
FROM  (
   SELECT *, count(x) OVER (PARTITION BY business_id ORDER BY year) AS grp
   FROM  (SELECT DISTINCT business_id FROM business_moves) b
   CROSS  JOIN generate_series(1991, 2010) year
   LEFT   JOIN (
      SELECT DISTINCT ON (1,2)
             business_id
           , EXTRACT('year' FROM move_date)::int AS year
           , point(longitude, latitude) AS x
      FROM   business_moves
      WHERE  move_date >= '1991-1-1'
      AND    move_date <  '2011-1-1'
      ORDER  BY 1,2, move_date DESC
      ) bm USING (business_id, year)
   ) sub;
  • In the subquery sub build on the query from step 2, form groups (grp) of cells that share the same location.

    For this purpose utilize the well known aggregate function count() as window aggregate function. NULL values don't count, so the value increases with every actual move, thereby forming groups of cells that share the same location.

  • In the outer query pick the first value per group for each row in the same group using the window function first_value(). Voilá.

  • To top it off, optionally(!) wrap that in COALESCE to fill the remaining cells with unknown location (no move yet) with (0,0). If you do that, there are no remaining NULL values, and you can use the simpler form of crosstab(). That's a matter of taste.

SQL Fiddle with base queries. crosstab() is not currently installed on SQL Fiddle.

Step 4

Use the query from step 3 in an updated crosstab() call.
All in all, this should be as fast as it gets. Indexes may help some more.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • What do I have to do to install crosstab() on SQL Fiddle? Thanks Erwin – Jake Feasel May 11 '14 at 04:25
  • +4 if I could, for each step and all the detail which is now a great reference for me and hopefully others. – ako May 11 '14 at 05:41
  • 1
    Great post, as usual, and +1 for that, but a few remarks nonetheless: (1) Should you be making all the assumptions that you are (business_id as int, date of move instead of year)? The OP is obviously working with an existing database and the details provided on structure are probably what he has to work with. (2) I'd base the `generate_series()` on either the known year of first established business or a search thereof in the proper table (low end), and the current year (high end) in order to make this query New-Year-proof - this is unlikely to be a one-off. – Patrick May 11 '14 at 06:00
  • 1
    @Patrick: Assumptions are just what they are: assumptions. I may be wrong, but it's how it `should` be. I didn't have the patience to wait for the OP to clarify. The answer works for a `varchar` id just as well. Only change `biz_id int` -> `biz_id varchar. The bigger change is `move_date`. But easy to adapt as well. The answer is for the general public, not for the OP alone. But he seems to like it either way. – Erwin Brandstetter May 11 '14 at 12:49
  • 1
    @ErwinBrandstetter: Yes, true, but an unclear question is not going to invite the general public to read the answers. A `crosstab()` on longitude alone is of course blatantly wrong in concept, but even on a point it is very hard to believe that the OP **really** wants that. That I did enquire about through a comment and his clarifications led to my answer. (Still, I think your answer is great. Both in demonstrating concepts (e.g. use of `generate_series()` in prepping data for `crosstab()`) and in the immaculate presentation. Keep it up.) – Patrick May 11 '14 at 13:01
  • @JakeFeasel: `CREATE EXTENSION tablefunc;` [Details in this related answer](http://stackoverflow.com/questions/3002499/postgresql-crosstab-query/11751905#11751905). Would make your great site even more useful. :) Also have a peek at the [comments between Jack and me on dba.SE on this matter](http://dba.stackexchange.com/questions/55164/how-to-query-efficiently-from-postgres-to-select-special-words/55227#comment99284_55227). – Erwin Brandstetter May 11 '14 at 13:12
  • 1
    @Patrick: BTW, good idea to base the first year on the first record (or the latest before the time frame). A [`LATERAL JOIN`](http://stackoverflow.com/questions/19127630/insert-multiple-rows-in-one-table-based-on-number-in-another-table/19127999#19127999) comes to mind for that. The return type of `crosstab()` isn't flexible. But with the two-parameter form we can just "truncate" the resulting rows to fit a frame starting with a later year - and no NULL values remain. – Erwin Brandstetter May 11 '14 at 13:24