3

I'm using postgres 9.1 with tablefunc:crosstab

I have a table with the following structure:

CREATE TABLE marketdata.instrument_data
(
  dt date NOT NULL,
  instrument text NOT NULL,
  field text NOT NULL,
  value numeric,
  CONSTRAINT instrument_data_pk PRIMARY KEY (dt , instrument , field )
)

This is populated by a script that fetches data daily. So it might look like so:

| dt         | instrument        | field     | value |
|------------+-------------------+-----------+-------|
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1     |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT  | 2     |

I then use the following crosstab query to pivot the table:

select dt, instrument, vol, oi 
FROM crosstab($$
    select dt, instrument, field, value 
    from marketdata.instrument_data 
    where field = 'PX_VOLUME' or field = 'OPEN_INT'
    $$::text, $$VALUES ('PX_VOLUME'),('OPEN_INT')$$::text
) vol(dt date, instrument text, vol numeric, oi numeric);

Running this I get the result:

| dt         | instrument        | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1   | 2  |

The problem: When running this with lot of real data in the table, I noticed that for some fields the function was splitting the result over two rows:

| dt         | instrument        | vol | oi |
|------------+-------------------+-----+----|
| 2014-05-23 | SGX.MiniJGB.2014U | 1   |    |
| 2014-05-23 | SGX.MiniJGB.2014U |     | 2  |

I checked that the dt and instrument fields were identical and produced a work-around by grouping the ouput of the crosstab.

Analysis I've discovered that it's the presence of one other entry in the input table that causes the output to be split over 2 rows. If I have the input as follows:

| dt         | instrument        | field     | value |
|------------+-------------------+-----------+-------|
| 2014-04-23 | EUX.Bund.2014M    | PX_VOLUME | 0     |
| 2014-05-23 | SGX.MiniJGB.2014U | PX_VOLUME | 1     |
| 2014-05-23 | SGX.MiniJGB.2014U | OPEN_INT  | 2     |

I get:

| dt         | instrument        | vol | oi |
|------------+-------------------+-----+----|
| 2014-04-23 | EUX.Bund.2014M    | 0   |    |
| 2014-05-23 | SGX.MiniJGB.2014U | 1   |    |
| 2014-05-23 | SGX.MiniJGB.2014U |     | 2  |

Where it gets really weird...

If I recreate the above input table manually then the output is as we would expect, combined into a single row.

If I run:

update marketdata.instrument_data 
set instrument = instrument 
where instrument = 'EUX.Bund.2014M'

Then again, the output is as we would expect, which is surprising as all I've done is set the instrument field to itself.

So I can only conclude that there is some hidden character/encoding issue in that Bund entry that is breaking crosstab.

Are there any suggestions as to how I can determine what it is about that entry that breaks crosstab?

Edit: I ran the following on the raw table to try and see any hidden characters:

select instrument, encode(instrument::bytea, 'escape')  
from marketdata.bloomberg_future_data_temp 
where instrument = 'EUX.Bund.2014M';

And got:

| instrument     | encode         |
|----------------+----------------|
| EUX.Bund.2014M | EUX.Bund.2014M |
Jim
  • 1,040
  • 1
  • 10
  • 16
  • Did you try to compare `octet_length(string)` on both the script and manually entered strings? – Clodoaldo Neto Jul 18 '14 at 10:23
  • Just tried it: 14 for both. – Jim Jul 18 '14 at 10:34
  • Erwin: Good points thanks. I had updated the query as a result of Clodoaldo's comment but not the results. I've also cleaned up the table definition - the problem still persists. – Jim Jul 18 '14 at 15:27
  • The remark about `text` and `varchar` was just an aside. Consider my updated answer for a clean solution. BTW, your problems are reproducible just fine, consider removing that disclaimer. – Erwin Brandstetter Jul 18 '14 at 15:33

2 Answers2

3

Two problems.

1. ORDER BY is required.

The manual:

In practice the SQL query should always specify ORDER BY 1,2 to ensure that the input rows are properly ordered, that is, values with the same row_name are brought together and correctly ordered within the row.

With the one-parameter form of crosstab(), ORDER BY 1,2 would be necessary.

2. One column with distinct values per group.

The manual:

crosstab(text source_sql, text category_sql)
source_sql is a SQL statement that produces the source set of data.
...
This statement must return one row_name column, one category column, and one value column. It may also have one or more "extra" columns. The row_name column must be first. The category and value columns must be the last two columns, in that order. Any columns between row_name and category are treated as "extra". The "extra" columns are expected to be the same for all rows with the same row_name value.

Bold emphasis mine. One column. It seems like you want to form groups over two columns, which does not work as you desire.

Related answer:

The solution depends on what you actually want to achieve. It's not in your question, you silently assumed the function would do what you hope for.

Solution

I guess you want to group on both leading columns: (dt, instrument). You could play tricks with concatenating or arrays, but that would be slow and / or unreliable. I suggest a cleaner and faster approach with a window function rank() or dense_rank() to produce a single-column unique value per desired group. This is very cheap, because ordering rows is the main cost and the order of the frame is identical to the required order anyway. You can remove the added column in the outer query if desired:

SELECT dt, instrument, vol, oi
FROM   crosstab(
   $$SELECT dense_rank() OVER (ORDER BY dt, instrument) AS rnk
          , dt, instrument, field, value 
     FROM   marketdata.instrument_data 
     WHERE  field IN ('PX_VOLUME', 'OPEN_INT')
     ORDER  BY 1$$
 , $$VALUES ('PX_VOLUME'),('OPEN_INT')$$
   ) vol(rnk int, dt date, instrument text, vol numeric, oi numeric);

More details:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
0

You could run a query that replaces irregular characters with an asterisk:

select  regexp_replace(instrument, '[^a-zA-Z0-9]', '*', 'g')
from    marketdata.instrument_data
where   instrument = 'EUX.Bund.2014M'

Perhaps the instrument = instrument assignment discards trailing whitespace. That would also explain why where instrument = 'EUX.Bund.2014M' matches two values that crosstab sees as different.

Andomar
  • 232,371
  • 49
  • 380
  • 404
  • 1
    There's no trailing whitespace and the output of query is "EUX*Bund*2014M". As you say these things whouldn't be affected by instrument = instrument fix. – Jim Jul 18 '14 at 10:37
  • I'm not sure what you mean by "instrument = 'EUX.Bund.2014M' matches two values" though. there is only one entry for EUX.Bund in my cut-down text DB. – Jim Jul 18 '14 at 10:39
  • Your crosstab shows two rows with the same value (`2014-05-23|SGX.MiniJGB.2014U`). Perhaps try to find a query that matches both, like `select * from instrument_data where instrument like '%2014M%'` – Andomar Jul 18 '14 at 10:47