5

I have a database ordered like so:

    ID     |    Subject     |      Value
 ---------------------------------------------
     1            Subj1            Val1
     1            Subj2            Val2
     2            Subj1            Val3
     2            Subj5            Val4

And so on. IDs number in the thousands, and the subjects in tens of thousands.

I want to find the transpose of this matrix. How do I do that?

If the subjects were a small, static set, then I could use CASE statements as in this solution: SQLITE - transposing rows into columns properly, but in my situation, subjects are a large, dynamic set, so CASE won't work unless I'm building the SQL dynamically in some application outside the database. In my case, there is no "application"; as I mention in a comment below, I'm looking for a pure-SQL solution.

Here's the kicker: I'm using SQLite, and it's missing the PIVOT statement that would make this easier. I don't know if there's still a way to do this, and haven't been taught much past types of joins. For a smaller database I would have tried a series of left joins, but since I have so many column values in my result I don't know how.

How do I convert to this form?

      ID      |     Subj1       |        Subj2      |    Subj3     |       etc.
   --------------------------------------------------------------------
      1              Val1                  Val2            0        
      2              Val3                   0              0 
stevegt
  • 1,644
  • 20
  • 26
tvishwa107
  • 301
  • 2
  • 14
  • Why the -1? What was wrong with the question? – tvishwa107 Mar 27 '17 at 04:27
  • This is a valid question -- without PIVOT, sqlite makes this hard when the number of resulting columns is dynamic. For how this works in another db engine, see https://stackoverflow.com/a/27532568/1264797 – stevegt Jun 30 '17 at 20:39

2 Answers2

3

SQLite (at least as of 3.31) out-of-the-box doesn't have PIVOT, CROSSTAB or any other similar functionality. More generally it doesn't have a means to produce columns dynamically (and the resort is manual creation of CASE expressions to define columns). But there's pivot_vtab virtual table extension which implements a pivot table functionality. I guess it's sort of proof-of-concept implementation, but still can be useful depending on the circumstances.

Here's the long description from the repo (redacted so it fits a snippet):

CREATE VIRTUAL TABLE pivot USING pivot_vtab(
  --
  -- Pivot table row key query. Defines first column of the pivot table.
  -- [...]
  -- The first column name in this query will become the name of the pivot table key 
  -- column. The value of the [...] key column is provided to the pivot query as ?1.
  --
 (SELECT id r_id -- pivot table key
    FROM r), 
  --
  -- Pivot table column definition query. Defines second+ column(s) of the pivot
  -- table. This query should return pivot table column key/name pairs.
  -- [...]
  -- The first column of this query is the pivot column key, and is provided
  -- to the pivot query as ?2. The second column of this query is used to name the  
  -- pivot table columns. This column is required to return unique values.
  -- 
  -- Changes to this query can only be propagated by dropping and 
  -- re-creating the virtual table
  --
 (SELECT id c_id,   -- pivot column key - can be referenced in pivot query as ?2
         name       -- pivot column name
    FROM c),    
  --
  -- Pivot query. This query should define a single value in the pivot table when
  -- filtered by the pivot table row key (1?) and a column key (2?)
  --
 (SELECT val FROM x WHERE r_id = ?1 AND c_id = ?2)
);

While I was playing with it I had a couple of issues:

  • with invalid CREATE VIRTUAL TABLE ... it can crash with segmentation fault (e.g. on missing table)
  • the only error message you get out of it is vtable constructor failed: to_be_table
  • it expects separate table with unique pivot column id-name pairs (were character case makes a difference), and wouldn't create a pivot otherwise

I'll use this GitHub Gist with a CSV with salary data.

Here I prepare the data. Import the CSV into :memory: SQLite3 database, load the extension and clean the data (I load the records of software engineers at FAANG into a temporary table).

.mode csv 
.import salaries.csv salary_import

.load ./pivot_vtab
.headers on
.mode column

CREATE TABLE temp.salary AS
WITH clean AS (
  SELECT 
    Employer employer, 
    lower(trim("Job Title")) title, 
    replace("Annual Base Pay", ',', '') base_pay
  FROM salary_import
)
SELECT employer, title, round(avg(base_pay)) avg_base_pay, COUNT(*) count
FROM clean
WHERE
  employer IN ('Facebook', 'Amazon', 'Apple', 'Netflix', 'Google')
  AND title LIKE '%software engineer%'
GROUP BY 1, 2;

And here it's actual pivot creation (it's a temporary table in my example, but it can be persistent just as well).

CREATE VIRTUAL TABLE temp.pivot USING pivot_vtab(
  (SELECT employer FROM temp.salary GROUP BY employer),
  (SELECT title, title FROM temp.salary GROUP BY title),   
  (
    SELECT avg_base_pay
    FROM temp.salary
    WHERE employer = ?1 AND title = ?2
  )
);
SELECT * FROM temp.pivot;

To run it, save these files in a directory:

  • salary.sql (the two snippets above combined)
  • salaries.csv (the CSV from the Gist)
  • pivot_vtab.c (the code of the extension)

Then run it like this (chmod o+w . if you run user-namespaced Docker):

$ docker run --rm -it -v $PWD:/tmp/build -w /tmp/build ubuntu:focal
# apt update
# apt install -y --no-install-recommends gcc sqlite3 libsqlite3-dev
# sqlite3 --version
3.31.1 2020-01-27 19:55:54 3bfa9cc...
# gcc -g -O3 -fPIC -shared pivot_vtab.c -o pivot_vtab.so
# cat salary.sql | sqlite3 --bail
employer    data engineer/software engineer  senior software engineer  software eng...
----------  -------------------------------  ------------------------  ------------...
Amazon                                                                 130000.0    ...
Apple       145000.0                         155500.0                  122667.0    ...
Facebook                                     182000.0                  166690.0    ...
Google                                       158267.0                  131465.0    ...
Netflix                                      340000.0                              ...
saaj
  • 23,253
  • 3
  • 104
  • 105
  • Note that as of commit [49d96600](https://github.com/jakethaw/pivot_vtab/commit/49d96600) the 3 issues I mentioned above have been address by the author of `pivot_vtab`. – saaj Jun 14 '21 at 20:52
1

You can use conditional aggregation for pivoting your data:

select id,
    max(case when Subject = 'Subj1' then Value end) as Subj1,
    max(case when Subject = 'Subj2' then Value end) as Subj2,
    max(case when Subject = 'Subj3' then Value end) as Subj3,
    . . . 
from your_table
group by id;

Note here that if there are multiple rows with same id and same Subject, only one row with max value will be returned for the id.

Gurwinder Singh
  • 38,557
  • 6
  • 51
  • 76
  • I understand the idea: but how do I write cases when I have 20000-ish values for subject? Is there some query that lets me aggregate? – tvishwa107 Mar 27 '17 at 04:00
  • @tvishwa107 - It's odd to have *20000* subjects. Anyhow, If you have too many subjects, then consider handling this in your application code. – Gurwinder Singh Mar 27 '17 at 04:02
  • Can't - this is a database question, there's no application involved. – tvishwa107 Mar 27 '17 at 04:08