0

I have some data that looks like this:

UserID   Category
------   --------
1         a
1         b
2         c
3         b
3         a
3         c

A I'd like to binary-encode this grouped by UserID: three different values exist in Category, so a binary encoding would be something like:

UserID    encoding
------    --------
1         "1, 1, 0"
2         "0, 0, 1"
3         "1, 1, 1"

i.e., all three values are present for UserID = 3, so the corresponding vector is "1, 1, 1".

Is there a way to do this without doing a bunch of CASE WHEN statements? There may be dozens of possible values in Category

2 Answers2

1

Cross join the distinct users to distinct categories and left join to the table.
Then use GROUP_CONCAT() window function which supports an ORDER BY clause, to collect the 0s and 1s:

WITH 
  users AS (SELECT DISTINCT UserID FROM tablename),
  categories AS (
    SELECT DISTINCT Category, DENSE_RANK() OVER (ORDER BY Category) rn
    FROM tablename
  ),
  cte AS (
    SELECT u.UserID, c.rn,
           '"' || GROUP_CONCAT(t.UserID IS NOT NULL)
                  OVER (PARTITION BY u.UserID ORDER BY c.rn) || '"' encoding
    FROM users u CROSS JOIN categories c
    LEFT JOIN tablename t
    ON t.UserID = u.UserID AND t.Category = c.Category
  )
SELECT DISTINCT userID, 
       FIRST_VALUE(encoding) OVER (PARTITION BY UserID ORDER BY rn DESC) encoding
FROM cte
ORDER BY userID

This will work for any number of categories.

See the demo.
Results:

UserID encoding
1 "1,1,0"
2 "0,0,1"
3 "1,1,1"
forpas
  • 160,666
  • 10
  • 38
  • 76
0

First create an encoding table to explicit establish order of categories in the bitmap:

create table e (Category int, Encoding int);
insert into e values ('a', 1), ('b', 2), ('c', 4);

First generate a list of users u (cross) joined with the encoding table e to get a fully populated (UserId, Category, Encoding) table. Then left join the fully populated table with the user supplied data t. The right hand side t can now be used to drive if we need to set a bit or not:

select
    u.UserId,
    '"' ||
        group_concat(case when t.UserId is null then 0 else 1 end, ', ')
    || '"' 'encoding'
from 
    (select distinct UserID from t) u
    join e
    left natural join t
group by 1
order by e.Encoding

and it gives the expected result:

1|"1, 1, 0"
2|"0, 0, 1"
3|"1, 1, 1"
Allan Wind
  • 23,068
  • 5
  • 28
  • 38
  • This makes sense, but I might have a lot of possible values in the columns: a, b, c, ...zz, so I would need a method to avoid adding them manually. – camoverride Mar 28 '21 at 23:25
  • How do specify the encoding? – Allan Wind Mar 28 '21 at 23:27
  • It's a binary encoding, so if the values A, B, C exist in the data, then the vector "A, B, C" could be mapped to "1, 1, 1" if all values are present for a given UserID, or "0, 0, 0" if none of the values are present, etc. – camoverride Mar 28 '21 at 23:30
  • Yeah, I get that, but is the encoding for zt? Do you have a table with the encoding and if so what are the values? – Allan Wind Mar 28 '21 at 23:33
  • What do you mean by zt? The encoding is technically arbitrary and might change if additional values are added or removed from the dataset. I could also arbitrarily assign an encoding like "B, A, C" – camoverride Mar 28 '21 at 23:45
  • Basically, I am asking to do this but in SQLite and encoded as a vector instead of columns: https://stackoverflow.com/questions/34010002/how-to-create-dummy-variable-columns-for-thousands-of-categories-in-google-bigqu – camoverride Mar 28 '21 at 23:55