1

Data: user_choices table

id userid choices
1 3 1,2,3
2 55 5
3 783 2,6,7
4 45 1,4
5 5 1,2,3,4,5,6,7

How can I query this table to explode the choices column filtered only to show values 1 or 2 or 3 or 4 or 7 and group it by userid as follows?

userid choice
3 1
3 2
3 3
783 2
783 7
45 1
45 4
5 1
5 2
5 3
5 4
5 7

I have tried the following, but the results include 5 and 6, which should be omitted.

select replace(substring(substring_index(uc.choices, ',', 1), CHAR_LENGTH(SUBSTRING_INDEX(uc.choices, ',', -1)), + 1), ',', '') choice,,
                uc.userid
from user_choices uc
         join (select 1 digit union all select 2 union all select 3 union all select 4 union all select 7) n on length(replace(uc.choices, ',', '')) < length(uc.choices) - n.digit
where choices regexp '[12347]'
order by uc.userid

Note: I know the regexp isn't perfect here (matches on 22, for example). The table here won't ever receive this, so I don't think stricter regex is required.

OUTPUT:

choice userid
1 3
1 5
1 5
1 5
1 5
2 783

This attempt is based on this answer to another question, but I can't quite get it working for my scenario.

Your help is much appreciated!

Zectzozda
  • 77
  • 7
  • Why is this tagged with csv and unnest (quote from the tag's description: "unnest is a function from the tidyr package that can expand the list columns")? – Nico Haase Apr 18 '23 at 11:31
  • No idea how those tags appeared. I only put in mysql8.0. I think StackOverflow tried to guess other tags? I'll see if I can remove them. – Zectzozda Apr 18 '23 at 23:08

2 Answers2

2

An alternative to the numbers table technique is to use JSON.

The trick is that, given the format of your CSV strings (numbers separated by commas), we can easily turn them to valid JSON values by just surrounding them with square brackets; eg [1,2,3] is a valid JSON array. Then, we can use handy JSON function json_table() to unnest the arrays to rows. The last step is filtering.

So:

select t.userid, j.choice
from mytable t
cross join json_table( concat('[', t.choices, ']'), '$[*]' columns (choice int path '$') ) j
where j.choice in (1, 2, 3, 4, 7)
order by t.userid, j.choice
userid choice
3 1
3 2
3 3
5 1
5 2
5 3
5 4
5 7
45 1
45 4
783 2
783 7

fiddle

GMB
  • 216,147
  • 25
  • 84
  • 135
  • 1
    I didn't know cross-joins even existed. I like this solution - it works well with the rest of the process (omitted for confidentiality reasons). +1 and marked as the solution. It works fast with my dataset (about 3000 records). – Zectzozda Apr 19 '23 at 01:58
0

I'm not clear on exactly how your sql is failing, but you seem to be not getting the approach quite right.

First, you should only join digits that match the number of choices in each row (so 1, 2, and 3 for '1,6,7'); to do this you count the number of commas, add one (since there is one more choice than commas), and compare to the digit:

on char_length(choices)-char_length(replace(choices,',',''))+1 >= digit

You seem to be trying something like that but not quite correctly.

Second, to extract the nth choice, you want to use substring_index to get the first n choices and use it again to get the last of those:

substring_index(substring_index(choices,',',digit),',',-1)

I didn't follow what your replace/substring was even trying to do but the double substring_index is much simpler.

So:

select
    substring_index(substring_index(uc.choices,',',digit),',',-1) choice,
    uc.userid
from user_choices uc
join (select 1 digit union all select 2 union all select 3 union all select 4 union all select 7) n
    on char_length(uc.choices)-char_length(replace(uc.choices,',',''))+1 >= digit
where uc.choices regexp '[12347]'
having choice in (1,2,3,4,7)
order by uc.userid

If you do have choices where the regexp isn't good enough, you can use e.g.:

where find_in_set(1,uc.choices) or find_in_set(2,uc.choices) or find_in_set(3,uc.choices) or find_in_set(4,uc.choices) or find_in_set(7,uc.choices)
ysth
  • 96,171
  • 6
  • 121
  • 214