0

I have the following database:

This is just a subset of a larger data set that would have been too big to type.

ID CODE IDENTIFIERS COLORS SHAPES NUMB
101 24 N RED CIRCLE 174
102 32 N BLUE SQUARE 155
102 32 N PURPLE STAR 223
103 13 Y RED SQUARE 143
103 13 Y GREEN CIRCLE 123
103 13 Y BLUE SQUARE 142

I need the following result:

ID CODE IDENTIFIERS COLORS1 COLORS2 COLORS3 SHAPES1 SHAPES2 SHAPES3 NUMB1 NUMB2 NUMB3
101 24 N RED CIRCLE 174
102 32 N BLUE PURPLE SQUARE STAR 155 223
103 13 Y RED GREEN BLUE SQUARE CIRCLE SQUARE 143 123 142

I tried:

    SELECT ID, 
     CODE, 
     IDENTIFIERS, 
     group_concat(COLORS) AS COLORSS, 
     group_concat(SHAPES) AS SHAPESS, 
     group_concat(NUMB) AS NUMBS 
     FROM ( 
     SELECT a.ID, 
     a.CODE, 
     a.IDENTIFIERS, 
     a.COLORS, 
     a.SHAPES, 
     a.NUMB 
     FROM database AS a 
    ) AS sub 
    GROUP BY ID, CODE, IDENTIFIERS

I got:

ID CODE IDENTIFIERS COLORS SHAPES NUMB
101 24 N RED CIRCLE 174
102 32 N BLUE,PURPLE SQUARE,STAR 155,223
103 13 Y RED,GREEN,BLUE SQUARE,CIRCLE,SQUARE 143,123,142

Which is close, but not exactly what I was going for. Any help would be appreciated.

Ken White
  • 123,280
  • 14
  • 225
  • 444
anni
  • 83
  • 5
  • Will there always be a maximum of three input rows per id/code/identifiers combination? – MatBailie Mar 23 '22 at 22:46
  • What you're asking for is like a [pivot](https://modern-sql.com/use-case/pivot) but because it's an for unordered lists of unknown sizes it's difficult to know how many columns there need to be, nor what goes into what column. You could have, for example, columns for `red`, `blue`, `green`, and `purple` with `true` or `false` as values. But just color1, color2, color3 is difficult. What are you going to do with this result? Maybe there's a better way. – Schwern Mar 23 '22 at 22:48
  • MatBailie - No, there can be any number. I think my max is around 20 for colors, 30 for shapes, and maybe 25 for numbers. A unique ID could have 100 rows associated. This is a much bigger data set, it was just too much to type and I don't have enough reputation to post pictures. Also, the file for my actual data has thousands of rows. – anni Mar 23 '22 at 22:58
  • SQL is both strongly and statically typed. That includes the number of columns in a dataset. If you need the number of columns to vary according to the source data, you're heading in a direction SQL intentionally doesn't support. In SQL the correct structure for such dynamically varying data is the one you're starting with. Which begs the question, why do you think you need to pivot the data? (this is a significant anti-pattern.) – MatBailie Mar 23 '22 at 23:02
  • Schwern - I was using this resource: (https://stackoverflow.com/questions/7867397/sqlite-merging-rows-into-single-row-if-they-share-a-column) but I had trouble with the JOIN command because there were only 2 columns being affected in that example. The problem with using the true\false for the colors is that the colors are just a stand for a number representing a certain shade. There would probably be 100+ colors columns for each row if so. I'm going to use this result for further data analysis. – anni Mar 23 '22 at 23:06
  • MatBailie - I want to have all the data corresponding to a particular ID so I can see how these ID's differ from one another. – anni Mar 23 '22 at 23:08
  • You don't need everything on one row to analyse that. Rather than asking about pivoting, I suggest that you ask how to accomplish a specific analysis without needing to pivot it first. What you're describing so far is a ***very bad idea*** and will only make subsequent SQL harder, not easier. – MatBailie Mar 23 '22 at 23:11
  • I was thinking I could use my current result (second table) and, for each column being affected, get a max count of the comma separated entries. Then I could append that number of columns to the table and fill them with the comma separated values. I just have no clue how to do this. – anni Mar 23 '22 at 23:11
  • SQL doesn't support dynamically varying numbers of columns. You can't think of a way to add the number of columns you need, because SQL ***intentionally*** doesn't work like that. You're trying to hammer a square peg in to a round hole, using a screw driver. – MatBailie Mar 23 '22 at 23:13
  • MatBaile - I'm new to SQL, can you help me with where this could go wrong? I'm hoping to switch to R after I get my database. I would have done this in R but my file is too big (~5gb). I'd just like to have all my data sorted and organized and take from it as I need it. I was recommended to use SQL because of the size. – anni Mar 23 '22 at 23:16
  • MatBaile - The number of affected columns will not be larger than the number of rows of a unique ID. I would like to get this precise but could live with it being larger. Would it be possible to find this number and use it for all affected columns? – anni Mar 23 '22 at 23:18
  • Learn how to use the data in its current form. If there is an analysis you need to do but can't work out how, ask that question. But don't try to pivot the data. SQL doesn't have the constructs to do what you want, because what you want makes everything subsequent harder. SQL is designed to use the structure you have already. Learn to use the language the way it's designed. https://xyproblem.info/ – MatBailie Mar 23 '22 at 23:21
  • MatBailie - Thank you for the help. You went out of your way to help me and I appreciate that. Please excuse my ignorance on the issue, I'm new to SQL (probably obvious). – anni Mar 23 '22 at 23:29

1 Answers1

0

First you need a way to identify which columns you want to put a row into. For that you can use ROW_NUMBER()

Then you can use a combination of MAX(CASE) to pivot the data.

WITH
  sorted AS
(
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id, code, identifiers)   AS row_id
  FROM
    yourTable
)
SELECT
  id,
  code,
  identifiers,

  MAX(CASE WHEN row_id = 1 THEN colors END)   AS colors_1,
  MAX(CASE WHEN row_id = 1 THEN shapes END)   AS shapes_1,
  MAX(CASE WHEN row_id = 1 THEN numb   END)   AS numb_1,

  MAX(CASE WHEN row_id = 2 THEN colors END)   AS colors_2,
  MAX(CASE WHEN row_id = 2 THEN shapes END)   AS shapes_2,
  MAX(CASE WHEN row_id = 2 THEN numb   END)   AS numb_2,

  MAX(CASE WHEN row_id = 3 THEN colors END)   AS colors_3,
  MAX(CASE WHEN row_id = 3 THEN shapes END)   AS shapes_3,
  MAX(CASE WHEN row_id = 3 THEN numb   END)   AS numb_3
FROM
  sorted
GROUP BY
  id,
  code,
  identifiers
MatBailie
  • 83,401
  • 18
  • 103
  • 137