
I have a column called "Description" in a table called "Food" which includes multiple food item names delimited by commas, such as chicken, soup, bread, coke.

How can I extract each item from the column and create multiple rows? E.g. currently it's like {FoodID, FoodName, Description} ==> {123, Meal, "chicken, soup, bread, coke"}

and what I need is

{FoodID, FoodName, Description} ==> {123, Meal, chicken}, {123, Meal, soup}, {123, Meal, bread} etc.

In Redshift, I first did a split of "description" column as

select FoodID, FoodName, Description,
SPLIT_PART(Description, ',', 1) AS Item1,
SPLIT_PART(Description, ',', 2) AS Item2,
SPLIT_PART(Description, ',', 3) AS Item3, ... till Item10
FROM Food

Consider that a maximum of 10 items can be there, hence Item10. What's the best method to convert these columns Item1 to Item10 into rows? I tried UNION ALL, but it's taking a long time considering the huge load of data.
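For reference, the row expansion being asked for can be sketched outside the database. This is a minimal Python illustration of the desired output, not Redshift SQL; the sample row is taken from the question:

```python
# One row with a comma-delimited Description becomes one row per item.
rows = [(123, "Meal", "chicken, soup, bread, coke")]

expanded = [
    (food_id, food_name, item.strip())
    for food_id, food_name, description in rows
    for item in description.split(",")
    if item.strip()  # drop empty fragments from trailing commas
]
# expanded == [(123, 'Meal', 'chicken'), (123, 'Meal', 'soup'),
#              (123, 'Meal', 'bread'), (123, 'Meal', 'coke')]
```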

stack_pointer is EXTINCT
  • `SPLIT_PART` with a union seems like a viable option. Really, you should never have even imported such unnormalized data into your database. You may have to bite the bullet until you can normalize it properly. – Tim Biegeleisen Jun 07 '18 at 05:49

1 Answer


Your question is answered here in detail, specifically for Redshift. You just need to map your queries to the example queries provided there. Your query will be something like the following.

select (row_number() over (order by true))::int as n
into numbers
from food
limit 100;

This will create the numbers table.

Your query would become:

select foodId, foodName, split_part(Description, ',', n) as descriptions
from food
cross join numbers
where split_part(Description, ',', n) is not null
  and split_part(Description, ',', n) != '';
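The cross-join technique above can be tried outside Redshift. This sketch uses SQLite with `split_part` emulated as a Python function (an illustration under that assumption, not Redshift code; table and column names mirror the question):

```python
import sqlite3

# In-memory SQLite stand-in; Redshift's split_part is registered
# as a user-defined function so the same query shape runs here.
conn = sqlite3.connect(":memory:")
conn.create_function(
    "split_part", 3,
    lambda s, delim, n: (s.split(delim) + [""] * n)[n - 1].strip()
)
conn.executescript("""
    CREATE TABLE food (FoodID INT, FoodName TEXT, Description TEXT);
    INSERT INTO food VALUES (123, 'Meal', 'chicken, soup, bread, coke');
    CREATE TABLE numbers (n INT);
    INSERT INTO numbers VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
""")
rows = conn.execute("""
    SELECT FoodID, FoodName, split_part(Description, ',', n) AS item
    FROM food CROSS JOIN numbers
    WHERE split_part(Description, ',', n) != ''
""").fetchall()
# rows holds four (FoodID, FoodName, item) tuples, one per food item
```

Each number n picks out the n-th comma-separated piece; positions past the last item yield an empty string and are filtered out by the WHERE clause.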

Now, coming back to your original question about performance:

it's taking a longer time considering huge load of data.

Considering the typical data warehouse use case of frequent reads and seldom writes, you should keep the raw food data you have mentioned in a staging table, say stg_food.

You should then use the following kind of query to do a one-time insert into the actual food table:

insert into food
select foodId, foodName, split_part(Description, ',', n) as descriptions
from stg_food
cross join numbers
where split_part(Description, ',', n) is not null
  and split_part(Description, ',', n) != '';

This writes the data once and makes your subsequent select queries faster.
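The staging-table pattern can also be sketched in SQLite (again with `split_part` emulated in Python; stg_food, numbers, and food follow the answer's naming):

```python
import sqlite3

# One-time materialization: expand stg_food into food, so later reads
# need no splitting at query time.
conn = sqlite3.connect(":memory:")
conn.create_function(
    "split_part", 3,
    lambda s, delim, n: (s.split(delim) + [""] * n)[n - 1].strip()
)
conn.executescript("""
    CREATE TABLE stg_food (FoodID INT, FoodName TEXT, Description TEXT);
    INSERT INTO stg_food VALUES (123, 'Meal', 'chicken, soup, bread, coke');
    CREATE TABLE numbers (n INT);
    INSERT INTO numbers VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
    CREATE TABLE food (FoodID INT, FoodName TEXT, Description TEXT);
""")
conn.execute("""
    INSERT INTO food
    SELECT FoodID, FoodName, split_part(Description, ',', n)
    FROM stg_food CROSS JOIN numbers
    WHERE split_part(Description, ',', n) != ''
""")
count = conn.execute("SELECT COUNT(*) FROM food").fetchone()[0]
# count == 4, one row per item in the original Description
```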

Red Boy
  • Thank you Red Boy. I was able to use Cross Join logic to have my solution. There was no need for me to use Row Number as max there are only 9 commas (10 items). So I used a number iterator along with Cross Join. – stack_pointer is EXTINCT Jun 08 '18 at 00:38