
I have the following data set (sample):

emplid | Citizenship |
100001 | USA         |
100001 | CAN         |
100001 | CHN         |
100002 | USA         |
100002 | CHN         |
100003 | USA         |

Is there a way to transform the data into the following:

emplid | Citizenship_1 | Citizenship_2 | Citizenship_3
100001 | USA           | CHN           | CAN
100002 | USA           | CHN           |
100003 | USA           |               |

The assumption is that each emplid will have up to 4 citizenships.

I started with the following code, but for emplids that have only one citizenship, the same value is repeated in Citizenship_2 and Citizenship_3, which should just be blank:

 select *
      , substring_index(Citizenship_multiple, ',', 1) as Citizenship_1
      , substring_index(substring_index(Citizenship_multiple, ',', -1), ',', 1) as Citizenship_2
      , substring_index(substring_index(Citizenship_multiple, ',', -2), ',', 1) as Citizenship_3
      , substring_index(substring_index(Citizenship_multiple, ',', -3), ',', 1) as Citizenship_4
 from
     (select *
           , group_concat(distinct Citizenship) as Citizenship_multiple
      from `citizenship_csv_meta`
      group by emplid) a
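
(For context, the repetition happens because substring_index returns the whole string when it contains fewer delimiters than requested, so the last citizenship fills every remaining column. A minimal sketch, assuming MySQL, of one way to blank out positions beyond the actual count by also counting the distinct citizenships per emplid:)

 select emplid
      , substring_index(Citizenship_multiple, ',', 1) as Citizenship_1
      , case when cnt >= 2 then substring_index(substring_index(Citizenship_multiple, ',', 2), ',', -1) else '' end as Citizenship_2
      , case when cnt >= 3 then substring_index(substring_index(Citizenship_multiple, ',', 3), ',', -1) else '' end as Citizenship_3
      , case when cnt >= 4 then substring_index(substring_index(Citizenship_multiple, ',', 4), ',', -1) else '' end as Citizenship_4
 from
     (select emplid
           , group_concat(distinct Citizenship) as Citizenship_multiple
           , count(distinct Citizenship) as cnt
      from `citizenship_csv_meta`
      group by emplid) a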
PMa

2 Answers


You can do it with CASE and MAX:

SELECT emplid,
       max(case when Citizenship = 'USA' then 'USA' else '' end) as Citizenship_1,
       max(case when Citizenship = 'CHN' then 'CHN' else '' end) as Citizenship_2,
       max(case when Citizenship = 'CAN' then 'CAN' else '' end) as Citizenship_3
FROM citizenship_csv_meta
GROUP BY emplid
radar
  • thanks, but the data I gave is just a small sample; the real dataset will have over 170 countries, so I don't think hard-coding each country is efficient :( – PMa Nov 04 '14 at 23:16

I know you said hard-coding was a pain and likely not the best solution, but I was able to do this using only one assumption: that an employee can have at most 4 citizenships. I just joined your table to itself 4 times, using outer joins because not every employee has 4 citizenships. Here is the code, and I will explain what I did:

SELECT e.emplid, MAX(e.citizenship) AS citizenship1, 
                 MAX(e1.citizenship) AS citizenship2, 
                 MAX(e2.citizenship) AS citizenship3, 
                 MAX(e3.citizenship) AS citizenship4
FROM employee e
LEFT JOIN employee e1 ON e1.emplid = e.emplid AND e1.citizenship < e.citizenship
LEFT JOIN employee e2 ON e2.emplid = e1.emplid AND e2.citizenship < e1.citizenship
LEFT JOIN employee e3 ON e3.emplid = e2.emplid AND e3.citizenship < e2.citizenship
GROUP BY e.emplid

I joined your table to itself 4 times and took the MAX() citizenship from each alias. This works because the join condition e1.citizenship < e.citizenship ensures the previously selected values aren't included again. For example, table e2 never includes USA, so I was able to use the MAX function again.
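
With the sample data for emplid 100001 (USA, CHN, CAN), the aggregates work out like this (a walk-through of the query above, using the string ordering CAN < CHN < USA):

MAX(e.citizenship)  -> USA   (largest value overall)
MAX(e1.citizenship) -> CHN   (largest value smaller than some e value)
MAX(e2.citizenship) -> CAN   (largest value smaller than some e1 value)
MAX(e3.citizenship) -> NULL  (no value smaller than CAN)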

Note that once an employee runs out of citizenships, the remaining columns for that row are NULL, so you will need to be aware of that.

This tested beautifully on SQL Fiddle, and I actually referenced this question to figure out how to get the succeeding citizenships. Of course, I used a method slightly different from theirs, but I want to give credit where credit is due.

EDIT

If you want the null cells replaced with a blank value, refer to this SQL Fiddle.
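
(Roughly, that amounts to wrapping each aggregated column in IFNULL; a sketch of the idea, not the exact fiddle code:)

SELECT e.emplid,
       MAX(e.citizenship)              AS citizenship1,
       IFNULL(MAX(e1.citizenship), '') AS citizenship2,
       IFNULL(MAX(e2.citizenship), '') AS citizenship3,
       IFNULL(MAX(e3.citizenship), '') AS citizenship4
FROM employee e
LEFT JOIN employee e1 ON e1.emplid = e.emplid AND e1.citizenship < e.citizenship
LEFT JOIN employee e2 ON e2.emplid = e1.emplid AND e2.citizenship < e1.citizenship
LEFT JOIN employee e3 ON e3.emplid = e2.emplid AND e3.citizenship < e2.citizenship
GROUP BY e.emplid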

AdamMc331
  • I will let you know if it is working, but as of now the query is still running and has been running for over 10 mins. I am not sure if that is because my data set is large and self-joins usually slow down performance? – PMa Nov 05 '14 at 05:22
  • It's possible. I only tested with a handful of rows on SQL Fiddle, and I'm certainly no efficiency expert, but how large is your dataset that it runs for 10 minutes??? – AdamMc331 Nov 05 '14 at 05:26
  • @PerriMa I take it you had no luck? – AdamMc331 Nov 05 '14 at 12:18
  • I terminated the query since it kept running and running; I am going to test on a smaller dataset today and will keep you posted, thanks – PMa Nov 05 '14 at 15:17
  • @Perrima okay. Like I linked to, it ran smoothly on the 6 or 7 rows I had, but I don't know at what point it becomes that inefficient. – AdamMc331 Nov 05 '14 at 15:21
  • @PerriMa follow [this question](http://stackoverflow.com/questions/26786560/how-can-i-improve-this-4-count-self-join) there are some other queries in there. – AdamMc331 Nov 06 '14 at 18:27