SQL query in pandas: Concat multiple rows in column based on a combination of other columns

Question

This is response from one sql query in pandas. I want to concatenate, "Label column", together based on Issue and client. I tried Group By, but is only works with integer values. Any idea, how I can do this? Pandas based solution should also be fine.

I tried .groupby, in pandas too, command and output below, it only gives me a subset of needed dataframe.

Is it possible to update the label column against each Issue in first dataframe and removed duplicates and get output as expected below?

SQL Version is:

Microsoft SQL Server 2014

Output:

	Issue	Subject	type	Team	Sub Team	Client	Priority	CreatedOn	Label	BuiltOn	CreatedBy	Status
0	1	ABCABC	Bug	Develop	Automation	Andy	0	2021-01-11 00:00:00	Enhancement	None	John	InProgress
1	2	DEFDEF	Bug	Develop	Automation	Judy	0	2021-01-10 00:00:00	Feature	None	Andre	New
2	3	HIGHIG	Bug	Develop	Testing123	Cathy	2	2021-02-11 00:00:00	Feature	None	Keith	New
3	3	HIGHIG	Bug	Develop	Testing123	Cathy	2	2021-02-11 00:00:00	Internal	None	Keith	New
4	4	XYZXYZ	Bug	Develop	Automation	Jack	1	2021-05-11 00:00:00	Enhancement	None	Maya	Analysis
5	4	XYZXYZ	Bug	Develop	Automation	Jack	1	2021-05-11 00:00:00	Internal	None	Maya	Analysis
6	4	XYZXYZ	Bug	Develop	Automation	Larry	1	2021-05-11 00:00:00	Enhancement	None	Maya	Analysis
7	4	XYZXYZ	Bug	Develop	Automation	Larry	1	2021-05-11 00:00:00	Internal	None	Maya	Analysis
8	4	XYZXYZ	Bug	Develop	Automation	Colin	1	2021-05-11 00:00:00	Enhancement	None	Maya	Analysis
9	4	XYZXYZ	Bug	Develop	Automation	Colin	1	2021-05-11 00:00:00	Internal	None	Maya	Analysis
10	4	XYZXYZ	Bug	Develop	Automation	Nitin	1	2021-05-11 00:00:00	Enhancement	None	Maya	Analysis
11	4	XYZXYZ	Bug	Develop	Automation	Nitin	1	2021-05-11 00:00:00	Internal	None	Maya	Analysis
12	4	XYZXYZ	Bug	Develop	Automation	Lisa	1	2021-05-11 00:00:00	Enhancement	None	Maya	Analysis
13	4	XYZXYZ	Bug	Develop	Automation	Lisa	1	2021-05-11 00:00:00	Internal	None	Maya	Analysis

Expected (Note the label column):

	Issue	Subject	Issue_type	Team	Sub Team	Client	Priority	CreatedOn	Label	BuiltOn	CreatedBy	Status
0	1	ABC	Bug	Develop	Automation	Andy	0	2021-01-11 00:00:00	Enhancement	None	John	InProgress
1	2	DEF	Bug	Develop	Automation	Judy	0	2021-01-10 00:00:00	Feature	None	Andre	New
2	3	HIG	Bug	Develop	Testing	Cathy	2	2021-02-11 00:00:00	Feature, Internal	None	Keith	New
3	4	XYZ	Bug	Develop	Automation	Jack	1	2021-05-11 00:00:00	Enhancement, Internal	None	Maya	Analysis
4	4	XYZ	Bug	Develop	Automation	Larry	1	2021-05-11 00:00:00	Enhancement, Internal	None	Maya	Analysis
5	4	XYZ	Bug	Develop	Automation	Colin	1	2021-05-11 00:00:00	Enhancement, Internal	None	Maya	Analysis
6	4	XYZ	Bug	Develop	Automation	Nitin	1	2021-05-11 00:00:00	Enhancement, Internal	None	Maya	Analysis
7	4	XYZ	Bug	Develop	Automation	Lisa	1	2021-05-11 00:00:00	Enhancement, Internal	None	Maya	Analysis

Update: This is they query:

SELECT I.Issue,
       I.Subject,
       I.type, 
       P.Team, 
       P.Subteam,
       CR.Client,
       I.Priority,
       I.CreatedOn,
       L.Label,
       I.BuiltOn,
       I.CreatedBy,
       I.Status
 FROM master.IssueRequests AS I 
 JOIN master.Participants AS P 
   ON P.Issue = I.Issue 
 JOIN master.ClientRecords AS CR 
   ON CR.Issue = I.Issue 
 JOIN master.IssueLabels AS L
   ON L.Issue = I.Issue
 WHERE I.Issue IN ('2652523', '2703670', '2984120')

Update2 Output of df.groupby:

df.groupby(['Issue', 'Client'])['Label'].apply(','.join).reset_index()

Output:

	Issue	Client	Label
0	1	Andy	Enhancement
1	2	Judy	Feature
2	3	Cathy	Feature,Internal
3	4	Colin	Enhancement,Internal
4	4	Jack	Enhancement,Internal
5	4	Larry	Enhancement,Internal
6	4	Lisa	Enhancement,Internal
7	4	Nitin	Enhancement,Internal

Clarification: Merging on all columns except Label will not work since in some cases, some of the other data might be "null" or different, which can cause data to be missed altogether. If the data is different in other columns, I can keep the first instance of that data.

@akshat what do you want to happen to the other columns? Is priority the average? and which created on date do you want to be kept? same with createdby and status etc — Emi OB, Dec 03 '21 at 09:53
@EmiOB, that does not work in all cases, just gave a comment on buddemat's answer below. — akshat, Dec 03 '21 at 10:05
Then your question needs to be edited and @EmiOB's question needs an answer. What do you want to happen with the other columns if the data is different? You cannot just _keep them_ if they should be grouped... — buddemat, Dec 03 '21 at 10:07
@buddemat, if the data is different in other columns, I can keep the first instance of that data. Adding 50 columns to the question will too much, ill try to find the example columns from the data and update. — akshat, Dec 03 '21 at 10:50

buddemat · Answer 1 · 2021-12-03T14:26:11.733

Update: After clarification of the OP, it seems that the problem actually is a bit different: The columns' contents except the grouping columns Issue and Client may actually differ between grouped rows and the final result should contain the first row's value for columns that differ between such grouped rows.

An approach to do this could be to perform the grouping in Python as you did before and then join (using merge()) that with a version of the original dataframe in which you drop all duplicates based on Issue and Client (as well as the Label column). That gives you exactly the first instance for each grouped row in case data differs.

Without additional arguments, merge() will automatically do an inner join on all columns available in both dataframes, which in this case are Issue and Client:

df.groupby(['Issue', 'Client'])['Label'].apply(','.join).reset_index().merge(df.drop('Label', axis=1).drop_duplicates(['Issue', 'Client']))

Output:

	Issue	Subject	type	Team	Sub Team	Client	Priority	CreatedOn	BuiltOn	CreatedBy	Status	Label
0	1	ABCABC	Bug	Develop	Automation	Andy	0	2021-01-11 00:00:00	None	John	InProgress	Enhancement
1	2	DEFDEF	Bug	Develop	Automation	Judy	0	2021-01-10 00:00:00	None	Andre	New	Feature
2	3	HIGHIG	Bug	Develop	Testing123	Cathy	2	2021-02-11 00:00:00	None	Keith	New	Feature,Internal
3	4	XYZXYZ	Bug	Develop	Automation	Colin	1	2021-05-11 00:00:00	None	Maya	Analysis	Enhancement,Internal
4	4	XYZXYZ	Bug	Develop	Automation	Jack	1	2021-05-11 00:00:00	None	Maya	Analysis	Enhancement,Internal
5	4	XYZXYZ	Bug	Develop	Automation	Larry	1	2021-05-11 00:00:00	None	Maya	Analysis	Enhancement,Internal
6	4	XYZXYZ	Bug	Develop	Automation	Lisa	1	2021-05-11 00:00:00	None	Maya	Analysis	Enhancement,Internal
7	4	XYZXYZ	Bug	Develop	Automation	Nitin	1	2021-05-11 00:00:00	None	Maya	Analysis	Enhancement,Internal

Previous answer:

Just group by all of the other columns:

import pandas as pd

data = [[1, 'ABCABC', 'Bug', 'Develop', 'Automation', 'Andy', 0, '2021-01-11 00:00:00', 'Enhancement', 'None', 'John', 'InProgress'],
[2, 'DEFDEF', 'Bug', 'Develop', 'Automation', 'Judy', 0, '2021-01-10 00:00:00', 'Feature', 'None', 'Andre', 'New'],
[3, 'HIGHIG', 'Bug', 'Develop', 'Testing123', 'Cathy', 2, '2021-02-11 00:00:00', 'Feature', 'None', 'Keith', 'New'],
[3, 'HIGHIG', 'Bug', 'Develop', 'Testing123', 'Cathy', 2, '2021-02-11 00:00:00', 'Internal', 'None', 'Keith', 'New'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Jack', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Jack', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Larry', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Larry', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Colin', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Colin', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Nitin', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Nitin', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Lisa', 1, '2021-05-11 00:00:00', 'Enhancement', 'None', 'Maya', 'Analysis'],
[4, 'XYZXYZ', 'Bug', 'Develop', 'Automation', 'Lisa', 1, '2021-05-11 00:00:00', 'Internal', 'None', 'Maya', 'Analysis']]

df = pd.DataFrame(data, columns = ['Issue', 'Subject', 'type', 'Team', 'Sub Team', 'Client', 'Priority', 'CreatedOn', 'Label', 'BuiltOn', 'CreatedBy', 'Status']) 

df.groupby(['Issue', 'Subject', 'type', 'Team', 'Sub Team', 'Client', 'Priority', 'CreatedOn', 'BuiltOn', 'CreatedBy', 'Status'])['Label'].apply(','.join).reset_index()

If you don't want to write all column names, you can also use list comprehension to automatically build the list and exclude just the Label column from it, similar to this SO answer:

df.groupby([col for col in list(df) if col not in ['Label']])['Label'].apply(','.join).reset_index()

I tried this too, however, there are two issues with this, in my actual dataframe, there are more 50 columns, and I have multiple such queries. So it is quite error prone to write and all columns. also, in some cases, some of the other data might be "null" or different. which can causes data to be missed all together. — akshat, Dec 03 '21 at 10:02
Well, then please edit your question so that your example data and your question reflect that and tell us, how you would the result to look like when the data are different. — buddemat, Dec 03 '21 at 10:05
I modified my answer to fir your revised problem description. — buddemat, Dec 03 '21 at 14:19

buddemat · Answer 2 · 2021-12-03T15:42:32.687

I'll offer a second approach using SQL.

Concatenating strings in a GROUP BY is a bit tricky in MS Sql Server, since there is no direct function as in other RDBMS. However, there is a workaround using FOR XML and PATH that can be adapted to your problem.

The following statement gives you the concatenated labels based on your original query:

WITH tmp AS (SELECT I.Issue,
                    I.Subject,
                    I.type, 
                    P.Team, 
                    P.Subteam,
                    CR.Client,
                    I.Priority,
                    I.CreatedOn,
                    L.Label,
                    I.BuiltOn,
                    I.CreatedBy,
                    I.Status
              FROM master.IssueRequests AS I 
              JOIN master.Participants AS P 
                ON P.Issue = I.Issue 
              JOIN master.ClientRecords AS CR 
                ON CR.Issue = I.Issue 
              JOIN master.IssueLabels AS L
                ON L.Issue = I.Issue
              WHERE I.Issue IN ('2652523', '2703670', '2984120')
)
SELECT A.Issue,
       A.Client,
       STUFF((
           SELECT ', ' + B.Label 
             FROM tmp B 
            WHERE ISNULL(B.Issue, '') = ISNULL(A.Issue, '')
              AND ISNULL(B.Client, '') = ISNULL(A.Client, '')
         ORDER BY B.Issue 
          FOR XML PATH('')), 1, 2, ''
    ) AS Label
FROM
    tmp A
GROUP BY 
    A.Issue, A.Client

This gives you

Issue	Client	Label
1	Andy	Enhancement
2	Judy	Feature
3	Cathy	Feature,Internal
4	Colin	Enhancement,Internal
4	Jack	Enhancement,Internal
4	Larry	Enhancement,Internal
4	Lisa	Enhancement,Internal
4	Nitin	Enhancement,Internal

You can then use ROW_NUMBER() to JOIN this with the first row of each Issue-Client-combination:

WITH tmp AS (SELECT I.Issue,
                    I.Subject,
                    I.type, 
                    P.Team, 
                    P.Subteam,
                    CR.Client,
                    I.Priority,
                    I.CreatedOn,
                    L.Label,
                    I.BuiltOn,
                    I.CreatedBy,
                    I.Status
              FROM master.IssueRequests AS I 
              JOIN master.Participants AS P 
                ON P.Issue = I.Issue 
              JOIN master.ClientRecords AS CR 
                ON CR.Issue = I.Issue 
              JOIN master.IssueLabels AS L
                ON L.Issue = I.Issue
              WHERE I.Issue IN ('2652523', '2703670', '2984120')
)
SELECT C.Issue,
       C.Subject,
       C.typ, 
       C.Team, 
       C.Subteam,
       C.Client,
       C.Priority,
       C.CreatedOn,
       D.Label,
       C.BuiltOn,
       C.CreatedBy,
       C.Status  
  FROM (SELECT tmp.*, 
               row_number() OVER (PARTITION BY Issue, Client ORDER BY Issue) as rn
        FROM tmp) C
  JOIN (SELECT A.Issue,
               A.Client,
               STUFF((
                   SELECT ', ' + B.Label 
                     FROM tmp B 
                    WHERE ISNULL(B.Issue, '') = ISNULL(A.Issue, '')
                      AND ISNULL(B.Client, '') = ISNULL(A.Client, '')
                 ORDER BY B.Issue 
                  FOR XML PATH('')), 1, 2, ''
            ) AS Label
        FROM tmp A
        GROUP BY A.Issue, A.Client) D
   ON C.Issue = D.Issue AND C.Client = D.Client AND C.rn = 1

which gives you your desired result:

Issue	Subject	type	Team	Sub Team	Client	Priority	CreatedOn	BuiltOn	CreatedBy	Status	Label
1	ABCABC	Bug	Develop	Automation	Andy	0	2021-01-11 00:00:00	None	John	InProgress	Enhancement
2	DEFDEF	Bug	Develop	Automation	Judy	0	2021-01-10 00:00:00	None	Andre	New	Feature
3	HIGHIG	Bug	Develop	Testing123	Cathy	2	2021-02-11 00:00:00	None	Keith	New	Feature,Internal
4	XYZXYZ	Bug	Develop	Automation	Colin	1	2021-05-11 00:00:00	None	Maya	Analysis	Enhancement,Internal
4	XYZXYZ	Bug	Develop	Automation	Jack	1	2021-05-11 00:00:00	None	Maya	Analysis	Enhancement,Internal
4	XYZXYZ	Bug	Develop	Automation	Larry	1	2021-05-11 00:00:00	None	Maya	Analysis	Enhancement,Internal
4	XYZXYZ	Bug	Develop	Automation	Lisa	1	2021-05-11 00:00:00	None	Maya	Analysis	Enhancement,Internal
4	XYZXYZ	Bug	Develop	Automation	Nitin	1	2021-05-11 00:00:00	None	Maya	Analysis	Enhancement,Internal

You can test it in this db<>dfiddle

SQL query in pandas: Concat multiple rows in column based on a combination of other columns

2 Answers2