
I have a table with a structure like the following, with an unknown number of rows for each group index.

Group || PropertyA || PropertyB || PropertyC
============================================
  1   ||   x1      ||    x12    ||    x13
  2   ||   x21     ||    x23    ||    x23
  3   ||   x31     ||    x32    ||    x33
  3   ||   x41     ||    x42    ||    x43
 ...       ...           ...          ...

I want all rows with the same Group index to be in a single row, concatenated, as follows:

Group || PropertyA || PropertyB || PropertyC || PropertyA1 || PropertyB1 || PropertyC1 ...
==========================================================================================
  1   ||   x1      ||    x12    ||    x13    ||   NULL    ||   NULL    ||  NULL
  2   ||   x21     ||    x23    ||    x23    ||   NULL    ||   NULL    ||  NULL
  3   ||   x31     ||    x32    ||    x33    ||   x41     ||    x42    ||   x43

 ...       ...           ...          ...

I have attempted this using dynamic SQL, and have also tried pandas.pivot() and pandas.pivot_table(), but my skills with both have failed me so far. The database I have access to is SQL Server.

The issue is that I'm trying to add an unknown number of columns, since a Group index might appear multiple times.

Anything that even approximates the result could be a great help - the column names are unimportant, and can simply repeat. I just need all the data from each Group in one long row.

Any help is greatly appreciated.

2 Answers


We can use DataFrame.pivot_table with GroupBy.cumcount and remove the MultiIndex from the columns at the end.

new_df = (df.pivot_table(index='Group', 
                         columns=df.groupby('Group').cumcount(),
                         aggfunc='first')
            .sort_index(axis=1, level=1))
new_df = new_df.set_axis([f'{x}{y}' if y != 0 else x 
                           for x, y in new_df.columns],
                           axis=1).reset_index()
print(new_df)

Output

   Group PropertyA PropertyB PropertyC PropertyA1 PropertyB1 PropertyC1
0      1        x1       x12       x13        NaN        NaN        NaN
1      2       x21       x23       x23        NaN        NaN        NaN
2      3       x31       x32       x33        x41        x42        x43

Note that aggfunc='first' is used simply so the aggregation does not fail when np.mean (the default aggfunc) meets str values. Because cumcount makes each (Group, counter) pair unique, 'first' never has more than one value to pick from, so this method loses no information from your DataFrame.

If you want to learn more about pivot you can read How pivot a DataFrame. Your question is similar to Question 10.
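An equivalent sketch (sample data reconstructed from the question; values assumed) uses set_index with the same cumcount counter and unstack instead of pivot_table, which sidesteps the aggfunc question entirely:

```python
import pandas as pd

# Sample data reconstructed from the question (values assumed).
df = pd.DataFrame({'Group': [1, 2, 3, 3],
                   'PropertyA': ['x1', 'x21', 'x31', 'x41'],
                   'PropertyB': ['x12', 'x23', 'x32', 'x42'],
                   'PropertyC': ['x13', 'x23', 'x33', 'x43']})

# Number the rows within each group, make that counter a second
# index level, then unstack it into extra columns.
wide = (df.set_index(['Group', df.groupby('Group').cumcount()])
          .unstack()
          .sort_index(axis=1, level=1))

# Flatten the MultiIndex columns: suffix with the counter except for 0.
wide.columns = [f'{name}{i}' if i else name for name, i in wide.columns]
wide = wide.reset_index()
print(wide)
```

Groups with fewer rows than the widest group end up with NaN in the extra columns, matching the NULLs in the desired output.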

I decided not to close this question as a duplicate for two reasons: one is that `insert` is not necessary here, and the other is that the columns have to be renamed in a somewhat different way.

ansev

It's going to depend on what format you need the final result in. If you're planning on changing the database on SQL Server, you're going to have to use an SQL "ALTER TABLE" to add columns, since a SQL table needs a fixed number of columns - this might be a bad idea.

However, if you're just pulling the data out to use it some other way in Python, you can manipulate it pretty easily in lists. If your data is a 2-dimensional list (a list of lists), then each inner list can have a variable length.

output_table = []
groups = []
for row in table:  # table: list of rows, each starting with the group index
    if row[0] in groups:
        # Find the row already started for this group and append the
        # new values, skipping the repeated group index.
        row_to_extend = next(line for line in output_table if line[0] == row[0])
        row_to_extend.extend(row[1:])
    else:
        groups.append(row[0])
        output_table.append(row)
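As a runnable sketch of this approach (sample data assumed from the question; the final None-padding step is an addition, included so the output mirrors the NULL columns the asker described):

```python
# Sample data reconstructed from the question (values assumed).
table = [[1, 'x1', 'x12', 'x13'],
         [2, 'x21', 'x23', 'x23'],
         [3, 'x31', 'x32', 'x33'],
         [3, 'x41', 'x42', 'x43']]

output_table = []
groups = []
for row in table:
    if row[0] in groups:
        # Append this repeat row's values (minus the repeated group
        # index) to the row already started for this group.
        row_to_extend = next(line for line in output_table if line[0] == row[0])
        row_to_extend.extend(row[1:])
    else:
        groups.append(row[0])
        output_table.append(row)

# Pad shorter rows with None so every row has the same width,
# matching the NULLs in the desired output.
width = max(len(r) for r in output_table)
for r in output_table:
    r.extend([None] * (width - len(r)))
```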
Kyle Alm
  • We have pandas methods that were created for exactly these tasks... I think using loops here is slow, besides being syntax that is much harder to read – ansev Apr 17 '20 at 14:14
  • pandas is using loops to build the dataframe. To each their own - I personally try not to invoke a module if it's not necessary. Namespace readability is nice, but to me, logic readability is better - wysiwyg. – Kyle Alm Apr 17 '20 at 19:05
  • Just check the times both methods take (to get the exact solution) on different data frames and you will see that their method is much slower. Pandas methods obviously use loops like any code, but no external loops are used; they are optimized on numpy, and it is obviously logical to use them. *I personally try not to invoke a module if it's not necessary* - you really may not even need to use Python, but Python is used for a reason. Same with pandas. I personally try not to write code with `if-else` inside a loop if it is not necessary – ansev Apr 17 '20 at 20:40
  • The time required for Pandas to operate on a dataframe isn't comparable without including the time it takes pandas to also create the dataframe. – Kyle Alm Apr 19 '20 at 13:43