
I have a pandas dataframe that is the result of a query where one column creates duplicate rows. I need help identifying the non-duplicate values within each group of duplicates by name, dynamically creating new columns to hold all of those values, and then deleting the duplicate rows. Below, Mike has duplicates in column "Code" and Mark in "Lang", so I'd like one row for each, with new columns for the non-duplicate values.

 ID |  Name  |  Code  |  Lang  |
 1  |  Mike  |   25   |   SQL  |
 1  |  Mike  |   26   |   SQL  |
 1  |  Mike  |   27   |   SQL  |
 2  |  Mark  |   39   |  NoSQL |
 2  |  Mark  |   39   |   SQL  |

Loop through and identify which columns are not duplicated, copy each non-duplicate value, write it to a new column in the first row of the duplicate group, then delete the duplicate rows.

 ID |  Name  |  Code  |  Code2 |  Code3 |  Lang  | Lang2 |
 1  |  Mike  |   25   |   26   |   27   |   SQL  |   .   |
 2  |  Mark  |   39   |    .   |    .   |  NoSQL |  SQL  |

I'm able to get to just the duplicate rows with the line below, but after a lot of research I'm still having trouble getting to my desired result. I'm exploring pivot and melt as options but am stuck on the dynamic column part.

dup_rows = orig_df[orig_df.duplicated(['Name'])]
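
For context, a minimal sketch that rebuilds the example above (the data are re-typed from the table, not my real query result) and flags the duplicate rows:

import pandas as pd

# Rebuild the example frame from the table above (hypothetical sample data)
orig_df = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2],
    'Name': ['Mike', 'Mike', 'Mike', 'Mark', 'Mark'],
    'Code': [25, 26, 27, 39, 39],
    'Lang': ['SQL', 'SQL', 'SQL', 'NoSQL', 'SQL'],
})

# Rows flagged as duplicates on Name (everything but the first occurrence per name)
dup_rows = orig_df[orig_df.duplicated(['Name'])]
print(dup_rows)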
  • Is there a particular reason why you need that specific output? In general I find wide data are harder to work with, and can also take up a lot more space. The above manipulation definitely isn't trivial. – ALollz Feb 24 '20 at 22:35
  • I have a lot of duplicates generated by a few columns that I would like to retain as new columns. In most cases it's one or two duplicates and only 1-2 columns, so it wouldn't get too wide. I can do this fairly easily in SAS but have since switched to python. I've also looked into writing it in SQL but thought python would have functions to do this fairly easily. I could live with this not being dynamic and identify which columns produced the duplicates and address those only. – CGermain Feb 24 '20 at 22:47

1 Answer


We can mark the duplicates per group with GroupBy, duplicated and cumsum. Then we use pivot_table to pivot the rows to columns, and finally pd.concat to get a single dataframe back:

import pandas as pd

columns = ['Code', 'Lang']
dfs = []

for col in columns:
    # Number the distinct values within each ID/Name group: Code1, Code2, ...
    df['cols'] = (
        col + df.groupby(['ID', 'Name'], sort=False)
                .apply(lambda x: (~x[col].duplicated()).cumsum()).astype(str).to_numpy()
    )
    # Pivot those labels to columns, keeping the first value seen for each label
    dfs.append(df.pivot_table(index=['ID', 'Name'], columns='cols', values=col, aggfunc='first'))

# Stitch the per-column pivots back together into a single frame
dfn = pd.concat(dfs, axis=1).reset_index().rename_axis(None, axis=1)

   ID  Name  Code1  Code2  Code3  Lang1 Lang2
0   1  Mike   25.0   26.0   27.0    SQL   NaN
1   2  Mark   39.0    NaN    NaN  NoSQL   SQL
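
Note that the Code columns come back as floats (25.0, NaN, ...) because the gaps introduced by pivoting are NaN. If integer output matters, one option, a small sketch assuming pandas' nullable Int64 dtype is available, is:

# Optional: convert the float Code columns back to nullable integers
code_cols = [c for c in dfn.columns if c.startswith('Code')]
dfn[code_cols] = dfn[code_cols].astype('Int64')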
  • For simplicity I included a summarized example from my real df. In my df I have ~40 columns with varying dtypes. When I include only the two columns that I would like to write values to new columns, ['Code', 'Lang'], it doesn't work for my entire df. When I included all other columns in the df.groupby, I received this error due to the various dtypes I have: TypeError: ufunc 'add' did not contain a loop with signature matching types dtype(' . I'm going to try splitting the df, applying the above, and merging. – CGermain Mar 01 '20 at 13:27
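
A sketch of that split-and-merge idea (hypothetical column names; only the keys and the duplicate-generating columns go through the loop above, and everything else is merged back afterwards):

# Assumes dfn was built as in the answer, but from df[key_cols + wide_cols]
key_cols  = ['ID', 'Name']          # assumed grouping keys
wide_cols = ['Code', 'Lang']        # the duplicate-generating columns

# Re-attach the remaining ~40 columns, keeping one row per key
rest = df.drop(columns=wide_cols).drop_duplicates(subset=key_cols)
result = dfn.merge(rest, on=key_cols, how='left')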