-1

I have a pandas dataframe that looks something like this:

employeeId     cumbId firstName lastName        emailAddress  \
0    E123456  102939485    Andrew   Hoover   hoovera@xyz.com   
1    E123457  675849302      Curt   Austin  austinc1@xyz.com   
2    E123458  354852739   Celeste  Riddick  riddickc@xyz.com   
3    E123459  937463528     Hazel   Tooley   tooleyh@xyz.com     

  employeeIdTypeCode cumbIDTypeCode entityCode sourceCode roleCode  
0                001            002      AE      AWB    EMPLR  
1                001            002      AE      AWB    EMPLR  
2                001            002      AE      AWB    EMPLR  
3                001            002      AE      AWB    EMPLR  

I want it to look something like this for each ID and IDtypecode in the pandas dataframe:

idvalue   IDTypeCode  firstName lastName  emailAddress  entityCode  sourceCode  roleCode  CodeName
E123456   001         Andrew    Hoover    hoovera@xyz.com AE        AWB         EMPLR     1
102939485 002         Andrew    Hoover    hoovera@xyz.com AE        AWB         EMPLR     1

Can this be achieved with some function in pandas dataframe? I also want it to be dynamic based on the number of IDs that are in the dataframe.

What I mean by dynamic is this, if there are 3 Ids then this is how it should look like:

idvalue   IDTypeCode  firstName lastName  emailAddress  entityCode  sourceCode  roleCode  CodeName
A123456   001         Andrew    Hoover    hoovera@xyz.com AE        AWB         EMPLR     1
102939485 002         Andrew    Hoover    hoovera@xyz.com AE        AWB         EMPLR     1
M1000     003         Andrew    Hoover    hoovera@xyz.com AE        AWB         EMPLR     1

Thank you!

Manas Jani
  • 699
  • 2
  • 11
  • 33
  • You have two IDtypecodes in your example: employeeIdTypeCode and cumbIDTypeCode. If you could create a reproducible example with clear column names and a clear result like how (ID, IDTypeCode) are being grouped, and your expected result, it would go a long way. – Jarad Dec 12 '18 at 20:37
  • I want it grouped based on the ID and the IDtypeCode, it's just that the resulting dataframe needs to have different column name. So `employeeId` and `cumbId` values need to go under `idValue` and similarly for `IDtypeCode` as well – Manas Jani Dec 12 '18 at 20:39

2 Answers2

1

I think this is what you are looking for... you can use concat after splitting out the parts of your dataframe:

# create a new df without the id columns
df2 = df.loc[:, ~df.columns.isin(['employeeId','employeeIdTypeCode'])]

# rename columns to match the df columns names that they "match" to
df2 = df2.rename(columns={'cumbId':'employeeId', 'cumbIDTypeCode':'employeeIdTypeCode'})

# concat you dataframes
pd.concat([df,df2], sort=False).drop(columns=['cumbId','cumbIDTypeCode']).sort_values('firstName')

# rename columns here if you want

update

# sample df
  employeeId     cumbId  otherId1 firstName lastName      emailAddress  \
0    E123456  102939485         5    Andrew   Hoover   hoovera@xyz.com   
1    E123457  675849302         5      Curt   Austin  austinc1@xyz.com   
2    E123458  354852739         5   Celeste  Riddick  riddickc@xyz.com   
3    E123459  937463528         5     Hazel   Tooley   tooleyh@xyz.com   

   employeeIdTypeCode  cumbIDTypeCode  otherIdTypeCode1 entityCode sourceCode  \
0                   1               2                 6         AE        AWB   
1                   1               2                 6         AE        AWB   
2                   1               2                 6         AE        AWB   
3                   1               2                 6         AE        AWB   

  roleCode  
0    EMPLR  
1    EMPLR  
2    EMPLR  
3    EMPLR  

There has to be some rules in place:

rule 1. there are always two "match columns" rule 2. all the matched ids are next to each other rule 3. you know the number of Ids groups (rows to add)

def myFunc(df, num_id): # num_id is the number of id groups 
    # find all columns that contain the string id
    id_col = df.loc[:, df.columns.str.lower().str.contains('id')].columns

    # rename columns to id_0 and id_1
    df = df.rename(columns=dict(zip(df.loc[:, df.columns.str.lower().str.contains('id')].columns,
                                ['id_'+str(i) for i in range(int(len(id_col)/num_id)) for x in range(num_id)])))

    # groupby columns and values.tolist
    new = df.groupby(df.columns.values, axis=1).agg(lambda x: x.values.tolist())

    data = []

    # for-loop to explode the lists
    for n in range(len(new.loc[:, new.columns.str.lower().str.contains('id')].columns)):
        s = new.loc[:, new.columns.str.lower().str.contains('id')]
        i = np.arange(len(new)).repeat(s.iloc[:,n].str.len())
        data.append(new.iloc[i, :-1].assign(**{'id_'+str(n): np.concatenate(s.iloc[:,n].values)}))

    # remove the list from all cells
    data0 = data[0].applymap(lambda x: x[0] if isinstance(x, list) else x).drop_duplicates()
    data1 = data[1].applymap(lambda x: x[0] if isinstance(x, list) else x).drop_duplicates()

    # update dataframes
    data0.update(data1[['id_1']])

    return data0

myFunc(df,3)


      emailAddress entityCode firstName       id_0  id_1 lastName roleCode
0   hoovera@xyz.com         AE    Andrew    E123456     1   Hoover    EMPLR
0   hoovera@xyz.com         AE    Andrew  102939485     2   Hoover    EMPLR
0   hoovera@xyz.com         AE    Andrew          5     6   Hoover    EMPLR
1  austinc1@xyz.com         AE      Curt    E123457     1   Austin    EMPLR
1  austinc1@xyz.com         AE      Curt  675849302     2   Austin    EMPLR
1  austinc1@xyz.com         AE      Curt          5     6   Austin    EMPLR
2  riddickc@xyz.com         AE   Celeste    E123458     1  Riddick    EMPLR
2  riddickc@xyz.com         AE   Celeste  354852739     2  Riddick    EMPLR
2  riddickc@xyz.com         AE   Celeste          5     6  Riddick    EMPLR
3   tooleyh@xyz.com         AE     Hazel    E123459     1   Tooley    EMPLR
3   tooleyh@xyz.com         AE     Hazel  937463528     2   Tooley    EMPLR
3   tooleyh@xyz.com         AE     Hazel          5     6   Tooley    EMPLR
It_is_Chris
  • 13,504
  • 2
  • 23
  • 41
  • This is perfect! I did not even think about it. The only thing other than this is that I wanted it to do it dynamically based on the `Id` and `IdtypeCode` values. Is there a way to do that? Since I am want to generate as many rows as there are `Id` and `IdtypeCode` values. – Manas Jani Dec 13 '18 at 13:30
  • @ManasJani Can you explain a little more: exactly what you mean by "dynamically based on the `Id` and `IdtypeCode` values" – It_is_Chris Dec 13 '18 at 13:45
  • I have 2 `Id` columns now, `employeeId` and `CumbId` which have been aligned to create 2 rows, so if I get more IDs in my input, then I want to populate that many rows in my final dataframe. – Manas Jani Dec 13 '18 at 13:50
  • @ManasJani so you may have `n` number of columns that are id columns correct? – It_is_Chris Dec 13 '18 at 14:32
  • @ManasJani see update is this what you are looking for? – It_is_Chris Dec 13 '18 at 14:49
  • Hey @Chris, I updated my question to show what I meant by dynamic. I am trying to get it to that level, but the code I am trying to write is just not helping me out. – Manas Jani Dec 13 '18 at 16:34
  • Wow, this is awesome! Thank you so much man, really appreciate it. I need to convert this to a nested JSON structure. Could you possibly help me with that? Here is the question that I have posted, https://stackoverflow.com/questions/53731125/pandas-dataframe-to-a-dynamic-nested-json – Manas Jani Dec 14 '18 at 00:41
0

As I understood, for each source row you want to generate 2 rows:

  • employeeId (renamed to idvalue), then IDTypeCode = '001', then 'remainig' columns (but not all) and finally CodeName = '1'.
  • cumbId, then IDTypeCode = '002', the same 'remainig' columns and CodeName (also = '1').

So the program given below generates such 2 DataFrames (df1 and df2) and then generates the result "interleaving" their rows.

import pandas as pd

data = [
    [ 'E123456', '102939485', 'Andrew',  'Hoover',  'hoovera@xyz.com',  '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123457', '675849302', 'Curt',    'Austin',  'austinc1@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123458', '354852739', 'Celeste', 'Riddick', 'riddickc@xyz.com', '001', '002', 'AE', 'AWB', 'EMPLR' ],
    [ 'E123459', '937463528', 'Hazel',   'Tooley',  'tooleyh@xyz.com',  '001', '002', 'AE', 'AWB', 'EMPLR' ]
]
df = pd.DataFrame(data=data, columns=['employeeId', 'cumbId', 'firstName', 'lastName',
    'emailAddress', 'employeeIdTypeCode', 'cumbIDTypeCode', 'entityCode', 'sourceCode',
    'roleCode' ])
# 'Remainig' columns
cols = ['firstName', 'lastName', 'emailAddress', 'entityCode', 'sourceCode', 'roleCode']
# df1: employeeId, IDTypeCode = '001' and 'remainig' columns
df1 = df[['employeeId']].set_axis(['idvalue'], axis=1, inplace=False)
df1['IDTypeCode'] = '001'
df1 = df1.join(df[cols])
df1['CodeName'] = '1'
# df2: cumbId, IDTypeCode = '002' and 'remainig' columns
df2 = df[['cumbId']].set_axis(['idvalue'], axis=1, inplace=False)
df2['IDTypeCode'] = '002'
df2 = df2.join(df[cols])
df2['CodeName'] = '1'
# Result
result = pd.concat([df1,df2]).sort_index().reset_index(drop=True)
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41