Here is one possible approach that avoids using apply
Raw DataFrame
date tableNameFrom tableNameJoin attributeName
0 29-03-2019 film language [film.languageId, language.languageID]
1 30-03-2019 inventory rental [invetory.inventoryId, rental.filmId]
Step 1 - (1) Split attributeName
into 2 separate columns using ,
as the separator, (2) remove unwanted square brackets ([
or ]
), (3) drop unwanted columns
# 1
df[['tableName','attributeName2']] = df['attributeName'].str.split(',', expand=True)
# 2
df['tableName'] = df['tableName'].str.strip('[')
df['attributeName2'] = df['attributeName2'].str.strip(']')
# 3
df.drop(['attributeName','tableNameFrom','tableNameJoin'], axis=1, inplace=True)
print(df)
date tableName attributeName2
0 29-03-2019 film.languageId language.languageID
1 30-03-2019 invetory.inventoryId rental.filmId
Step 2 - Finally, use this SO post to split the rows from the columns tableName
and attributeName2
into separate columns
df_match = (df.set_index(['date'])
.stack()
.str.split('.', expand=True)
.stack()
.unstack(-1)
.reset_index(-1, drop=True)
.reset_index()
)
df_match.columns = ['date','tableName','attributeName']
print(df_match)
date tableName attributeName
0 29-03-2019 film languageId
1 29-03-2019 language languageID
2 30-03-2019 invetory inventoryId
3 30-03-2019 rental filmId
Details
.set_index(['date']
- set the columns that are to be kept as the index of the DataFrame
.stack()
- stack rows
.str.split('.', expand=True)
- call str.split on period (.
) to split those cells into separate columns
.stack()
- get rid of NULL
values, so call stack
again
.unstack(-1)
- since we want the last level of the index to become our columns, so unstack using
unstack(-1)
(unstack on the last level)
- I have shown this in detail below1
.reset_index(-1, drop=True)
- eliminate superfluous last level using reset_index(-1)
1 Here is the output before the .unstack(-1)
step and an explanation of why we need to use -1
inside unstack()
df_intermediate = (df.set_index(['date'])
.stack()
.str.split('.', expand=True)
.stack()
)
print(df_intermediate)
date
29-03-2019 tableName 0 film
1 languageId
attributeName2 0 language
1 languageID
30-03-2019 tableName 0 invetory
1 inventoryId
attributeName2 0 rental
1 filmId
dtype: object
- this shows that we want the last index level (with values 0, 1) to become the columns so we choose
-1
in .unstack(-1)