Pandas multi-column explode not working on Microsoft Azure

Question

I am trying to run Python code in Microsoft Azure, and the explode function from the Pandas library keeps throwing errors there.

My code runs perfectly fine locally in Spyder (with Pandas version 1.4.4), but it does not work when I run it on Azure. I know that the error is not caused by rows with lists of different lengths, because the code runs as desired locally on the same data. The issue only occurs on Azure.

When I try to use the multi-column explode function (which is available for Pandas version 1.3.0 and later), I get the following error: column must be a scalar

If I instead use DataFrame.apply with a custom lambda function to explode only my desired columns (based on this StackOverflow answer), while keeping the rest of the columns as they are, I get the following error: cannot import name 'AggFuncType' from 'pandas._typing'

I have included os.system("pip install pandas==1.4.4") in the code block which I made on Microsoft Azure, so I am not sure whether my code is still somehow using an outdated version of Pandas even though I explicitly ask Azure to install Pandas 1.4.4, or whether the error is coming from somewhere else within my code.
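One way to check whether the pip call had any effect is to print which pandas the running interpreter actually imported. The following is only a minimal sketch; describe_pandas is a hypothetical helper name, not something from the original code. Keep in mind that a module which has already been imported is not swapped out by a later pip install in the same process.

```python
import sys

import pandas as pd


def describe_pandas():
    """Report which pandas the running interpreter actually imported.

    Hypothetical helper for diagnosis only.
    """
    return {
        "version": pd.__version__,      # version of the module in memory
        "location": pd.__file__,        # where that module was loaded from
        "interpreter": sys.executable,  # the Python running this code
    }


info = describe_pandas()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the printed version is not the one requested from pip, the install either went to a different environment or happened after pandas had already been loaded.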

Edit

I am using the ML Studio in Microsoft Azure to create a machine learning pipeline.

An example of my code (I changed the variable names but the idea is the same) is new_df = new_df.explode(column = ['name', 'id'], ignore_index = True). In my code, new_df is a pandas DataFrame which contains two columns called name and id which contain lists, alongside 15 or so other columns which contain single values (I got the data using a web scraper). The name column contains a list of names, and the id column contains a list of corresponding IDs. I know for a fact that these lists have the same lengths as this code is able to run without errors locally, and it works as desired.
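For reference, the element-wise result described above can be reproduced without multi-column explode. This is a rough sketch, assuming the lists in each row have equal lengths; explode_together is a hypothetical helper name, and the sample data is made up to mirror the name and id columns.

```python
import numpy as np
import pandas as pd


def explode_together(df, columns):
    """Explode several list columns element-wise (hypothetical helper).

    Assumes every row holds lists of equal length across `columns`.
    """
    lengths = df[columns[0]].map(len).to_numpy()
    for col in columns[1:]:
        if not (df[col].map(len).to_numpy() == lengths).all():
            raise ValueError("list lengths differ between columns")
    # Repeat every non-list row once per list element.
    positions = np.repeat(np.arange(len(df)), lengths)
    out = df.drop(columns=columns).iloc[positions].reset_index(drop=True)
    # Flatten each list column in row order.
    for col in columns:
        out[col] = [item for items in df[col] for item in items]
    # Restore the original column order.
    return out[list(df.columns)]


new_df = pd.DataFrame({'name': [['Alice', 'Bob'], ['Charlie', 'David', 'Eve']],
                       'id': [[1, 2], [3, 4, 5]],
                       'value': [10, 20]})
result = explode_together(new_df, ['name', 'id'])
print(result)
```

Unlike DataFrame.explode, this sketch drops rows whose lists are empty instead of keeping them with NaN.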

Edit

Apparently, the ML Studio in Microsoft Azure requires Pandas version 1.0.4, so the code which I was using to install a newer version of pandas was not actually doing anything: Azure automatically forces you to use 1.0.4 instead of the version which you want.
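Since multi-column explode only exists from Pandas 1.3.0 onward, code that must run both locally and on a pinned environment could check the version first. Here is a small sketch of such a check; supports_multi_column_explode is a hypothetical name, and the parsing only looks at the major and minor numbers.

```python
import re


def supports_multi_column_explode(version):
    """Return True if a pandas version string is 1.3.0 or later.

    Hypothetical helper; only the major and minor numbers are compared.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 3)


for candidate in ("1.0.4", "1.4.4"):
    print(candidate, supports_multi_column_explode(candidate))
```

In real code the argument would be pd.__version__, and the False branch would fall back to an approach that does not pass a list to explode.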

Can you tell me which Azure service you are using to run your pandas code? Also, can you add your code to your question so I can try it in my environment? - SiddheshDesai, Mar 30 '23 at 03:42
Thank you for asking. I have just updated my post with more context. Please let me know if you need any further clarification. - Johemian, Mar 30 '23 at 14:44

score 0 · Answer 1 · answered Mar 31 '23 at 15:12

I tried the code below in my Azure ML Studio to use the explode function and to call the lambda function:-

Code:-

import pandas as pd
import numpy as np
import os

# Create a sample dataframe with a column of lists
df = pd.DataFrame({'name': [['Alice', 'Bob'], ['Charlie', 'David', 'Eve']], 
                   'id': [[1, 2], [3, 4, 5]], 
                   'value': [10, 20]})

# Print the dataframe
print(df)

# Try to use the explode function with pandas version 1.3.0 or later
try:
    df_exploded = df.explode(column=['name', 'id'], ignore_index=True)
    print(df_exploded)
except Exception as e:
    print("Error using explode:", e)

# Try to use a custom lambda function with pandas apply
try:
    agg_func = pd.api.types.AggFuncType
    df_exploded = df.apply(lambda x: pd.Series(list(zip(*x['name'], x['id']))), axis=1).stack().reset_index(level=1, drop=True).to_frame().rename(columns={0: 'name', 1: 'id'}).join(df.drop(['name', 'id'], 1), how='left')
    print(df_exploded)
except Exception as e:
    print("Error using apply:", e)

# Install pandas 1.4.4 using pip (pandas is already imported above, so the
# running process keeps using the version it loaded)
os.system("pip install pandas==1.4.4")

# Check the version of pandas being used
print(pd.__version__)

# Try to use the explode function with pandas version 1.4.4
try:
    df_exploded = df.explode(column=['name', 'id'], ignore_index=True)
    print(df_exploded)
except Exception as e:
    print("Error using explode:", e)

Output ML :-

I received the output, but it threw an exception, as shown below:-

[screenshot of the exception]

I tried changing the pandas version to 1.2.5, 1.1.5 and 1.4.4, but still received the same error as above.

Method 1:-

As an alternative, you can use the code below for the lambda function to avoid the above exception:-

Code:-

import pandas as pd
import numpy as np
import os

# Create a sample dataframe with a column of lists
df = pd.DataFrame({'name': [['Alice', 'Bob'], ['Charlie', 'David', 'Eve']], 
                   'id': [[1, 2], [3, 4, 5]], 
                   'value': [10, 20]})

# Print the dataframe
print(df)

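# Explode one column at a time (single-column explode also exists before pandas 1.3.0).
# Note: chaining does not pair names and ids element-wise; each name ends up
# next to every id from the same row.
try:
    df_exploded = df.explode('name').explode('id').reset_index(drop=True)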
    print(df_exploded)
except Exception as e:
    print("Error using explode:", e)

# Try to use a custom lambda function with pandas apply:
# keep 'value' as the index and explode the two list columns together
try:
    df_exploded = df.set_index('value').apply(lambda col: col.explode()).reset_index()

    print(df_exploded)
except Exception as e:
    print("Error using apply:", e)

# Install pandas 1.4.4 using pip
os.system("pip install pandas==1.4.4")

# Check the version of pandas being used
print(pd.__version__)

# Try to use the explode function with pandas version 1.4.4
try:
    df_exploded = df.explode('name').explode('id').reset_index(drop=True)
    print(df_exploded)
except Exception as e:
    print("Error using explode:", e)

Output ML :-

[screenshot of the output]

Output of the same code on my local machine:-

[screenshot of the output]
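One caveat with Method 1 is worth checking on your own data: exploding the columns one after the other is not the same operation as exploding them together. The sketch below, using the same sample DataFrame, compares the row count of the chained call with the number of name/id pairs the data actually contains. On recent pandas releases I would expect 13 rows against 5 pairs, but treat those numbers as something to verify rather than a guarantee.

```python
import pandas as pd

df = pd.DataFrame({'name': [['Alice', 'Bob'], ['Charlie', 'David', 'Eve']],
                   'id': [[1, 2], [3, 4, 5]],
                   'value': [10, 20]})

# Chained single-column explode: every name is combined with every id
# that came from the same original row.
chained = df.explode('name').explode('id').reset_index(drop=True)

# Number of rows an element-wise explode should produce: one per list item.
expected_rows = int(df['name'].map(len).sum())

print("rows after chained explode:", len(chained))
print("name/id pairs in the data: ", expected_rows)
```

If the two numbers differ, the chained result contains pairings such as Alice with Bob's id, so it should not be used as a drop-in replacement for the multi-column call.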

Method 2:-

You can also build the exploded rows with a function that is run on each row of the DataFrame, as another option.

Code:-

import pandas as pd

# Create a sample dataframe with a column of lists
df = pd.DataFrame({'name': [['Alice', 'Bob'], ['Charlie', 'David', 'Eve']],
                   'id': [[1, 2], [3, 4, 5]],
                   'value': [10, 20]})

# Print the dataframe
print(df)

# Define a function to apply to each row of the dataframe
def explode_row(x):
    names = x['name']
    ids = x['id']
    # The scalar 'value' is broadcast to every name/id pair
    return pd.DataFrame({'name': names, 'id': ids, 'value': x['value']})

# Apply the explode_row function to each row and stack the per-row frames
df_exploded = pd.concat([explode_row(row) for _, row in df.iterrows()], ignore_index=True)

# Print the exploded dataframe
print(df_exploded)

Output ML:-

[screenshot of the output]

Clear output:-

import pandas as pd

# Create a sample dataframe with a column of lists
df = pd.DataFrame({'name': [['Alice', 'Bob'], ['Charlie', 'David', 'Eve']],
                   'id': [[1, 2], [3, 4, 5]],
                   'value': [10, 20]})

# Print the dataframe
print(df)

# Define a function to apply to each row of the dataframe
def explode_row(x):
    names = x['name']
    ids = x['id']
    return pd.DataFrame({'name': names, 'id': ids, 'value': x['value']})

# Explode both list columns together (passing a list of columns needs pandas 1.3.0 or later)
df_exploded = df.explode(column=['name', 'id'], ignore_index=True)

# Print the exploded dataframe
print(df_exploded)

Output ML:-

[screenshot of the output]
