3

I am trying to use panads to manage data that I am pulling from an API. The data has one column that is a nested list and I am trying to extract it out to their own columns.

The data looks like this so far:

id mail displayName propertiesRegistered createdDateTime
00000000-0000-0000-0000-000000000000 joe.user@email.com User, Joe ['address', 'mobilePhone', 'officePhone'] 2023-08-19T15:00:00.00Z

The desired output would look like:

id mail displayName address mobilePhone officePhone homePhone createdDateTime
00000000-0000-0000-0000-000000000000 joe.user@email.com User, Joe TRUE TRUE TRUE FALSE 2023-08-19T15:00:00.00Z

I've tried expanding, series, and pivot tables but cannot seem to figure it out. And not sure I'm even phrasing my question correctly in my searching. A lot of people and examples have the data made into additional rows, which I was able to do, but getting it to a single row is ideal.

Any help is greatly appreciated.

KrunkFu
  • 71
  • 4

4 Answers4

1

You can use df.apply() like in the following:

# Load test data
data = {
    'id': '00000000-0000-0000-0000-000000000000',
    'mail': 'joe.user@email.com',
    'displayName': 'User, Joe',
    'propertiesRegistered': "['address', 'mobilePhone', 'officePhone']",
    'createdDateTime': '2023-08-19T15:00:00.00Z'
}

df = pd.DataFrame(data, index=[0])

Test data

# Use df.apply with this function to expand properties to boolean values
def expand_properties(row):
    # Note that you need to have a precomputed list of all possible properties here
    all_properties = ['address', 'mobilePhone', 'officePhone', 'homePhone']
    for property in all_properties:
        row[property] = property in row['propertiesRegistered']
    return row

new_df = df.apply(expand_properties, axis=1)
new_df = new_df.drop(columns=['propertiesRegistered'])

This is what I see in my Jupyter notebook I used for testing: This is what I see in my Jupyter notebook I used for testing

EDIT: See here if you want to reorder columns in the resulting dataframe.

Jon Strutz
  • 302
  • 4
  • 9
  • Thank you so much! This worked perfectly. I ended up going with the suggestion from mozway since it was one additional line. – KrunkFu Aug 25 '23 at 20:31
1

You can use str.get_dummies:

valid_properties = ['address', 'mobilePhone', 'officePhone', 'homePhone']

df = df.join(df.pop('propertiesRegistered').agg('|'.join).str.get_dummies()
               .reindex(columns=valid_properties, fill_value=0)
               # or astype(bool) for real booleans
               .replace({1: 'TRUE', 0: 'FALSE'})
            )

Or a crosstab:

s = df.pop('propertiesRegistered').explode()
df = df.join(pd.crosstab(s.index, s)
               .reindex(columns=valid_properties, fill_value=0)
               .gt(0)
             )

Output:

                                      id                 mail displayName          createdDateTime address mobilePhone officePhone homePhone
0  00000000-0000-0000-0000-000000000000   joe.user@email.com   User, Joe   2023-08-19T15:00:00.00Z    TRUE        TRUE        TRUE     FALSE
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Thank you! The get_dummies() method worked perfectly and pretty quick too, 800k rows in a little over 1 second. – KrunkFu Aug 25 '23 at 20:32
0

Try:


# apply literal_eval (if necessary):
from ast import literal_eval
df["propertiesRegistered"] = df["propertiesRegistered"].apply(literal_eval)

props = ["address", "mobilePhone", "officePhone", "homePhone"]

df = pd.concat(
    [
        df,
        df.pop("propertiesRegistered")
        .apply(lambda x: {p: p in x for p in props})
        .apply(pd.Series),
    ],
    axis=1,
)
print(df)

Prints:

                                     id                mail displayName          createdDateTime  address  mobilePhone  officePhone  homePhone
0  00000000-0000-0000-0000-000000000000  joe.user@email.com   User, Joe  2023-08-19T15:00:00.00Z     True         True         True      False
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
0

Explode and unstack.

df = pd.DataFrame([
    {
        'id': '00000000-0000-0000-0000-000000000000',
        'mail': 'joe.user@email.com',
        'displayName': 'User, Joe',
        'propertiesRegistered': ['address', 'mobilePhone', 'officePhone'],
        'createdDateTime': pd.Timestamp('2023-08-19T15:00:00.00Z')
    }
])

s = df['propertiesRegistered'].explode().to_frame() \
    .assign(t=True) \
    .set_index('propertiesRegistered', append=True).unstack()
s.columns = s.columns.get_level_values(1)  # flatten by omitting level `t`

This creates a new data frame with the same indices as the old one that contains the elements of propertiesRegistered as columns:

propertiesRegistered  address  mobilePhone  officePhone
0                        True         True         True

Then just join it back with df.join(s). If propertiesRegistered is actually a string, just apply ast.literal_eval to it before doing this reshape.

ifly6
  • 5,003
  • 2
  • 24
  • 47