Break out DataFrame column into multiple columns using apply

Question

I am attempting to break out a DataFrame column into several varying ones based upon a function that parses the original column contents. They contain something that my function can turn into a dataframe with varying column names. All columns need to be added to the end of the existing dataframe. The columns should not duplicate in name. The below is a simplified version of what I'm trying to do. It errors out.

EDIT: One point of clarification, please disregard the fact that I have used a dict to form sub_transaction. The sub_transaction column in actuality has a lengthy XML in it that is turned into a DataFrame by parse_subtransaction. The simpler dict was just for example purposes. The important point is that a function must be used to parse it and that function returns a DataFrame.

original dataframe

transaction_id               sub_transaction
          abc1  {'id': 'abc1x', 'total': 10}
          abc2  {'id': 'abc2x', 'total': 20}
          abc3  {'id': 'abc3x', 'total': 30}
          abc4                            {}
          abc5               {'id': 'abc5x'}

desired dataframe outcome

transaction_id  sub_transaction_id  total
abc1                         abc1x     10
abc2                         abc2x     20
abc3                         abc3x     30

import pandas as pd

def parse_subtransaction(sub_transaction):
    return pd.DataFrame({
        'sub_transaction_id': [sub_transaction.get('id')],
        'total': [sub_transaction.get('total')]})

def main():
    df = pd.DataFrame({
        'transaction_id': ['abc1', 'abc2', 'abc3','abc4','abc5'],
        'sub_transaction': [
            {'id': 'abc1x', 'total': 10},
            {'id': 'abc2x', 'total': 20},
            {'id': 'abc3x', 'total': 30},
            {},
            {'id':'abc5x'}]
        })

    applied_df = df.apply(
        lambda row: parse_subtransaction(row['sub_transaction']),
        axis='columns',
        result_type='expand')

# ERROR: ValueError: If using all scalar values, you must pass an index

if (__name__ == "__main__"):
    main()

Does this answer your question? [Split / Explode a column of dictionaries into separate columns with pandas](/q/38231591/4518341) — wjandrea, May 17 '22 at 23:26
I've gone ahead and corrected the extra return for clarity and added an edit explaining that the fact that I used a dict for the example should be ignored. The key is that I need to parse the column using a function and that function returns a DataFrame that I need to append to the end of the original DataFrame. — Dan, May 18 '22 at 01:24

score 1 · Answer 1 · answered May 17 '22 at 23:11

You could accomplish the same using:

df.join(pd.DataFrame(df.sub_transaction.tolist()))
 
  transaction_id               sub_transaction     id  total
0           abc1  {'id': 'abc1x', 'total': 10}  abc1x   10.0
1           abc2  {'id': 'abc2x', 'total': 20}  abc2x   20.0
2           abc3  {'id': 'abc3x', 'total': 30}  abc3x   30.0
3           abc4                            {}    NaN    NaN
4           abc5               {'id': 'abc5x'}  abc5x    NaN

sammywemmy · Answer 2 · 2022-05-17T23:12:05.257

0

One option is with pandas string get:

df.assign(sub_transaction_id = df.sub_transaction.str.get('id'), 
          total = df.sub_transaction.str.get('total'))

  transaction_id               sub_transaction sub_transaction_id  total
0           abc1  {'id': 'abc1x', 'total': 10}              abc1x   10.0
1           abc2  {'id': 'abc2x', 'total': 20}              abc2x   20.0
2           abc3  {'id': 'abc3x', 'total': 30}              abc3x   30.0
3           abc4                            {}               None    NaN
4           abc5               {'id': 'abc5x'}              abc5x    NaN

The apply returns your function per row, which I suspect is not what you want, you probably want a single DataFrame containing your extracts.

edited May 17 '22 at 23:12

answered May 17 '22 at 23:07

sammywemmy

27,093
4
17
31

1

To be clear. I just used a dict for the example. The real value in the column is a lot more complicated XML parse and absolutely requires its own function and returns a DataFrame that I simply need to append to the end of the old DataFrame. – Dan May 17 '22 at 23:10
if that's the case, my usual suggestion is to wrangle everything within Python (outside Pandas), before hooking it back to the main dataframe – sammywemmy May 17 '22 at 23:14

score 0 · Answer 3 · answered May 17 '22 at 23:37

To use your own style:

import pandas as pd

def parse_subtransaction(sub_transaction):
    return ({'sub_transaction_id': sub_transaction.get('id'), 'total': sub_transaction.get('total')})
   

def main():
    df = pd.DataFrame({'transaction_id': ['abc1', 'abc2', 'abc3','abc4','abc5'],
                       'sub_transaction': [{'id': 'abc1x', 'total': 10}, {'id': 'abc2x', 'total': 20},
                                           {'id': 'abc3x', 'total': 30},{},{'id':'abc5x'}]})

    applied_df = df.apply(lambda row: parse_subtransaction(row['sub_transaction']), axis='columns', result_type='expand')
    final_df = pd.concat([df.iloc[: , :-1], applied_df], axis=1)
    print(final_df)
    
main()

score 0 · Answer 4 · answered May 18 '22 at 00:05

parse_subtransaction should return a dict or Series, not a DataFrame.*

def parse_subtransaction(sub_transaction):
    return {
        'sub_transaction_id': sub_transaction.get('id'),
        'total': sub_transaction.get('total')}

Then to rejoin, we can use a variation of joris's solution:

pd.concat([df.drop(columns='sub_transaction'), applied_df], axis=1)

* Although I'm not sure why exactly. I looked at the docs and type annotation, but couldn't find anything that specified the func parameter's return type precisely.

score 0 · Answer 5 · answered May 19 '22 at 03:46

I got answers that catered to the dict scenario, but I really just needed people to assume that the function in the apply always returned a DataFrame as the starting point. In reality, I'm parsing an XML. Here is the solution that ultimately worked:

def parse_xml(xml):
    xml_dict = xmltodict.parse(xml)
    df = pd.json_normalize(xml_dict)
    df.columns = df.columns.str.replace("ns0", "", regex=False)
    df.columns = df.columns.str.replace("@xmlns", "", regex=False)
    df.columns = df.columns.str.replace(":", "", regex=False)
    df.columns = df.columns.str.replace(".", "_", regex=False)
    df.columns = df.columns.str.rstrip('_')
    return df

def parse_xmls(df, col='h_xml'):
    print("Parsing XML's")
    right_df_list = []
    for index, row in df.iterrows():
        xml_df = parse_xml(row['h_xml'])
        xml_dict = xml_df.to_dict()
        right_df_list.append(xml_dict)

    right_df = pd.DataFrame.from_dict(right_df_list, orient='columns')
    right_df = right_df.applymap(lambda col : col[0] if type(col) is dict else col)
    df = pd.merge(df, right_df, left_index=True, right_index=True)
    return df

Break out DataFrame column into multiple columns using apply

5 Answers5