0

I am attempting to break out a DataFrame column into several varying ones based upon a function that parses the original column contents. They contain something that my function can turn into a dataframe with varying column names. All columns need to be added to the end of the existing dataframe. The columns should not duplicate in name. The below is a simplified version of what I'm trying to do. It errors out.

EDIT: One point of clarification, please disregard the fact that I have used a dict to form sub_transaction. The sub_transaction column in actuality has a lengthy XML in it that is turned into a DataFrame by parse_subtransaction. The simpler dict was just for example purposes. The important point is that a function must be used to parse it and that function returns a DataFrame.

original dataframe

transaction_id               sub_transaction
          abc1  {'id': 'abc1x', 'total': 10}
          abc2  {'id': 'abc2x', 'total': 20}
          abc3  {'id': 'abc3x', 'total': 30}
          abc4                            {}
          abc5               {'id': 'abc5x'}

desired dataframe outcome

transaction_id  sub_transaction_id  total
abc1                         abc1x     10
abc2                         abc2x     20
abc3                         abc3x     30
import pandas as pd

def parse_subtransaction(sub_transaction):
    return pd.DataFrame({
        'sub_transaction_id': [sub_transaction.get('id')],
        'total': [sub_transaction.get('total')]})

def main():
    df = pd.DataFrame({
        'transaction_id': ['abc1', 'abc2', 'abc3','abc4','abc5'],
        'sub_transaction': [
            {'id': 'abc1x', 'total': 10},
            {'id': 'abc2x', 'total': 20},
            {'id': 'abc3x', 'total': 30},
            {},
            {'id':'abc5x'}]
        })

    applied_df = df.apply(
        lambda row: parse_subtransaction(row['sub_transaction']),
        axis='columns',
        result_type='expand')

# ERROR: ValueError: If using all scalar values, you must pass an index

if (__name__ == "__main__"):
    main()
Dan
  • 151
  • 6
  • 1
    Beside the point, but `return df` is unreachable – wjandrea May 17 '22 at 23:08
  • Does this answer your question? [Split / Explode a column of dictionaries into separate columns with pandas](/q/38231591/4518341) – wjandrea May 17 '22 at 23:26
  • I've gone ahead and corrected the extra return for clarity and added an edit explaining that the fact that I used a dict for the example should be ignored. The key is that I need to parse the column using a function and that function returns a DataFrame that I need to append to the end of the original DataFrame. – Dan May 18 '22 at 01:24

5 Answers5

1

You could accomplish the same using:

df.join(pd.DataFrame(df.sub_transaction.tolist()))
 
  transaction_id               sub_transaction     id  total
0           abc1  {'id': 'abc1x', 'total': 10}  abc1x   10.0
1           abc2  {'id': 'abc2x', 'total': 20}  abc2x   20.0
2           abc3  {'id': 'abc3x', 'total': 30}  abc3x   30.0
3           abc4                            {}    NaN    NaN
4           abc5               {'id': 'abc5x'}  abc5x    NaN
Onyambu
  • 67,392
  • 3
  • 24
  • 53
0

One option is with pandas string get:

df.assign(sub_transaction_id = df.sub_transaction.str.get('id'), 
          total = df.sub_transaction.str.get('total'))

  transaction_id               sub_transaction sub_transaction_id  total
0           abc1  {'id': 'abc1x', 'total': 10}              abc1x   10.0
1           abc2  {'id': 'abc2x', 'total': 20}              abc2x   20.0
2           abc3  {'id': 'abc3x', 'total': 30}              abc3x   30.0
3           abc4                            {}               None    NaN
4           abc5               {'id': 'abc5x'}              abc5x    NaN

The apply returns your function per row, which I suspect is not what you want, you probably want a single DataFrame containing your extracts.

sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • 1
    To be clear. I just used a dict for the example. The real value in the column is a lot more complicated XML parse and absolutely requires its own function and returns a DataFrame that I simply need to append to the end of the old DataFrame. – Dan May 17 '22 at 23:10
  • if that's the case, my usual suggestion is to wrangle everything within Python (outside Pandas), before hooking it back to the main dataframe – sammywemmy May 17 '22 at 23:14
0

To use your own style:

import pandas as pd

def parse_subtransaction(sub_transaction):
    return ({'sub_transaction_id': sub_transaction.get('id'), 'total': sub_transaction.get('total')})
   

def main():
    df = pd.DataFrame({'transaction_id': ['abc1', 'abc2', 'abc3','abc4','abc5'],
                       'sub_transaction': [{'id': 'abc1x', 'total': 10}, {'id': 'abc2x', 'total': 20},
                                           {'id': 'abc3x', 'total': 30},{},{'id':'abc5x'}]})

    applied_df = df.apply(lambda row: parse_subtransaction(row['sub_transaction']), axis='columns', result_type='expand')
    final_df = pd.concat([df.iloc[: , :-1], applied_df], axis=1)
    print(final_df)
    
main()
rlpatrao
  • 565
  • 1
  • 8
  • 15
0

parse_subtransaction should return a dict or Series, not a DataFrame.*

def parse_subtransaction(sub_transaction):
    return {
        'sub_transaction_id': sub_transaction.get('id'),
        'total': sub_transaction.get('total')}

Then to rejoin, we can use a variation of joris's solution:

pd.concat([df.drop(columns='sub_transaction'), applied_df], axis=1)

* Although I'm not sure why exactly. I looked at the docs and type annotation, but couldn't find anything that specified the func parameter's return type precisely.

wjandrea
  • 28,235
  • 9
  • 60
  • 81
0

I got answers that catered to the dict scenario, but I really just needed people to assume that the function in the apply always returned a DataFrame as the starting point. In reality, I'm parsing an XML. Here is the solution that ultimately worked:

def parse_xml(xml):
    xml_dict = xmltodict.parse(xml)
    df = pd.json_normalize(xml_dict)
    df.columns = df.columns.str.replace("ns0", "", regex=False)
    df.columns = df.columns.str.replace("@xmlns", "", regex=False)
    df.columns = df.columns.str.replace(":", "", regex=False)
    df.columns = df.columns.str.replace(".", "_", regex=False)
    df.columns = df.columns.str.rstrip('_')
    return df

def parse_xmls(df, col='h_xml'):
    print("Parsing XML's")
    right_df_list = []
    for index, row in df.iterrows():
        xml_df = parse_xml(row['h_xml'])
        xml_dict = xml_df.to_dict()
        right_df_list.append(xml_dict)

    right_df = pd.DataFrame.from_dict(right_df_list, orient='columns')
    right_df = right_df.applymap(lambda col : col[0] if type(col) is dict else col)
    df = pd.merge(df, right_df, left_index=True, right_index=True)
    return df
Dan
  • 151
  • 6