
Given the following DataFrame:

   Category Area               Country Code Function Last Name     LanID  Spend1  Spend2  Spend3  Spend4  Spend5
0      Bisc   EE                  RU02,UA02       Mk     Smith    df3432     1.0     NaN     NaN     NaN     NaN
1      Bisc   EE                       RU02       Mk      Bibs    fdss34     1.0     NaN     NaN     NaN     NaN
2      Bisc   EE               UA02,EURASIA       Mk      Crow   fdsdr43     1.0     NaN     NaN     NaN     NaN
3      Bisc   WE                       FR31       Mk     Ellis   fdssdf3     1.0     NaN     NaN     NaN     NaN
4      Bisc   WE                  BE32,NL31       Mk     Mower   TOZ1720     1.0     NaN     NaN     NaN     NaN
5      Bisc   WE             FR31,BE32,NL31      LKU      Elan   SKY8851     1.0     1.0     1.0     1.0     1.0
6      Bisc   SE                       IT31       Mk    Bobret    3dfsfg     1.0     NaN     NaN     NaN     NaN
7      Bisc   SE                       GR31       Mk   Concept  MOSGX009     1.0     NaN     NaN     NaN     NaN
8      Bisc   SE   RU02,IT31,GR31,PT31,ES31      LKU     Solar   MSS5723     1.0     1.0     1.0     1.0     1.0
9      Bisc   SE        IT31,GR31,PT31,ES31       Mk      Brix    fdgd22     NaN     1.0     NaN     NaN     NaN
10     Choc   CE   RU02,CZ31,SK31,PL31,LT31      Fin    Ocoser    43233d     NaN     1.0     NaN     NaN     NaN
11     Choc   CE        DE31,AT31,HU31,CH31      Fin     Smuth     4rewf     NaN     1.0     NaN     NaN     NaN
12     Choc   CE              BG31,RO31,EMA      Fin    Momocs    hgghg2     NaN     1.0     NaN     NaN     NaN
13     Choc   WE             FR31,BE32,NL31      Fin   Bruntly    ffdd32     NaN     NaN     NaN     NaN     1.0
14     Choc   WE             FR31,BE32,NL31       Mk      Ofer  BROGX011     NaN     1.0     1.0     NaN     NaN
15     Choc   WE             FR31,BE32,NL31       Mk       Hem   NZJ3189     NaN     NaN     NaN     1.0     1.0
16      G&C   NE                  UA02,SE31       Mk       Cre   ORY9499     1.0     NaN     NaN     NaN     NaN
17      G&C   NE                       NO31       Mk      Qlyo   XVM7639     1.0     NaN     NaN     NaN     NaN
18      G&C   NE   GB31,NO31,SE31,IE31,FI31       Mk      Omny   LOX1512     NaN     1.0     1.0     NaN     NaN

I would like to get it exported into a nested Dict with the below structure:

    {RU02:  {Bisc:  {EE:    {Mkt:   {Spend1:    {df3432:    Smith}
                                                {fdss34:     Bibs}
            {Bisc:  {SE:    {LKU:   {Spend1:    {MSS5723:   Solar}
                                    {Spend2:    {MSS5723:   Solar}
                                    {Spend3:    {MSS5723:   Solar}
                                    {Spend4:    {MSS5723:   Solar}
                                    {Spend5:    {MSS5723:   Solar}
            {Choc:  {CE:    {Fin:   {Spend2:    {43233d:   Ocoser}
            .....

    {UA02:  {Bisc:  {EE:    {Mkt:   {Spend1:    {df3432:    Smith}
                                                {ffdsdr43:   Crow}
            {G&C:   {NE:    {Mkt:   {Spend1:    {ORY9499:     Cre}
    .....

So essentially, in this Dict I'm trying to track, for each Country Code, the list of Last Names + LanIDs per Spend category (Spend1, Spend2, etc.) and their attributes (Function, Category, Area).

The DataFrame is not very large (fewer than 200 rows), but it contains almost all possible combinations of Category/Area/Country Code, as well as Last Names and their Spend categories (many-to-many).

My challenge is that I'm unable to figure out how to clearly conceptualise the steps I need to take to prepare the DataFrame for export to a Dict.

What I've figured out so far is that I need to:

  1. split the contents of the "Country Code" column on the "," separator: DONE
  2. create new columns based on the unique Country Codes, and put a 1 in each row where that country code is present: DONE
  3. set the index of the DataFrame recursively to each of the newly added columns
  4. move into a new DataFrame the rows for each Country Code where there is data
  5. export all the new DataFrames to Dicts, and then merge them
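As an aside, steps 1 and 2 can be done in a single call with `Series.str.get_dummies`, which splits on a separator and builds one 0/1 indicator column per unique value. A minimal sketch on a toy frame (the data here is illustrative, not the real DataFrame):

```python
import pandas as pd

# Toy stand-in for the real "Country Code" column
df = pd.DataFrame({'Country Code': ['RU02,UA02', 'RU02', 'UA02,EURASIA']})

# Split on "," and create one indicator column per unique country code
dummies = df['Country Code'].str.get_dummies(sep=',')
df = df.join(dummies)
```

This avoids both the manual list comprehension and the per-code `str.contains` loop.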

Not sure if steps 3-5 are the best way to go about this though, as I'm still having difficulty understanding how pd.DataFrame.to_dict should be configured for my case (if that's even possible)...

I'd highly appreciate help on the coding side, but also a brief explanation of your thought process for each stage.


Here is how far I got on my own:

# keep track of the initial column order
initialOrder = list(df.columns.values)

# split each "Country Code" entry on ","
CCodeNoCommas = [item for items in df['Country Code'].values for item in items.split(",")]

# add only the UNIQUE Country Codes -via set- as new columns in the DataFrame,
#with NaN for row values
df = pd.concat([df,pd.DataFrame(columns=list(set(CCodeNoCommas)))])

# reordering columns to have the newly added ones at the end
reordered = initialOrder + [c for c in df.columns if c not in initialOrder]
df = df[reordered]


# replace NaN with 1 in the newly added columns (Country Codes), where the same Country code
# exists in the initial column "Country Code"; do this for each row

CCodeUniqueOnly = set(CCodeNoCommas)
for c in CCodeUniqueOnly:
    # match the code as a literal substring, not a regex
    CCodeIsPresent_rowIndex = df.index[df['Country Code'].str.contains(c, regex=False)]
    df.loc[CCodeIsPresent_rowIndex, c] = 1

# no clue what to do next ??


If you re-shape your dataframe into the right format, you can use the handy recursive dictionary function from the answer by @DSM to this question. The goal is to get a dataframe where each row contains only one "entry" - a unique combination of the columns you're interested in.

First, you need to split your country code strings into lists:

df['Country Code'] = df['Country Code'].str.split(',')

And then expand those lists into multiple rows (using @RomanPekar's technique from this question):

s = df.apply(lambda x: pd.Series(x['Country Code']),axis=1) \
    .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)
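On pandas 0.25 or newer, the same list-to-rows expansion is a one-liner with `DataFrame.explode`. A sketch on a toy frame (illustrative data, not the real DataFrame):

```python
import pandas as pd

df = pd.DataFrame({'Country Code': ['RU02,UA02', 'FR31'],
                   'Last Name': ['Smith', 'Ellis']})

# Split each entry into a list, then give each list element its own row
df['Country Code'] = df['Country Code'].str.split(',')
df = df.explode('Country Code').reset_index(drop=True)
```

The other columns are duplicated for each exploded row, exactly as the apply/stack/join approach above produces.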

Then you can reshape the Spend* columns into rows, so that there is one row for each Spend* column whose value is not NaN.

spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
    .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
    .reset_index(level=1)['level_1'])) \
    .reset_index(drop=True)
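An equivalent and arguably clearer reshape uses `DataFrame.melt` and then drops the NaN entries. Note that in this sketch the resulting column is named 'Spend' rather than 'level_1', so you would use that name in the nesting order instead (toy data, illustrative only):

```python
import pandas as pd

spend_cols = ['Spend1', 'Spend2']
df = pd.DataFrame({'LanID': ['MSS5723', 'fdgd22'],
                   'Spend1': [1.0, None],
                   'Spend2': [1.0, 1.0]})

# One row per (LanID, Spend*) pair; NaN spends are discarded
long = df.melt(id_vars=['LanID'], value_vars=spend_cols,
               var_name='Spend').dropna(subset=['value'])
```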

Now you have a dataframe where each level in your nested dictionary is its own column. So you can use this recursive dictionary function:

def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0])
    # .iloc replaces the long-deprecated .ix for positional slicing
    return {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}
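A quick self-contained sanity check of how the recursion nests, on a toy frame (hypothetical data; `.iloc` is used in place of the deprecated `.ix`):

```python
import pandas as pd

def recur_dictify(frame):
    # one column left: return the leaf value(s)
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    # otherwise group on the first column and recurse on the rest
    grouped = frame.groupby(frame.columns[0])
    return {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}

# Two grouping levels ('a', 'b') and a leaf value column ('c')
toy = pd.DataFrame({'a': ['x', 'x', 'y'],
                    'b': ['p', 'q', 'p'],
                    'c': [1, 2, 3]})
d = recur_dictify(toy)
# d == {'x': {'p': 1, 'q': 2}, 'y': {'p': 3}}
```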

And apply it only to the columns you want to produce the nested dictionary, listed in the order in which they should nest:

cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])

That should produce your desired result.


All in one piece:

df['Country Code'] = df['Country Code'].str.split(',')
s = df.apply(lambda x: pd.Series(x['Country Code']),axis=1) \
    .stack().reset_index(level=1, drop=True)
s.name = 'Country Code'
df = df.drop('Country Code', axis=1).join(s).reset_index(drop=True)

spend_cols = ['Spend1', 'Spend2', 'Spend3', 'Spend4', 'Spend5']
df = df.groupby('Country Code') \
    .apply(lambda g: g.join(pd.DataFrame(g[spend_cols].stack()) \
    .reset_index(level=1)['level_1'])) \
    .reset_index(drop=True)

def recur_dictify(frame):
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0])
    # .iloc replaces the long-deprecated .ix for positional slicing
    return {k: recur_dictify(g.iloc[:, 1:]) for k, g in grouped}

cols = ['Country Code', 'Category', 'Area', 'Function', 'level_1', 'LanID', 'Last Name']
d = recur_dictify(df[cols])