
I have a pandas DataFrame that contains multiple comma-separated values across 3 columns.

DataFrame:

import pandas as pd

df = pd.DataFrame({'City': ['Boston', 'Atlanta', 'Chicago', 'Chicago', 'Phoenix'],
                   'State': ['MA', 'GA', 'IL', 'IL', 'AZ'],
                   'Country': ['US', 'US', 'US', 'US', 'US'],
                   'Value1': ['a', 'a,b,c', 'a,b,c,d', 'a', 'a,b'],
                   'Value2': ['b', 'd,e,f', 'e,f,g', 'b,c', 'c,d,e'],
                   'Value3': ['c', 'g,h,i', 'h,i,j', 'd', 'f,g,h,i']
                   })


What I want:

I'd like to split it into duplicate rows, so that City, State, and Country are duplicated while Value1, Value2, and Value3 are split on commas across the new rows.

[image: desired output with blanks] OR [image: desired output with N/A]

As the desired output above shows, if the number of values doesn't match I'd like to just put a blank or an N/A in the field instead. This is purely based on the position of the element, so position 1 of Value1 matches with position 1 of Value2 and Value3.

The issue I'm having is that there's no guarantee that Value1, Value2, and Value3 will contain the same number of comma-separated values, so trying to use df.explode() raises errors.

A simpler solution might be to just pad the cells with trailing commas before exploding, but I'm unsure how to do that. For example, make [a,b,c] [d,e] [f] go to [a,b,c] [d,e,] [f,,]? I'm at my wits' end trying to do this. Any help would be super appreciated.
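
Roughly, I imagine something like this untested sketch (a hypothetical pad_with_commas helper that pads each row's cells with trailing commas up to that row's longest list):

def pad_with_commas(row, cols=('Value1', 'Value2', 'Value3')):
    row = row.copy()
    # number of items in the longest comma-separated cell of this row
    max_len = max(row[c].count(',') + 1 for c in cols)
    for c in cols:
        # append trailing commas so every cell splits into max_len items
        row[c] = row[c] + ',' * (max_len - row[c].count(',') - 1)
    return row

# padded = df.apply(pad_with_commas, axis=1)
# now every Value cell in a row splits to the same length, with '' in the gaps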


3 Answers


A slow-and-dirty way is to process each row separately. The key idea is to exploit the fact that the default DataFrame constructor accepts a list of lists with unequal lengths (i.e., a ragged array) and pads the shorter ones.
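
For instance (a toy example, unrelated to the data above), the constructor pads shorter sublists with None:

pd.DataFrame([["a", "b", "c"], ["d"]])
#    0     1     2
# 0  a     b     c
# 1  d  None  None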

Code

n_vals = 3  # Value 1,2,3
ls_df0 = []  # fragments produced by each row to be concatenated

for tup in df.itertuples(index=False):
    # collect split strings
    ls = [getattr(tup, f"Value{i}").split(",") for i in range(1, n_vals + 1)]
    # convert a list of unequal-length lists into a dataframe
    df0 = pd.DataFrame(ls).transpose()
    df0["City"] = tup.City
    df0["State"] = tup.State
    df0["Country"] = tup.Country
    ls_df0.append(df0)

df_want = pd.concat(ls_df0)\
    .rename(columns={i: f"Value{i+1}" for i in range(n_vals)})\
    .fillna("N/A")\
    [df.columns]

Result

print(df_want)
      City State Country Value1 Value2 Value3
0   Boston    MA      US      a      b      c
0  Atlanta    GA      US      a      d      g
1  Atlanta    GA      US      b      e      h
2  Atlanta    GA      US      c      f      i
0  Chicago    IL      US      a      e      h
1  Chicago    IL      US      b      f      i
2  Chicago    IL      US      c      g      j
3  Chicago    IL      US      d    N/A    N/A
0  Chicago    IL      US      a      b      d
1  Chicago    IL      US    N/A      c    N/A
0  Phoenix    AZ      US      a      c      f
1  Phoenix    AZ      US      b      d      g
2  Phoenix    AZ      US    N/A      e      h
3  Phoenix    AZ      US    N/A    N/A      i

Note: I have also tried df.explode(), which is supposed to be the standard way of expanding a list into multiple rows. Unfortunately, it won't work across multiple columns here, so I doubt whether a fast-and-clean way exists.
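
For reference, even the multi-column explode added in pandas 1.3 doesn't help here, because it requires matching element counts per row (a small demonstration, assuming pandas >= 1.3):

cols = ["Value1", "Value2", "Value3"]
# split the strings into lists first, since explode leaves scalars alone
split_df = df.assign(**{c: df[c].str.split(",") for c in cols})
split_df.explode(cols)
# ValueError: columns must have matching element counts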


Hopefully this can add to the quest for a solution... This approach uses pd.Series.explode, but it appears the explode only works when all lists in the exploded columns share the same length, so the lists are padded to the global maximum rather than to each row's own maximum:

import numpy as np

cols = ['Value1', 'Value2', 'Value3']

def get_max_len(df):
    # length of the longest comma-separated list across all Value columns
    return max(df[col].apply(lambda x: len(x.split(','))).max() for col in cols)

def extend_lists(col):
    # NB: applied with axis=0, so this receives a whole column, not a row.
    # Padding each column to its own max doesn't work: the exploded columns
    # would then have different lengths per row, and the final apply fails.
    max_len = get_max_len(df)  # so pad every list to the global max instead
    return [x.split(',') + [np.nan] * (max_len - len(x.split(','))) for x in col]

df2 = pd.concat([df[['City', 'State', 'Country']],
                 df[cols].apply(extend_lists, axis=0)],
                axis=1)

# the pure-padding rows are all-NaN, so drop them after exploding
df2.apply(pd.Series.explode).dropna(subset=cols, how='all')

#       City State Country Value1 Value2 Value3
# 0   Boston    MA      US      a      b      c
# 1  Atlanta    GA      US      a      d      g
# 1  Atlanta    GA      US      b      e      h
# 1  Atlanta    GA      US      c      f      i
# 2  Chicago    IL      US      a      e      h
# 2  Chicago    IL      US      b      f      i
# 2  Chicago    IL      US      c      g      j
# 2  Chicago    IL      US      d    NaN    NaN
# 3  Chicago    IL      US      a      b      d
# 3  Chicago    IL      US    NaN      c    NaN
# 4  Phoenix    AZ      US      a      c      f
# 4  Phoenix    AZ      US      b      d      g
# 4  Phoenix    AZ      US    NaN      e      h
# 4  Phoenix    AZ      US    NaN    NaN      i

You can split and extend the values to the same max length per row with a user-defined function, then apply that function followed by pd.Series.explode to your data.

To avoid creating temporary dataframes and concatenating them afterwards, you can use the non-value columns as the index, then reset_index after the manipulation.

# Define a function to split and extend with an empty string (you can change it to np.nan to extend with NaN)
def split_extended(row, extension=''):
    values = row.str.split(',')
    max_len = values.map(len).max()
    value_list = values.map(lambda x: x + [extension] * (max_len - len(x)))
    return value_list

# Select the non-value columns to use them as index
non_value_cols = df.columns[~df.columns.str.startswith('Value')].tolist()

# Update the dataframe to get your desired output
df = df.set_index(non_value_cols).apply(split_extended, axis=1).apply(pd.Series.explode).reset_index()
print(df)

Output:

       City State Country Value1 Value2 Value3
0    Boston    MA      US      a      b      c
1   Atlanta    GA      US      a      d      g
2   Atlanta    GA      US      b      e      h
3   Atlanta    GA      US      c      f      i
4   Chicago    IL      US      a      e      h
5   Chicago    IL      US      b      f      i
6   Chicago    IL      US      c      g      j
7   Chicago    IL      US      d              
8   Chicago    IL      US      a      b      d
9   Chicago    IL      US             c       
10  Phoenix    AZ      US      a      c      f
11  Phoenix    AZ      US      b      d      g
12  Phoenix    AZ      US             e      h
13  Phoenix    AZ      US                    i

This code returns the first target output from your question. You can replace the empty strings with np.nan by changing the function's default value to extension=np.nan, or by using df.replace('', np.nan).
