0

I have a dataframe with some columns delimited with '|', and I need to flatten this dataframe. Example:

name  type
a      l
b      m
c|d|e  n

For this df, I want to flatten it to:

   name type
    a    l
    b    m
    c    n
    d    n
    e    n

To do this, I used this command:

df = df.assign(name=df.name.str.split('|')).explode(column).drop_duplicates()

Now, I want do one more thing besides above flatten operation:

   name type  co_occur
    a    l
    b    m
    c    n    d
    c    n    e
    d    n    e  

That is, not only split the 'c|d|e' into two rows, but also create a new column which contains a 'co_occur' relationship, in which 'c' and 'd' and 'e' co-occur with each other.

I don't see an easy way to do this by modifying:

df = df.assign(name=df.name.str.split('|')).explode(column).drop_duplicates()
marlon
  • 6,029
  • 8
  • 42
  • 76
  • if you have 4 original values like `c|d|e|f`, you want to end up with all possible combinations (6 in this case) instead of the 4 rows that your current methods would give? – Ben.T Nov 05 '21 at 19:46
  • Perhaps this link might help you. (https://stackoverflow.com/questions/48010030/create-multiple-new-columns-based-on-pipe-delimited-column-in-pandas) – Talal Siddiqui Nov 05 '21 at 19:48
  • @Ben.T, yes, put individual ones on each row and create non-duplicate relationships between them, meaning that c->d only, not d->c again. – marlon Nov 05 '21 at 20:02

1 Answers1

0

I think this is what you want. Use combinations and piece everything together

from itertools import combinations
import io

data = '''name  type
a      l
b      m
c|d|e  n
j|k    o
f|g|h|i    p
'''
df = pd.read_csv(io.StringIO(data), sep=' \s+', engine='python')

# hold the new dataframes as you iterate via apply()
df_hold = []
def explode_combos(x):
    combos = list(combinations(x['name'].split('|'),2))
    # print(combos)
    # print(x['type'])
    df_hold.append(pd.DataFrame([{'name':c[0], 'type':x['type'], 'co_cur': c[1]} for c in combos]))
    return

# only apply() to those rows that need to be exploded
dft = df[df['name'].str.contains('\|')].apply(explode_combos, axis=1)
# concatenate the result
dfn = pd.concat(df_hold)
# add back to rows that weren't operated on (see the ~)
df_final = pd.concat([df[~df['name'].str.contains('\|')], dfn]).fillna('')

  name type co_cur
0    a    l
1    b    m
0    c    n      d
1    c    n      e
2    d    n      e
0    j    o      k
0    f    p      g
1    f    p      h
2    f    p      i
3    g    p      h
4    g    p      i
5    h    p      i
Jonathan Leon
  • 5,440
  • 2
  • 6
  • 14