
I have a dataframe "bb" like this:

Response                                Unique Count
I love it so much!                      246_0    1
This is not bad, but can be better.     246_1    2
Well done, let's do it.                 247_0    1

If Count is larger than 1, I would like to split the string so that the dataframe "bb" becomes this (expected result):

Response                                Unique
I love it so much!                      246_0    
This is not bad                         246_1_0    
but can be better.                      246_1_1
Well done, let's do it.                 247_0

My code:

import numpy as np
import pandas as pd
from pandas import DataFrame

bb = DataFrame(bb[bb['Count'] > 1].Response.str.split(',').tolist(),
               index=bb[bb['Count'] > 1].Unique).stack()
bb = bb.reset_index()[[0, 'Unique']]
bb.columns = ['Response', 'Unique']
bb = bb.replace('', np.nan)
bb = bb.dropna()
print(bb)

But the result is like this:

           Response  Unique
0  This is not bad    246_1
1  but can be better. 246_1

How can I keep the unsplit rows from the original dataframe in this case?
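For anyone who wants to run the snippets in this question, the dataframe above can be rebuilt like this (a minimal sketch; the column names are taken from the code in the question):

```python
import pandas as pd

# rebuild the sample dataframe "bb" shown above
bb = pd.DataFrame({
    'Response': ['I love it so much!',
                 'This is not bad, but can be better.',
                 "Well done, let's do it."],
    'Unique': ['246_0', '246_1', '247_0'],
    'Count': [1, 2, 1],
})
print(bb)
```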

Gonçalo Peres
Lara19

2 Answers


First split only the rows that satisfy the condition into a new helper Series, then append counter values created by GroupBy.cumcount, but only for the duplicated index values identified with Index.duplicated:

# split only the rows where Count > 1 into a helper Series
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
# join the split pieces back, falling back to the original Response where no split happened
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])

# append a counter suffix only where the index is duplicated
mask = df1.index.duplicated(keep=False)
df1.loc[mask, 'Unique'] += df1[mask].groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print(df1)
              Response   Unique
0   I love it so much!    246_0
1      This is not bad  246_1_0
2   but can be better.  246_1_1
3           Well done!    247_0
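With the sample frame built inline, the snippet above runs end-to-end like this (a self-contained sketch; `df` is assumed to hold the question's data, so the last row differs from the output shown above):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'Response': ['I love it so much!',
                 'This is not bad, but can be better.',
                 "Well done, let's do it."],
    'Unique': ['246_0', '246_1', '247_0'],
    'Count': [1, 2, 1],
})

# split only the rows where Count > 1 into a helper Series
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])

# suffix a counter only on the duplicated index values
mask = df1.index.duplicated(keep=False)
df1.loc[mask, 'Unique'] += df1[mask].groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print(df1)
```

Note that splitting on `','` keeps the leading space of the second piece ("&nbsp;but can be better."); chain `.str.strip()` on Response if that matters.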

EDIT: If you need the _0 suffix for all the other values as well, remove the mask:

s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])

df1['Unique'] += df1.groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print(df1)
              Response   Unique
0   I love it so much!  246_0_0
1      This is not bad  246_1_0
2   but can be better.  246_1_1
3           Well done!  247_0_0
jezrael
    Very neat, thank you! Did you add "_0" in the end in other Unique too while you edited a few minutes ago? I think it's a good idea :) – Lara19 Jul 08 '19 at 08:04

We can solve this problem step-wise as follows:

  1. Split your dataframe by Count.
  2. Use the explode_str function (defined below) to explode the string into rows.
  3. Group by the index and use cumcount to build the correct Unique column values.
  4. Finally, concat the dataframes back together.

df1 = df[df['Count'].ge(2)] # all rows which have a count 2 or higher
df2 = df[df['Count'].eq(1)] # all rows which have count 1

df1 = explode_str(df1, 'Response', ',') # explode the string to rows on comma delimiter

# Create the correct unique column
df1['Unique'] = df1['Unique'] + '_' + df1.groupby(df1.index).cumcount().astype(str)

df = pd.concat([df1, df2]).sort_index().drop('Count', axis=1).reset_index(drop=True)
print(df)
              Response   Unique
0   I love it so much!    246_0
1      This is not bad  246_1_0
2   but can be better.  246_1_1
3           Well done!    247_0

Function used from linked answer:

import numpy as np

def explode_str(df, col, sep):
    # repeat each row once per split piece, then reassign the
    # column with the flattened list of pieces
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
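Put together, the steps above run end-to-end like this (a self-contained sketch; the sample frame is rebuilt inline from the question, so the last row differs from the output shown above):

```python
import numpy as np
import pandas as pd

def explode_str(df, col, sep):
    # repeat each row once per split piece, then reassign the
    # column with the flattened list of pieces
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

# sample data from the question
df = pd.DataFrame({
    'Response': ['I love it so much!',
                 'This is not bad, but can be better.',
                 "Well done, let's do it."],
    'Unique': ['246_0', '246_1', '247_0'],
    'Count': [1, 2, 1],
})

df1 = df[df['Count'].ge(2)]   # rows with Count >= 2
df2 = df[df['Count'].eq(1)]   # rows with Count == 1

df1 = explode_str(df1, 'Response', ',')   # one row per split piece
df1['Unique'] = df1['Unique'] + '_' + df1.groupby(df1.index).cumcount().astype(str)

out = pd.concat([df1, df2]).sort_index().drop('Count', axis=1).reset_index(drop=True)
print(out)
```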
Erfan