3

I would like to take a dataframe and duplicate certain rows. One column, called name, may have multiple names. An example dataframe is contructed below:

data = [
    ['Joe', '17-11-2018', '2'],
    ['Karen', '17-11-2018', '4'],
    ['Bill, Avery', '17-11-2018', '6'],
    ['Sam', '18-11-2018', '4'],
    ['Alex, Frank', '18-11-2018', '6'],
    ['Chris', '18-11-2018', '8'],
]
df = pd.DataFrame(data, columns = ['name','date','number'])

This yields the following dataframe:

          name        date number
0          Joe  17-11-2018      2
1        Karen  17-11-2018      4
2  Bill, Avery  17-11-2018      6
3          Sam  18-11-2018      4
4  Alex, Frank  18-11-2018      6
5        Chris  18-11-2018      8

I would like to take all rows where there are multiple names (comma-separated) and duplicate them for each individual name. The resulting dataframe should look like this:

    name        date number
0    Joe  17-11-2018      2
1  Karen  17-11-2018      4
2   Bill  17-11-2018      6
3  Avery  17-11-2018      6
4    Sam  18-11-2018      4
5   Alex  18-11-2018      6
6  Frank  18-11-2018      6
7  Chris  18-11-2018      8
Jack Walsh
  • 562
  • 4
  • 14

4 Answers4

3

After str.split , it become a unnest problem

df['name']=df.name.str.split(',')

unnesting(df,['name'])
Out[97]: 
     name        date number
0     Joe  17-11-2018      2
1   Karen  17-11-2018      4
2    Bill  17-11-2018      6
2   Avery  17-11-2018      6
3     Sam  18-11-2018      4
4    Alex  18-11-2018      6
4   Frank  18-11-2018      6
5   Chris  18-11-2018      8

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, 1), how='left')
BENY
  • 317,841
  • 20
  • 164
  • 234
  • I receive `ValueError: count < 0` as a response – Jack Walsh May 14 '19 at 23:00
  • @JackWalsh did you try our method with the sample data you provided to us ? If that is work for you as well, however, failed with your actual data , you should tell us what is the different between the sample data and your real data – BENY May 14 '19 at 23:03
  • I realized that the difference is that my dataset has some null values in the `name` equivalent column. I have implemented my own solution based off of yours, and for this reason will make this as the answer. – Jack Walsh May 14 '19 at 23:33
1

Jack. I don't use dataframes much, but the following code should work before df = pd.DataFrame(data, columns = ['name','date','number'])

new_data = []
for item in data:
    if "," in item[0]:
        new_data.append([item[0].split(", ")[0], item[1], item[2]])
        new_data.append([item[0].split(", ")[1], item[1], item[2]])
    else:
        new_data.append(item)
TechPerson
  • 320
  • 1
  • 8
  • This would've worked well, but the dataframe I provided was an example. My real dataset is not created in the same way. I was able to use an approach similar to this using `.iterrows()` – Jack Walsh May 14 '19 at 23:34
1

Update 2023

All the answers have old methods, we now have Series.explode method, which can unnest a list. So the modern way to do this is:

df.assign(name=df["name"].str.split(", ")).explode("name", ignore_index=True)

    name        date number
0    Joe  17-11-2018      2
1  Karen  17-11-2018      4
2   Bill  17-11-2018      6
3  Avery  17-11-2018      6
4    Sam  18-11-2018      4
5   Alex  18-11-2018      6
6  Frank  18-11-2018      6
7  Chris  18-11-2018      8

Old answer

For a string with a separator you can use the following function found in this answer:

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

explode_str(df, 'name', ',')

     name        date number
0     Joe  17-11-2018      2
1   Karen  17-11-2018      4
2    Bill  17-11-2018      6
2   Avery  17-11-2018      6
3     Sam  18-11-2018      4
4    Alex  18-11-2018      6
4   Frank  18-11-2018      6
5   Chris  18-11-2018      8
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • I receive the following error... `ValueError: count < 0` on the line which defines `i` – Jack Walsh May 14 '19 at 22:57
  • On your original data or your example dataframe? Because i'm not getting an error @JackWalsh – Erfan May 14 '19 at 23:02
  • I realized my original dataframe had a null value that was giving me an issue. While your solution and @WeNYoBen 's solutions were very similar, his was clearer to me. – Jack Walsh May 14 '19 at 23:36
0

I believe I read something like this. As soon as I locate the link, I will share it.

from itertools import chain
import time
from numpy.matlib import randn
from pandas import DataFrame as df
import numpy as np
import pandas as pd 
from itertools import chain
from numpy.matlib import randn
from pandas import DataFrame as df
import numpy as np
import pandas as pd 
import re

data = [
    ['J', '17-11-2018', '2'],
    ['K', '17-11-2018', '4'],
    ['B, A', '17-11-2018', '6'],
    ['S', '18-11-2018', '4'],
    ['L, F', '18-11-2018', '6'],
    ['C', '18-11-2018', '8'],
]
df = pd.DataFrame(data, columns = ['P','Q','R'])
print(df)

"""
      P           Q  R
0     J  17-11-2018  2
1     K  17-11-2018  4
2  B, A  17-11-2018  6
3     S  18-11-2018  4
4  L, F  18-11-2018  6
5     C  18-11-2018  8
                    
"""


m1 = (lambda col:pd.Series(col).str.split(','))
aa = df.set_index(['R','Q']).apply(m1)
print(aa)
"""
                   P
R Q                  
2 17-11-2018      [J]
4 17-11-2018      [K]
6 17-11-2018  [B,  A]
4 18-11-2018      [S]
6 18-11-2018  [L,  F]
8 18-11-2018      [C]

"""
res = aa.explode('P').reset_index().reindex(df.columns,axis=1)
print(res)

"""
  P           Q  R
0   J  17-11-2018  2
1   K  17-11-2018  4
2   B  17-11-2018  6
3   A  17-11-2018  6
4   S  18-11-2018  4
5   L  18-11-2018  6
6   F  18-11-2018  6
7   C  18-11-2018  8

"""
Soudipta Dutta
  • 1,353
  • 1
  • 12
  • 7