0

I am struggling with trying to split my dataframe into 3 new dataframes where the split occurs when the supplier name changes. I have searched for existing questions. How to split dataframe on based on columns row and Pandas & python: split dataframe into many dataframes based on column value containing substring came close but I cannot get the output that I want.

Here is attached a toy dataset to illustrate my problem:

df = pd.DataFrame({'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2', 'Supplier2', 'Supplier3','Supplier3'], 'Class' : ['A', 'A','A','A','A','B','B']})

and I tried (unsuccessfully)

df1 = df.iloc[:df.index[df['Supplier'] == 'Supplier1'].tolist()[0]]
df2 = df.iloc[df.index[df['Supplier'] == 'Supplier2'].tolist()[0]+1:]
df3 = df.iloc[df.index[df['Supplier'] == 'Supplier3'].tolist()[0]+1:]

The result that I am trying to achieve is:

   Supplier Class
0  Supplier1     A
1  Supplier1     A
    Supplier Class
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
    Supplier Class
0  Supplier3     B
1  Supplier3     B

Any help on this would be greatly appreciated. Thank you!

UPDATE: Using:

df1 = {i:group for i,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift()).cumsum() )}

gives:

{1:     Supplier Class
0  Supplier1     A
1  Supplier1     A, 2:     Supplier Class
2  Supplier2     A
3  Supplier2     A
4  Supplier2     A, 3:     Supplier Class
5  Supplier3     B
6  Supplier3     B}

I need to split into separated dataframe so I did:

df3 = pd.DataFrame.from_dict({i:group for i,group in df1.groupby(df1['Supplier'].ne(df1['Supplier'].shift()).cumsum() )},orient='index', columns= ['Class'])

but it gives the error

 df3 = pd.DataFrame.from_dict({i:group for i,group in df1.groupby(df1['Supplier'].ne(df1['Supplier'].shift()).cumsum() )},orient='index', columns= ['Class'])
AttributeError: 'dict' object has no attribute 'groupby'
Murcielago
  • 905
  • 1
  • 8
  • 30

3 Answers3

3

to create dataframe for unique supplier values:

dict(zip(df.groupby('Supplier')))

to create dataframe every time the value in the supplier column changes:

dfs = {i:group.reset_index(drop=True) 
       for i,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift()).cumsum() )}

Update

obtaining three separate dataframes is incompatible with wanting to use pd.DataFrame(..) in the end, this will obviously create a unique dataframe, therefore my solution was to create a dataframe dictionary where each one is accessed for an integer value of 1 to n. We can reset the index for each of them, simply by doing:

{i:group.reset_index(drop=True) for i,group in df.groupby( df['supplier'].ne(df['supplier'].shift()).cumsum() )}

we can obtain a single data frame with the index restored every time there is a change in value in the provider column using pd.concat as @anky_91 suggested

dfs_concat = pd.concat([group.reset_index(drop=True) 
                        for _,group in df.groupby( df['Supplier'].ne(df['Supplier'].shift())
                                                                 .cumsum() )])
print(dfs_concat)

    Supplier Class
0  Supplier1     A
1  Supplier1     A
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
0  Supplier3     B
1  Supplier3     B

but if the latter is the solution sought we could simply use groupby.cumcount

df.index = df.groupby(df['Supplier'].ne(df['Supplier'].shift()).cumsum()).cumcount()
print(df)

  Supplier Class
0  Supplier1     A
1  Supplier1     A
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
0  Supplier3     B
1  Supplier3     B
ansev
  • 30,322
  • 5
  • 17
  • 31
  • this second does the work! but gives a dictionary, pd.DataFrame.from_dict does not work – Murcielago Jan 01 '20 at 17:08
  • could you show the output of `pd.DataFrame.from_dict` editing your question? thanks !:) – ansev Jan 01 '20 at 17:13
  • 1
    another one matching the index which OP shows is `pd.concat([i.reset_index(drop=True) for _,i in df.groupby('Supplier')])` , but not sure why do they have the columns for each iteration as a row – anky Jan 01 '20 at 17:32
0

Try this,

df = pd.DataFrame({'Supplier': ['Supplier1', 'Supplier1', 'Supplier2', 'Supplier2', 'Supplier2', 'Supplier3','Supplier3'], 'Class' : ['A', 'A','A','A','A','B','B']})


df1 = df[df.Supplier=='Supplier1']
df2 = df[df.Supplier=='Supplier2']
df3 = df[df.Supplier=='Supplier3']

Or you can do,

new_df=df.pivot(columns='Supplier')

to get many columns if you have 'suppliers'.

Output:

Supplier Supplier1 Supplier2 Supplier3
0                A       NaN       NaN
1                A       NaN       NaN
2              NaN         A       NaN
3              NaN         A       NaN
4              NaN         A       NaN
5              NaN       NaN         B
6              NaN       NaN         B
merit_2
  • 461
  • 5
  • 16
0

I believe this achieves the splitting you want:

groups = [group.reset_index()[['Supplier', 'Class']] for _, group in df.groupby('Supplier')]

You can get the exact output of your example with

for group in groups:
    print(group)

Output:

    Supplier Class
0  Supplier1     A
1  Supplier1     A
    Supplier Class
0  Supplier2     A
1  Supplier2     A
2  Supplier2     A
    Supplier Class
0  Supplier3     B
1  Supplier3     B
chthonicdaemon
  • 19,180
  • 2
  • 52
  • 66
  • This does not provide any new ideas, nor does it give the desired output because you have not set drop = True – ansev Jan 01 '20 at 18:17