
I have a DataFrame that looks similar to this (it has about 300k rows):

from pandas import DataFrame

df = DataFrame(dict(name  = ['jon', 'jon', 'dany', 'dany', 'mindy', 'mindy', 'mindy'],
                    power = [1, 2, 2, 4, 5, 5, 7],
                    rank  = ['a', 'b', 'c', 'd', 'r', 'a', 'g']))

which gives this:

    name  power rank
0    jon      1    a
1    jon      2    b
2   dany      2    c
3   dany      4    d
4  mindy      5    r
5  mindy      5    a
6  mindy      7    g

What I want is a list of DataFrames (subsets) like this:

df_list = [df_1, df_2, df_3]

where df_1, df_2, df_3 are essentially these:

df_1 = df.query("name == 'jon'")
df_2 = df.query("name == 'dany'")
df_3 = df.query("name == 'mindy'")

In the dataset that I'm working with there are 500+ names. How do I do this efficiently?

ultron
  • Related and probable dupes: https://stackoverflow.com/questions/41825281/how-to-split-dataframe-in-pandas and https://stackoverflow.com/questions/23691133/split-pandas-dataframe-based-on-groupby – EdChum Aug 24 '17 at 11:10

2 Answers


Here's one way.

In [1497]: df_list = [x[1] for x in df.groupby('name', sort=False)]

In [1498]: df_list[0]
Out[1498]:
  name  power rank
0  jon      1    a
1  jon      2    b

In [1499]: df_list[1]
Out[1499]:
   name  power rank
2  dany      2    c
3  dany      4    d

In [1500]: df_list[2]
Out[1500]:
    name  power rank
4  mindy      5    r
5  mindy      5    a
6  mindy      7    g

But it's better to store them as a dict:

In [1501]: {g: v for g, v in df.groupby('name', sort=False)}
Out[1501]:
{'dany':    name  power rank
 2  dany      2    c
 3  dany      4    d, 'jon':   name  power rank
 0  jon      1    a
 1  jon      2    b, 'mindy':     name  power rank
 4  mindy      5    r
 5  mindy      5    a
 6  mindy      7    g}

In [1502]: df_dict = {g: v for g, v in df.groupby('name', sort=False)}

In [1503]: df_dict['jon']
Out[1503]:
  name  power rank
0  jon      1    a
1  jon      2    b
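
If only a few of the 500+ groups are needed at any one time, you can also skip materializing the dict and pull single groups on demand with GroupBy.get_group. A minimal sketch:

gb = df.groupby('name', sort=False)
gb.get_group('jon')   # same sub-frame as df_dict['jon'] above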
Zero

You can try doing this:

import pandas as pd
df = pd.DataFrame(dict(name  = ['jon', 'jon', 'dany', 'dany', 'mindy', 'mindy', 'mindy'],
                       power = [1, 2, 2, 4, 5, 5, 7],
                       rank  = ['a', 'b', 'c', 'd', 'r', 'a', 'g']))

dfs = []
for each in df.name.unique():
    dfs.append(df.loc[df.name == each,:])

Alternatively, you can use NumPy to do this:

import numpy as np
dfs2 = []
array = df.values
for each in np.unique(array[:,0]):
    dfs2.append(pd.DataFrame(array[array[:,0] == each,:]))

Speed comparison between the above two methods:

import pandas as pd
import numpy as np
from time import time

df = pd.DataFrame(dict(name  = ['jon', 'jon', 'dany', 'dany', 'mindy', 'mindy', 'mindy'],
                       power = [1, 2, 2, 4, 5, 5, 7],
                       rank  = ['a', 'b', 'c', 'd', 'r', 'a', 'g']))

t0 = time()
dfs = []
for each in df.name.unique():
    dfs.append(df.loc[df.name == each,:])

t1 = time()
dfs2 = []
array = df.values
for each in np.unique(array[:,0]):
    dfs2.append(pd.DataFrame(array[array[:,0] == each,:]))

t2 = time()

t1 - t0 #0.003524303436279297
t2 - t1 #0.0016787052154541016
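
These timings come from the seven-row example frame, so they say little about a 300k-row dataset. A sketch for benchmarking at a more realistic size (the synthetic frame below, with 300k rows spread across 500 names, is an assumption made to match the question, not data from it):

np.random.seed(0)
n = 300000
big = pd.DataFrame(dict(name  = np.random.randint(0, 500, n).astype(str),
                        power = np.random.randint(0, 10, n),
                        rank  = np.random.choice(list('abcdefg'), n)))

t0 = time()
dfs = [big.loc[big.name == each, :] for each in big.name.unique()]
t1 = time()

dfs2 = []
array = big.values
for each in np.unique(array[:, 0]):
    dfs2.append(pd.DataFrame(array[array[:, 0] == each, :]))
t2 = time()

print(t1 - t0, t2 - t1)  # pandas mask loop vs. NumPy loop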

On this toy frame NumPy comes out faster, and it can be helpful in your case since you have a large dataset. Be aware, though, that rebuilding the sub-frames from the raw array loses the original column names and dtypes.
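
If the labels matter, here is a minimal sketch (assuming, as the code above does, that 'name' is the first column of df.values) that passes them back in:

dfs2 = []
array = df.values
for each in np.unique(array[:, 0]):
    # columns=df.columns restores the original column labels on each sub-frame
    dfs2.append(pd.DataFrame(array[array[:, 0] == each, :], columns=df.columns))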

Clock Slave