
I have a dataframe like the one given below:

import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    'readings': ['READ_1','READ_2','READ_1','READ_3','READ_1','READ_5','READ_6','READ_8','READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
    'val': [5, 6, 7, 11, 5, 7, 16, 12, 13, 56, 32, 13, 45, 43, 46],
})

What I would like to do is create multiple dataframes out of this (df1, df2). In real time it doesn't have to be two; it can be 10 or 20 depending on my data size. I ask this because I intend to do parallel processing: I will divide my one huge df into multiple small dataframes and process them in parallel.

For example, df1 should contain all the records of 2 (two) subjects and df2 should contain all the records of the remaining 2 subjects.

I tried this, but it isn't right:

grouped = df.groupby('subject_id')
df1 = grouped.filter(lambda x: x['subject_id']== 2)
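
(For context: groupby.filter expects the function to return a single boolean per group, so a lambda returning a Series is rejected, and even a working filter only keeps rows, it does not split the frame. A minimal sketch of hand-picking subjects with isin instead, assuming the subject ids for each chunk are already known:)

df1 = df[df['subject_id'].isin([1, 2])]   # all records of subjects 1 and 2
df2 = df[df['subject_id'].isin([3, 4])]   # all records of subjects 3 and 4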

I expect my output to be like this

df1 - contains all records of 2 subjects. In real time, I would select 100 subjects and have all their records in one dataframe.

[expected df1: all records of the first two subjects, subject_id 1 and 2]

df2 - contains all records of the remaining 2 subjects, again with 100 subjects per dataframe in real time.

[expected df2: all records of the remaining two subjects, subject_id 3 and 4]

As you can see, there is a clear segregation of data by subject, and no subject's data appears in multiple dataframes. For example, subject_id = 1 has data only in df1.


The Great
  • Both the answers below were good and provide the expected output. I don't know which is best; I can mark only one, so I will go with Jezrael's answer as it had more upvotes. Nonetheless, both answers were upvoted – The Great Oct 18 '19 at 10:20

2 Answers

2

Create a list of DataFrames with groupby and integer division.

If subject_id values are consecutive integers starting from 1, divide the shifted ids by N, the number of subjects per DataFrame:

N = 2
dfs = [x for _, x in df.groupby((df['subject_id'] - 1) // N)]
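
To see how the integer division forms the chunks, a quick check on the sample data (each run of N = 2 consecutive subject ids maps to the same chunk number):

print(((df['subject_id'] - 1) // N).tolist())
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]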

General solution working with any values in subject_id - e.g. strings:

Use factorize to get a zero-based integer array, then integer-divide it by N, the number of subjects per DataFrame:

N = 2

print (pd.factorize(df['subject_id'])[0])
[0 0 0 0 1 1 1 1 2 2 3 3 3 3 3]

print (pd.factorize(df['subject_id'])[0] // N )
[0 0 0 0 0 0 0 0 1 1 1 1 1 1 1]
dfs = [x for _,x in df.groupby(pd.factorize(df['subject_id'])[0] // N)]
print (dfs)
[   subject_id readings  val
0           1   READ_1    5
1           1   READ_2    6
2           1   READ_1    7
3           1   READ_3   11
4           2   READ_1    5
5           2   READ_5    7
6           2   READ_6   16
7           2   READ_8   12,     subject_id readings  val
8            3  READ_10   13
9            3  READ_12   56
10           4  READ_11   32
11           4  READ_14   13
12           4  READ_09   45
13           4  READ_08   43
14           4  READ_07   46]

print (dfs[0])
   subject_id readings  val
0           1   READ_1    5
1           1   READ_2    6
2           1   READ_1    7
3           1   READ_3   11
4           2   READ_1    5
5           2   READ_5    7
6           2   READ_6   16
7           2   READ_8   12

jezrael
  • What's `factorize` for? Why not `[chunk for _, chunk in df.groupby((df['subject_id']-1) // 2)]` – Dan Oct 18 '19 at 10:08
  • @Dan - Just added – jezrael Oct 18 '19 at 10:09
  • Can I know how the `divide by N` works here? `pd.factorize(df['subject_id'])[0] // N`? – The Great Oct 18 '19 at 10:12
  • @SSMK - Can you check [this](https://stackoverflow.com/a/183870) ? – jezrael Oct 18 '19 at 10:14
  • Thank you. What does this underscore in `x for _,x` imply? – The Great Oct 18 '19 at 10:16
  • @SSMK - Looping over `groupby` yields (key, DataFrame) pairs; `_` is used for the unimportant variable, here the group key coming from factorize. You can check it with `print([i for i,x in df.groupby(pd.factorize(df['subject_id'])[0] // 2)])` - it returns `0, 1` for the sample data, because there are 2 groups – jezrael Oct 18 '19 at 10:19
  • Hi Jezrael, can I check with you whether `N` is just used to indicate the number of resultant dataframes that we would like to create? Am I right? It's not about how many subjects are in each group – The Great Oct 21 '19 at 04:02
  • It can be 20 or 200 or 2000 based on my requirement. Doesn't have anything to do with the `subject_id`. Am I right? Or `//N` will create a problem? – The Great Oct 21 '19 at 04:04
  • Is it possible to do this without `dividing`? Because sometimes my `subject_ids` are values like `12345678`,`234567890`,`55555`,`345678` – The Great Oct 21 '19 at 04:12
  • In a case like this, it all becomes part of the same dataframe. – The Great Oct 21 '19 at 04:13
  • @SSMK - It is the number of unique `subject_id` per DataFrame. If the ids are not integers like `1,2,3,...`, then use the general solution with `factorize` - it converts your `subject_id` to `0,1,2...` values first – jezrael Oct 21 '19 at 05:02
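
Since the stated goal is parallel processing, here is a minimal sketch of feeding the chunk list dfs to multiprocessing.Pool; process_chunk is a hypothetical placeholder for the real per-chunk work:

from multiprocessing import Pool

def process_chunk(chunk):
    # hypothetical per-chunk work, e.g. a per-subject aggregation
    return chunk.groupby('subject_id')['val'].sum()

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # each small DataFrame from dfs is processed in its own worker
        results = pool.map(process_chunk, dfs)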
1

I think you need pd.cut to make groups:

n = 2
groups = pd.cut(df['subject_id'], n)
for i, group in df.groupby(groups):
    print(group)

   subject_id readings  val
0           1   READ_1    5
1           1   READ_2    6
2           1   READ_1    7
3           1   READ_3   11
4           2   READ_1    5
5           2   READ_5    7
6           2   READ_6   16
7           2   READ_8   12
    subject_id readings  val
8            3  READ_10   13
9            3  READ_12   56
10           4  READ_11   32
11           4  READ_14   13
12           4  READ_09   45
13           4  READ_08   43
14           4  READ_07   46

If you want, you can save the groups in a list:

# saving the groups in a list
df_list = [group for i, group in df.groupby(groups)]
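
A small usage sketch of the saved list, assuming n = 2 as above (note that pd.cut bins on the numeric range of subject_id, so it assumes numeric ids):

# with n = 2 the list holds the two chunks in subject order
df1, df2 = df_list
print(df1['subject_id'].unique())   # [1 2]
print(df2['subject_id'].unique())   # [3 4]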
ansev