Split large Dataframe into smaller equal dataframes

Question

I have a large time-series dataframe. I would like to write a function that will arbitrarily split this large dataframe into N contiguous subperiods as new dataframes so that analysis may easily be done on each smaller dataframe.

I have this line of code that splits the large dataframe into even subperiods. I need a function that will output these split dataframes.

np.array_split(df, n) #n = arbitrary amount of new dataframes

I would like each new dataframe to be labeled 1, 2, 3, 4, etc. for the subperiod that it represents. That is, the function should return N dataframes, each labeled by its position in time within the initial large dataframe.

df before the function is applied
 1    43.91 -0.041619
 2    43.39  0.011913
 3    45.56 -0.048801
 4    45.43  0.002857
 5    45.33  0.002204
 6    45.68 -0.007692
 7    46.37 -0.014992
 8    48.04 -0.035381
 9    48.38 -0.007053

3 new dfs after the split function is applied
df1
 1    43.91 -0.041619
 2    43.39  0.011913
 3    45.56 -0.048801
df2
 4    45.43  0.002857
 5    45.33  0.002204
 6    45.68 -0.007692
df3
 7    46.37 -0.014992
 8    48.04 -0.035381
 9    48.38 -0.007053

Please let me know if clarification is needed for any aspects. Thanks for the time!

Can you add some sample data with 10 rows and expected output for `chunkSize= 3` ? — jezrael, Sep 02 '19 at 07:44
Make up your mind. Do you have a *DataFrame* (probably with a single column) or a *Series*? — Valdi_Bo, Sep 02 '19 at 07:48
I revised a bit and added example of dataframe. I have a simple line of code that will split the DataFrame. — hkml, Sep 02 '19 at 07:53

score 9 · Accepted Answer · answered Sep 02 '19 at 08:17

It is not clear from your description whether you are aware that np.array_split already returns a list of n objects. If there are only a few objects you could assign them manually, for example:

df1, df2, df3 = np.array_split(df, 3)

This assigns the pieces to these variables in order. Otherwise you could assign the whole list of pieces to a single variable:

split_df = np.array_split(df, 3)
len(split_df)
# 3

Then loop over this one variable and do your analysis per piece. I would personally choose the latter.

for sub_df in split_df:
    print(type(sub_df))

This prints <class 'pandas.core.frame.DataFrame'> three times.
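A minimal sketch of the loop approach with numbered subperiods, for illustration only. The column names `a` and `b` and the helper name `split_contiguous` are assumptions, not part of the question. The helper splits the row positions with np.array_split and slices with `iloc`, which should give the same piece sizes as calling np.array_split on the frame directly.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the column names "a" and "b" are assumed.
df = pd.DataFrame(
    {
        "a": [43.91, 43.39, 45.56, 45.43, 45.33, 45.68, 46.37, 48.04, 48.38],
        "b": [-0.041619, 0.011913, -0.048801, 0.002857, 0.002204,
              -0.007692, -0.014992, -0.035381, -0.007053],
    },
    index=range(1, 10),
)


def split_contiguous(frame, n):
    """Split frame into n contiguous pieces of near-equal length (hypothetical helper)."""
    # Split the row positions rather than the frame itself, then slice with iloc.
    positions = np.array_split(np.arange(len(frame)), n)
    return [frame.iloc[pos] for pos in positions]


# Label each subperiod 1..n and compute one statistic per subperiod.
means = {}
for label, sub_df in enumerate(split_contiguous(df, 3), start=1):
    means[label] = sub_df["a"].mean()
```

Here `means` maps each subperiod number to the mean of column `a` in that piece; any other per-piece analysis can go in the loop body instead.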

Just found this out! Thanks for the loop over tip - will make things go a lot more quickly. — hkml, Sep 02 '19 at 08:20

jezrael · Answer 2 · 2019-09-02T08:26:26.730

Use:

print (df)
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801
4  45.43  0.002857
5  45.33  0.002204
6  45.68 -0.007692
7  46.37 -0.014992
8  48.04 -0.035381
9  48.38 -0.007053


# Note: despite the name, chunkSize is the number of pieces, not the rows per piece.
def split(df, chunkSize = 30):
    return np.array_split(df, chunkSize)
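If a fixed number of rows per piece is wanted instead (as the 10-row, `chunkSize= 3` case in the comments suggests), a rough sketch could slice with `iloc` in steps. The function name `split_by_rows` and the one-column sample frame are made up for this example.

```python
import pandas as pd


def split_by_rows(frame, rows_per_chunk):
    """Return consecutive slices of frame with at most rows_per_chunk rows each."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    return [
        frame.iloc[start:start + rows_per_chunk]
        for start in range(0, len(frame), rows_per_chunk)
    ]


# Hypothetical 10-row frame, split into pieces of 3 rows.
df = pd.DataFrame({"a": range(10)})
chunks = split_by_rows(df, 3)
sizes = [len(c) for c in chunks]  # the last chunk holds the remainder
```

With 10 rows and 3 rows per piece this yields four pieces, the last one holding the single leftover row, so the pieces are not equal in size the way the np.array_split result tries to be.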

It is possible, but not recommended:

for i, g in enumerate(split(df, 3), 1):
    globals()['df{}'.format(i)] =  g
print (df1)
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801

It is better to select each DataFrame by indexing:

dfs = split(df, 3)
print (dfs[0])
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801

It is also possible to create a dictionary of DataFrames, but in my opinion this is really overcomplicated:

def split1(df, chunkSize = 30):
    return {'df_{}'.format(i): g 
              for i, g in enumerate(np.array_split(df, chunkSize), 1)}

dfs = split1(df, 3)
print (dfs)
{'df_1':        a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801, 'df_2':        a         b
4  45.43  0.002857
5  45.33  0.002204
6  45.68 -0.007692, 'df_3':        a         b
7  46.37 -0.014992
8  48.04 -0.035381
9  48.38 -0.007053}

print (dfs['df_1'])
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801
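If the goal is mainly to keep the 1, 2, 3 labels attached to the data, one possible variation is to key the dictionary by integers and stack the pieces back together with pd.concat, so that the label becomes an index level. This is only a sketch: the level names `period` and `row` and the single column `a` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# One column of the sample data; the column name "a" is assumed.
df = pd.DataFrame(
    {"a": [43.91, 43.39, 45.56, 45.43, 45.33, 45.68, 46.37, 48.04, 48.38]},
    index=range(1, 10),
)

n = 3
# Split row positions into n contiguous groups, then slice the frame with iloc.
positions = np.array_split(np.arange(len(df)), n)
pieces = {label: df.iloc[pos] for label, pos in enumerate(positions, start=1)}

# Concatenating a dict uses its keys as an outer index level.
labeled = pd.concat(pieces, names=["period", "row"])

# Per-subperiod statistics without a loop.
period_means = labeled.groupby(level="period")["a"].mean()
```

`labeled.loc[2]` then returns the second subperiod, and `pieces[2]` gives the same rows as a standalone DataFrame.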

Much appreciated for all the ways to get this done. — hkml, Sep 02 '19 at 08:21
