
I have a large time-series dataframe. I would like to write a function that will arbitrarily split this large dataframe into N contiguous subperiods as new dataframes so that analysis may easily be done on each smaller dataframe.

I have this line of code that splits the large dataframe into even subperiods. I need a function that will output these split dataframes.

np.array_split(df, n)  # n = number of new dataframes

I would like each new dataframe to be labeled 1, 2, 3, 4, etc. for the subperiod it represents. The function should return N dataframes, each labeled by its position in time within the original large dataframe.

df before the function is applied
 1    43.91 -0.041619
 2    43.39  0.011913
 3    45.56 -0.048801
 4    45.43  0.002857
 5    45.33  0.002204
 6    45.68 -0.007692
 7    46.37 -0.014992
 8    48.04 -0.035381
 9    48.38 -0.007053

3 new dfs after the split function is applied
df1
 1    43.91 -0.041619
 2    43.39  0.011913
 3    45.56 -0.048801
df2
 4    45.43  0.002857
 5    45.33  0.002204
 6    45.68 -0.007692
df3
 7    46.37 -0.014992
 8    48.04 -0.035381
 9    48.38 -0.007053

Please let me know if clarification is needed for any aspects. Thanks for the time!

hkml
  • Can you add some sample data with 10 rows and expected output for `chunkSize= 3` ? – jezrael Sep 02 '19 at 07:44
  • Make up your mind. Do you have a *DataFrame* (probably with a single column) or a *Series*? – Valdi_Bo Sep 02 '19 at 07:48
  • I revised a bit and added example of dataframe. I have a simple line of code that will split the DataFrame. – hkml Sep 02 '19 at 07:53

2 Answers


From your description I can't tell whether you are aware that np.array_split already returns a list of n objects. If there are only a few, you can assign them manually, for example:

import numpy as np
df1, df2, df3 = np.array_split(df, 3)

This assigns the sub-DataFrames to these variables in order. Otherwise you could assign the whole list to a single variable:

split_df = np.array_split(df, 3)
len(split_df)
# 3

then loop over this one variable and do your analysis per subarray. I would personally choose the latter.

for chunk in split_df:  # avoid shadowing the built-in name `object`
    print(type(chunk))

This prints <class 'pandas.core.frame.DataFrame'> three times.
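To make the per-subperiod loop concrete, here is a minimal sketch that numbers each piece with enumerate and runs the same calculation on each one. The data is copied from the question, but the column names `price` and `change` are made up for illustration, and the sketch slices by position with iloc instead of passing the DataFrame itself to np.array_split; that is just a choice for this example, not something the question requires.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the column names are hypothetical.
df = pd.DataFrame(
    {
        "price": [43.91, 43.39, 45.56, 45.43, 45.33, 45.68, 46.37, 48.04, 48.38],
        "change": [-0.041619, 0.011913, -0.048801, 0.002857, 0.002204,
                   -0.007692, -0.014992, -0.035381, -0.007053],
    },
    index=range(1, 10),
)

# Split the row positions into 3 contiguous groups, then slice by position.
positions = np.array_split(np.arange(len(df)), 3)
split_df = [df.iloc[pos] for pos in positions]

# Label the subperiods 1, 2, 3 and run the same analysis on each.
summary = {}
for label, chunk in enumerate(split_df, start=1):
    summary[label] = chunk["price"].mean()
    print(label, len(chunk), summary[label])
```

The labels here live only in the loop variable and the `summary` keys, so no separately named variables are needed.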

Ronny Efronny
  • Just found this out! Thanks for the loop over tip - will make things go a lot more quickly. – hkml Sep 02 '19 at 08:20

Use:

print (df)
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801
4  45.43  0.002857
5  45.33  0.002204
6  45.68 -0.007692
7  46.37 -0.014992
8  48.04 -0.035381
9  48.38 -0.007053


def split(df, chunkSize = 30):
    # chunkSize is passed on as the number of pieces, not the rows per piece
    return np.array_split(df, chunkSize)

Assigning each piece to its own global variable is possible, but not recommended:

for i, g in enumerate(split(df, 3), 1):
    globals()['df{}'.format(i)] =  g
print (df1)
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801

It is better to select each DataFrame by indexing:

dfs = split(df, 3)
print (dfs[0])
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801
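Note that in split above, the chunkSize argument ends up as the number of pieces, so split(df, 3) gives 3 DataFrames rather than DataFrames of 3 rows. If a fixed number of rows per piece is what you want, something like the following sketch might do; `split_by_rows` and `rows_per_chunk` are hypothetical names, and the 10-row frame is made up only to show an uneven split.

```python
import pandas as pd

def split_by_rows(df, rows_per_chunk):
    """Return contiguous slices of df holding at most rows_per_chunk rows each."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    return [df.iloc[start:start + rows_per_chunk]
            for start in range(0, len(df), rows_per_chunk)]

# Hypothetical 10-row frame: the last piece holds the single leftover row.
df10 = pd.DataFrame({"a": range(10)}, index=range(1, 11))
chunks = split_by_rows(df10, 3)
print([len(c) for c in chunks])  # [3, 3, 3, 1]
```

With np.array_split the sizes are evened out for you; with a fixed row count, any remainder ends up in the last piece.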

It is also possible to create a dictionary, but in my opinion that is overcomplicated:

def split1(df, chunkSize = 30):
    return {'df_{}'.format(i): g 
              for i, g in enumerate(np.array_split(df, chunkSize), 1)}

dfs = split1(df, 3)
print (dfs)
{'df_1':        a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801, 'df_2':        a         b
4  45.43  0.002857
5  45.33  0.002204
6  45.68 -0.007692, 'df_3':        a         b
7  46.37 -0.014992
8  48.04 -0.035381
9  48.38 -0.007053}

print (dfs['df_1'])
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801
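If the goal of the labels is mostly to run the same analysis per subperiod, another option (not part of the approaches above, just a sketch) is to keep everything in one DataFrame and put the subperiod number into the index with pd.concat and its keys argument. The level names `subperiod` and `row` are made up for illustration, and the pieces are again sliced by position with iloc.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [43.91, 43.39, 45.56, 45.43, 45.33, 45.68, 46.37, 48.04, 48.38],
        "b": [-0.041619, 0.011913, -0.048801, 0.002857, 0.002204,
              -0.007692, -0.014992, -0.035381, -0.007053],
    },
    index=range(1, 10),
)

n = 3
positions = np.array_split(np.arange(len(df)), n)
chunks = [df.iloc[pos] for pos in positions]

# Stack the pieces back together with the subperiod number as an outer index level.
labeled = pd.concat(chunks, keys=list(range(1, n + 1)), names=["subperiod", "row"])

second = labeled.loc[2]  # the rows belonging to subperiod 2
per_period_mean = labeled.groupby(level="subperiod")["a"].mean()
print(per_period_mean)
```

This keeps the 1, 2, 3 labels attached to the data itself, so a single groupby covers every subperiod.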
jezrael