I have a dataset of one column with 15000 unique Series_Ids. I want to split the dataframe into subsets of 200 rows each and store them as separate dataframes. So there will be a total of 75 datasets.

I just can't think of how to approach this. One way I could do it is by indexing subsets of 200 rows by their row index, but then I would have to do it 75 times.

I don't have any code as such. I'm trying to make a function though.

realr
Rohan Jain

2 Answers

If you want to store each subset as a separate dataframe, I can't think of any other way than looping 75 times. If I were you, I would loop through the original dataframe, grab 200 rows at a time, and store each chunk as a value in a dictionary whose key is built from the loop number. Something like below:

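dict_subsets = {}
for i in range(15000 // 200):  # integer division: range() needs an int
    row_start = i * 200
    row_end = row_start + 200
    # iloc slices by position and excludes row_end, so each chunk has exactly 200 rows
    df_curr = df_original.iloc[row_start:row_end]
    dict_subsets['df_' + str(i)] = df_curr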
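
Since the question asks for a function, the same loop could be wrapped up roughly as follows. This is only a sketch: the name split_into_chunks, its chunk_size parameter, and the stand-in df_original built here are assumptions for illustration, not part of the original data. It steps through the row positions in strides of chunk_size, so a final shorter chunk is kept if the row count is not a multiple of 200.

```python
import pandas as pd


def split_into_chunks(df, chunk_size=200):
    """Return a dict mapping 'df_0', 'df_1', ... to consecutive row chunks."""
    chunks = {}
    for i, start in enumerate(range(0, len(df), chunk_size)):
        # iloc slices by position and excludes the end point
        chunks["df_" + str(i)] = df.iloc[start:start + chunk_size]
    return chunks


# Hypothetical stand-in for the real one-column dataset
df_original = pd.DataFrame({"Series_Id": range(15000)})

dict_subsets = split_into_chunks(df_original, chunk_size=200)
print(len(dict_subsets))          # number of chunks
print(len(dict_subsets["df_0"]))  # rows in the first chunk
```

Each chunk is then available by key, for example dict_subsets["df_74"] for the last one.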
Pavan Tej
  • This works for me. Thanks a lot. I just had to change dict_subsets['df_' + i] = df_curr to dict_subsets['df_' + str(i)] = df_curr because it kept giving me an error: cannot join str and int object – Rohan Jain Aug 02 '19 at 14:42
  • My bad, it should have been str(i), of course. – Pavan Tej Aug 02 '19 at 15:10
  • Edited my original response for anyone referring to it in the future. @RohanJain Please upvote if you feel this helped your question. – Pavan Tej Aug 02 '19 at 15:18

You might be able to use numpy.split, since pandas DataFrames are built on top of numpy arrays:

import pandas as pd
import numpy as np

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
print(df)
#    x  y
# 0  1  4
# 1  2  5
# 2  3  6
n = 3  # number of chunks, not rows per chunk; 15000 // 200 = 75 for you
for df2 in np.split(df, n):
    print(df2)
#    x  y
# 0  1  4
#    x  y
# 1  2  5
#    x  y
# 2  3  6

numpy.split makes every chunk the same size, so the number of rows must be divisible by n. If such a split is not possible, an error (a ValueError) is raised. You can avoid this by either manually adding empty rows (containing NaNs or similar) or by slicing the DataFrame down to a multiple of 200 rows.
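
The "slice it down to a multiple of 200" option might look something like the sketch below. Note that it is a variation on the call above: it splits an array of row positions with numpy.split and then selects rows with iloc, rather than passing the DataFrame itself to numpy. The 15050-row frame and the names chunk_size, usable, subsets and leftover are made up for illustration.

```python
import numpy as np
import pandas as pd

chunk_size = 200

# Made-up frame whose length is NOT a multiple of chunk_size
df = pd.DataFrame({"Series_Id": range(15050)})

n_chunks = len(df) // chunk_size  # number of full chunks
usable = n_chunks * chunk_size    # largest multiple of chunk_size that fits

# Split the row positions into n_chunks equal blocks, then pick rows by position
position_blocks = np.split(np.arange(usable), n_chunks)
subsets = [df.iloc[block] for block in position_blocks]

# Rows that did not fit into a full chunk
leftover = df.iloc[usable:]

print(len(subsets), len(subsets[0]), len(leftover))
```

Whether to drop the leftover rows or keep them as one extra, shorter dataframe depends on what the subsets are used for.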

Graipher