188

I have a large dataframe with 423,244 lines. I want to split it into 4. I tried the following code, which gave an error: `ValueError: array split does not result in an equal division`

for item in np.split(df, 4):
    print(item)

How do I split this dataframe into 4 groups?

Nilani Algiriyage

10 Answers

339

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation.  The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8), 'D' : np.random.randn(8)})

In [4]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [5]: np.array_split(df, 3)
Out[5]:
[     A    B         C         D
0  foo  one -0.174067 -0.608579
1  bar  one -0.860386 -1.210518
2  foo  two  0.614102  1.689837,
      A      B         C         D
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861,
      A      B         C         D
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468]
ChrisGPT was on strike
root
  • Thanks very much! In addition to this, I want to apply some function to each group. How do I access the groups one by one? – Nilani Algiriyage Jun 26 '13 at 09:13
  • @NilaniAlgiriyage -- `array_split` returns a list of DataFrames, so you can just loop through the list... – root Jun 26 '13 at 09:16
  • I'm splitting the dataframe since it is too large. I want to take the first group and apply the function, then the second group and apply the function, etc., so how do I access each group? – Nilani Algiriyage Jun 26 '13 at 09:23
  • @NilaniAlgiriyage -- I am not sure what you mean. You can access each df the same way you would any list element... If you continue having problems, please ask a new question and specify the function that you are trying to use on the dfs... – root Jun 26 '13 at 09:34
  • Thanks root, I have accepted your answer :)! This is what I wanted... `group = np.array_split(df, 3)`, `group[0].apply(func)`, etc. – Nilani Algiriyage Jun 26 '13 at 09:49
  • How do you not get an AttributeError, as DataFrame has no 'size'? – Boosted_d16 Nov 06 '14 at 17:37
  • This answer is outdated: `AttributeError: 'DataFrame' object has no attribute 'size'` – Tjorriemorrie Jan 29 '15 at 12:34
  • Answer works great! Very tight and concise (i.e. Pythonic) for looping over chunks of data frames when doing heavy computations – openwonk Jul 21 '17 at 18:05
  • Why does this work? I realize `np.split` works on arrays and a dataframe is an np array under the hood, but how come the function returns a list of dataframes instead of arrays? Seems a bit magical? – Alex Sep 18 '18 at 23:44
  • Any ideas on the limit of rows for using the NumPy array_split method? I have a dataframe with 6m+ rows and would like to split it into 20 or so chunks, but this method just appears to be taking forever. Thanks – Jo Costa Jan 29 '19 at 11:29
  • @root, can I split df by ratio? For example, 6:1 – sailfish009 Sep 28 '19 at 00:19
  • I want to split a big dataframe into smaller dataframes of 50 records each; how to achieve this? – Sithija Piyuman Thewa Hettige Dec 08 '21 at 13:50
82

I wanted to do the same, but I first had problems with the split function, then problems installing pandas 0.15.2, so I went back to my old version and wrote a little function that works very well. I hope this can help!

# input - df: a DataFrame, chunk_size: the chunk size
# output - a list of DataFrames
# purpose - splits the DataFrame into smaller chunks
def split_dataframe(df, chunk_size = 10000):
    chunks = list()
    # round up so a partial final chunk is kept, without producing an
    # empty trailing chunk when len(df) divides evenly by chunk_size
    num_chunks = len(df) // chunk_size + (1 if len(df) % chunk_size else 0)
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])
    return chunks
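
As a quick sanity check, the same slicing logic can be exercised on a throwaway toy DataFrame (hypothetical data, not from the original post) to confirm the chunk sizes, including the short final chunk:

```python
import pandas as pd

# Toy DataFrame for illustration: 11 rows, chunks of 3
df = pd.DataFrame({'x': range(11)})
chunk_size = 3

# Same slicing as split_dataframe above, written inline
chunks = [df[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

print([len(c) for c in chunks])  # [3, 3, 3, 2]
```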
Daniel Dror
elixir
  • much faster than using np.array_split() – jgaw Jan 27 '16 at 19:15
  • The correct way to calculate the number of chunks: `import math; num_chunks = math.ceil(len(df) / chunk_size)` – Sergey Leyko Feb 29 '20 at 16:59
  • @SergeyLeyko is correct. Otherwise you get an empty dataframe at the end of the `chunks` list when the `df` size is evenly divisible by the chunk size. Here's an alternative for num_chunks: `num_chunks = len(df) // chunk_size + (1 if len(df) % chunk_size else 0)` – Teshan Shanuka J Feb 01 '21 at 14:17
  • If `df.index` is not consecutive, `.iloc` has to be used, like in [this answer](https://stackoverflow.com/a/49415851/1116842). – moi Oct 20 '22 at 14:03
46

Be aware that np.array_split(df, 3) splits the dataframe into 3 sub-dataframes, while the split_dataframe function defined in @elixir's answer, when called as split_dataframe(df, chunk_size=3), splits the dataframe every chunk_size rows.

Example:

With np.array_split:

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11], columns=['TEST'])
df_split = np.array_split(df, 3)

...you get 3 sub-dataframes:

df_split[0] # 1, 2, 3, 4
df_split[1] # 5, 6, 7, 8
df_split[2] # 9, 10, 11

With split_dataframe:

df_split2 = split_dataframe(df, chunk_size=3)

...you get 4 sub-dataframes:

df_split2[0] # 1, 2, 3
df_split2[1] # 4, 5, 6
df_split2[2] # 7, 8, 9
df_split2[3] # 10, 11

Hope I'm right, and that this is useful.

Mack
Gilberto
33

I guess now we can use plain `iloc` with `range` for this:

chunk_size = int(df.shape[0] / 4)
for start in range(0, df.shape[0], chunk_size):
    df_subset = df.iloc[start:start + chunk_size]
    process_data(df_subset)  # process_data is a placeholder for your own function
pratpor
  • Also consider [this related answer](https://stackoverflow.com/a/65800824/) which does the same using the `more_itertools` package. – Asclepius Jul 20 '21 at 23:20
11

You can use a list comprehension to do this in a single line:

n = 4
chunks = [df[i:i+n] for i in range(0,df.shape[0],n)]
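
Note that `n` here is the number of rows per chunk, not the number of chunks. A small sketch on a toy DataFrame (hypothetical data, not from the original post) shows the resulting sizes:

```python
import pandas as pd

df = pd.DataFrame({'val': range(10)})
n = 4  # rows per chunk

chunks = [df[i:i + n] for i in range(0, df.shape[0], n)]
print([len(c) for c in chunks])  # [4, 4, 2]
```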
LucyDrops
  • But also see the important note about COPY vs VIEW in my answer below: https://stackoverflow.com/a/69409948/6939217 – drkr Oct 01 '21 at 17:36
8

Caution:

`np.array_split` doesn't work with numpy 1.9.0. I checked: it works with 1.8.1.

Error:

`AttributeError: 'DataFrame' object has no attribute 'size'`

Fruchtzwerg
yemu
  • I filed a bug on the pandas GitHub: https://github.com/pydata/pandas/issues/8846; it seems it is already fixed for pandas 0.15.2 – yemu Dec 17 '14 at 11:00
7

Building on @elixir's answer, I'd suggest using a generator to avoid loading all the chunks into memory:

def chunkit(df, chunk_size = 10000): 
    num_chunks = len(df) // chunk_size
    if len(df) % chunk_size != 0:
        num_chunks += 1
    for i in range(num_chunks):
        yield df[i*chunk_size:(i + 1) * chunk_size]
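
Usage is the same as for any generator: iterate and each chunk is produced on demand. A self-contained sketch (the function is repeated here so the snippet runs on its own; the toy DataFrame and sizes are hypothetical):

```python
import pandas as pd

def chunkit(df, chunk_size=10000):
    # number of full chunks, plus one more for any remainder
    num_chunks = len(df) // chunk_size
    if len(df) % chunk_size != 0:
        num_chunks += 1
    for i in range(num_chunks):
        yield df[i * chunk_size:(i + 1) * chunk_size]

df = pd.DataFrame({'x': range(7)})
sizes = [len(chunk) for chunk in chunkit(df, chunk_size=3)]
print(sizes)  # [3, 3, 1]
```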
raphael
7

I like one-liners, so @LucyDrops' answer works for me.

However, there is one important thing: add a `.copy()` if the chunks should be copies of the original `df` parts:

chunks = [df[i:i+n].copy() for i in range(0,df.shape[0],n)]

Otherwise there is a high chance of receiving the following warning during further processing of the chunks (in a loop, for example):

A value is trying to be set on a copy of a slice from a DataFrame.

(see the details in the Pandas documentation)

drkr
  • From the link you posted: "it's very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)" – Joey Baruch Dec 29 '21 at 16:58
  • @JoeyBaruch, the answer has been updated. Thanks – drkr Feb 20 '22 at 22:08
  • In Pandas, when warned about trying to use a copy, I've found the problem was always that a copy was NOT being used, and the solution is always to add a `.copy()`. If that dev would just change the wording to `on a non-copy`, whatever the deal is, just make it less unclear, it would be appreciated; and add the currently missing line number – gseattle Jul 25 '22 at 12:05
5

You can use groupby, assuming you have an integer enumerated index:

import math
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(sample=np.arange(99)))
rows_per_subframe = math.ceil(len(df) / 4.)

subframes = [i[1] for i in df.groupby(np.arange(len(df)) // rows_per_subframe)]

Note: iterating over a groupby yields tuples whose 2nd element is the dataframe, hence the slightly complicated extraction.

>>> len(subframes), [len(i) for i in subframes]
(4, [25, 25, 25, 24])
rumpel
1

I also experienced `np.array_split` not working with a Pandas DataFrame. My solution was to split only the index of the DataFrame and then introduce a new column with the "group" label:

indexes = np.array_split(df.index, N, axis=0)
for i, index in enumerate(indexes):
    df.loc[index, 'group'] = i

This makes groupby operations very convenient, such as for calculating the mean value of each group:

df.groupby(by='group').mean()
Asclepius
Martin Alexandersson