Pandas - split large excel file

Question

I have an Excel file with about 500,000 rows, and I want to split it into several Excel files, each with 50,000 rows.

I want to do it with pandas, as that should be the quickest and easiest way.

Any ideas how to do this?

Thank you for your help.

Does your Excel file have only one sheet with data? – MaxU - stand with Ukraine Dec 25 '16 at 12:25

score 10 · Answer 1 · edited Jul 17 '20 at 10:31

Assuming that your Excel file has only one (first) sheet containing data, I'd make use of the `chunksize` parameter:

import pandas as pd
import numpy as np

i=0
for df in pd.read_excel(file_name, chunksize=50000):
    df.to_excel('/path/to/file_{:02d}.xlsx'.format(i), index=False)
    i += 1

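UPDATE: `read_excel` does not support reading in chunks, so load the whole sheet and split the resulting DataFrame instead: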

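chunksize = 50000
df = pd.read_excel(file_name)
# np.split raises ValueError unless len(df) is an exact multiple of the number of sections
for i, chunk in enumerate(np.split(df, len(df) // chunksize)):
    chunk.to_excel('/path/to/file_{:02d}.xlsx'.format(i), index=False)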
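
Because `np.split` raises when the row count is not an exact multiple of the number of sections, slicing by row position may be a safer variant. A minimal sketch, assuming the whole sheet fits in memory; `chunk_bounds` and `split_excel_by_rows` are hypothetical helper names, and the output file pattern is only an example:

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (start, stop) row ranges that cover n_rows in order.

    Every range holds chunk_size rows except possibly the last one,
    which holds whatever is left over.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]


def split_excel_by_rows(file_name, chunk_size, pattern="file_{:02d}.xlsx"):
    """Write consecutive row ranges of the first sheet to separate files."""
    import pandas as pd  # imported lazily so chunk_bounds works without pandas

    df = pd.read_excel(file_name)
    for i, (start, stop) in enumerate(chunk_bounds(len(df), chunk_size)):
        # iloc slices by position, so the last slice is simply shorter
        df.iloc[start:stop].to_excel(pattern.format(i), index=False)
```

For example, `chunk_bounds(120_000, 50_000)` gives two full ranges and a final one of 20,000 rows, so no rows are dropped and no error is raised.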

edited Jul 17 '20 at 10:31

mohammadreza berneti

answered Dec 25 '16 at 12:29

MaxU - stand with Ukraine

Sorry for the delay, but for some reason it raises an error that says `Reading an Excel file in chunks is not implemented`. Any ideas? – TheDaJon Jan 09 '17 at 12:12
@TheDaJon, what is your pandas version: `pd.__version__`? – MaxU - stand with Ukraine Jan 09 '17 at 12:13
My version is 0.17.1. – TheDaJon Jan 09 '17 at 12:22
I was happy too soon... it does split the file, though the chunks are always bigger than what I chose. Any idea why? – TheDaJon Jan 11 '17 at 07:23
`array split does not result in an equal division` is raised when the number of records is not evenly divisible by the number of chunks. – sudhansu63 Sep 20 '19 at 10:36
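
Since `read_excel` cannot read in chunks, one way to avoid holding the whole sheet in memory might be to stream rows with openpyxl directly. This is only a sketch under assumptions: openpyxl is installed, the data sits in the first sheet with a single header row, and `batched_rows` and `stream_split` are hypothetical helper names.

```python
from itertools import islice


def batched_rows(rows, size):
    """Yield lists of up to `size` items from any iterable of rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def stream_split(src, chunk_size, pattern="file_{:02d}.xlsx"):
    """Copy the first sheet of src into files of chunk_size data rows each."""
    from openpyxl import Workbook, load_workbook  # lazy: only needed here

    wb = load_workbook(src, read_only=True)  # read-only mode streams rows
    ws = wb.worksheets[0]
    rows = ws.iter_rows(values_only=True)
    header = next(rows, None)
    if header is None:
        wb.close()
        return  # empty sheet, nothing to write
    for i, batch in enumerate(batched_rows(rows, chunk_size)):
        out = Workbook(write_only=True)  # write-only mode appends rows
        sheet = out.create_sheet()
        sheet.append(header)  # repeat the header row in every output file
        for row in batch:
            sheet.append(row)
        out.save(pattern.format(i))
    wb.close()
```

This copies cell values only; formatting and formulas are not preserved.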

score 3 · Answer 2 · answered Sep 07 '20 at 00:30

Use `np.array_split`, as in this answer https://stackoverflow.com/a/17315875/1394890, if you get the error:

array split does not result in an equal division
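
As documented for NumPy, `np.array_split` with `n` sections on a sequence of length `l` returns `l % n` parts of size `l // n + 1` and the rest of size `l // n`, so it does not raise on uneven lengths. The sketch below only mirrors that sizing rule in plain Python to show what part sizes to expect; `array_split_sizes` is a hypothetical name, not a NumPy function:

```python
def array_split_sizes(length, sections):
    """Part sizes that follow the documented np.array_split sizing rule."""
    if sections <= 0:
        raise ValueError("sections must be positive")
    base, extra = divmod(length, sections)
    # The first `extra` parts get one more element than the rest
    return [base + 1] * extra + [base] * (sections - extra)
```

Note that with 520,000 rows and `len(df) // chunksize` giving 10 sections, each part would hold 52,000 rows, which may be one reason chunks come out larger than the chosen `chunksize`.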

answered Sep 07 '20 at 00:30

wild

score 1 · Answer 3 · answered Mar 22 '18 at 16:13

As MaxU explained, I would also use a `chunksize` variable and divide the total number of rows in the large file into chunks of the required size.

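import pandas as pd
import numpy as np

chunksize = 50000
df = pd.read_excel("path/to/file.xlsx")
# np.split needs len(df) to be an exact multiple of the number of sections
for i, chunk in enumerate(np.split(df, len(df) // chunksize)):
    # index=True also writes the DataFrame's row index as the first column
    chunk.to_excel('path/to/destination/folder/file_{:02d}.xlsx'.format(i), index=True)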

Hope this helps.

answered Mar 22 '18 at 16:13

Tarun Balani

you are joking man ... – graj499 Nov 21 '22 at 10:58

score 1 · Answer 4 · edited Oct 25 '21 at 12:57

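import pandas as pd

df = pd.read_excel("inputfilename.xlsx")
total_size = len(df)  # about 500,000 rows in the question
chunk_size = 50000
# Slice consecutive blocks of chunk_size rows; the last block may be shorter
for i in range(0, total_size, chunk_size):
    part = df[i:i+chunk_size]
    part.to_excel(str(i) + "outputfilename.xlsx")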

edited Oct 25 '21 at 12:57

Adeel Afzal

answered Aug 30 '21 at 10:26

user16005292

Please [edit] your post to include an explanation. – mousetail Oct 26 '21 at 07:07

score 0 · Answer 5 · answered Feb 26 '22 at 19:23

I wrote a function for this:

import math
import pandas as pd

def split_excel(file_name, n):  # n: maximum number of parts (output Excel files)
    df = pd.read_excel(file_name)
    rows = math.ceil(len(df) / n)  # rows per part; the last part may be shorter

    for i in range(n):
        part = df.iloc[i * rows:(i + 1) * rows]
        if part.empty:  # no rows left (also covers an empty sheet)
            break
        part.to_excel(f'part_{i}.xlsx', index=False)


split_excel('my_file.xlsx', 10)  # 10 parts, as in the question
