I have an excel file with about 500,000 rows and I want to split it to several excel file, each with 50,000 rows.
I want to do it with pandas so it will be the quickest and easiest.
any ideas how to make it?
thank you for your help
I have an excel file with about 500,000 rows and I want to split it to several excel file, each with 50,000 rows.
I want to do it with pandas so it will be the quickest and easiest.
any ideas how to make it?
thank you for your help
Assuming that your Excel file has only one (first) sheet containing data, I'd make use of chunksize
parameter:
import pandas as pd
import numpy as np
i=0
for df in pd.read_excel(file_name, chunksize=50000):
df.to_excel('/path/to/file_{:02d}.xlsx'.format(i), index=False)
i += 1
UPDATE:
chunksize = 50000
df = pd.read_excel(file_name)
for chunk in np.split(df, len(df) // chunksize):
chunk.to_excel('/path/to/file_{:02d}.xlsx'.format(i), index=False)
use np.split_array as per this answer https://stackoverflow.com/a/17315875/1394890 if you get
array split does not result in an equal division
As explained by MaxU, I will also make use of a variable chunksize and divide the total number of rows in large file into required number of rows.
import pandas as pd
import numpy as np
chunksize = 50000
i=0
df = pd.read_excel("path/to/file.xlsx")
for chunk in np.split(df, len(df) // chunksize):
chunk.to_excel('path/to/destination/folder/file_{:02d}.xlsx'.format(i), index=True)
i += 1
Hope this would help you.
import pandas as pd
l = pd.read_excel("inputfilename.xlsx")
total_size = 500,000
chunk_size = 50000
for i in range(0, total_size, chunk_size):
df = l[i:i+chunk_size]
df.to_excel(str(i)+"outputfilename.xlsx")
I wrote a function for this:
import numpy as np
import pandas as pd
def split_excel(file_name, n): # n: number of chunks or parts (number of outputed excel files)
df = pd.read_excel(file_name)
l = len(df)
c = l // n # c: number of rows
r = l % c
if r != 0: # if it is not divisible
df[-r:].to_excel(f'part_{l//c+1}.xlsx', index=False)
df = df[:-r]
i = 0
for part in np.split(df, l//c):
part.to_excel(f'part_{i}.xlsx', index=False)
i += 1
split_excel('my_file.xlsx')