
I'm trying to read an Excel/CSV file with more than 1M rows and split it.

For example:

I have an Excel file with 2.7M rows, and I want to split it into 3 files of 0.9M rows each using Python.

Code:

import pandas as pd

df = pd.read_excel("/path/to/excels/file.xlsx")

n_partitions = 3

for i in range(n_partitions):
    sub_df = df.iloc[(i*n_partitions):((i+1)*n_partitions)]
    sub_df.to_excel(f"/output/path/to/test-{i}.xlsx", sheet_name="a")

I tried this and it works fine for smaller datasets, but I need something that works for more than 1M rows, since Excel can only display about 1 million rows per sheet.

Keshav
  • Your partition logic is wrong: you need to divide the total row count by n_partitions and then increment by that value in each iteration (see the sketch below the comments). – shaik moeed Apr 01 '23 at 20:01
  • Does this answer your question? [Split a large pandas dataframe](https://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe) – HedgeHog Apr 02 '23 at 12:59
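
A minimal sketch of the fix the first comment describes, reusing the paths and names from the question; the math.ceil rounding is my own assumption, so the final, possibly shorter chunk is not dropped:

import math

import pandas as pd

df = pd.read_excel("/path/to/excels/file.xlsx")
n_partitions = 3

# Rows per partition: total rows divided by the partition count,
# rounded up so the last chunk is kept even if it is shorter.
chunk_size = math.ceil(df.shape[0] / n_partitions)

for i in range(n_partitions):
    sub_df = df.iloc[i * chunk_size:(i + 1) * chunk_size]
    sub_df.to_excel(f"/output/path/to/test-{i}.xlsx", sheet_name="a")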

2 Answers


Correction

I misunderstood the question at first, so my option only covers splitting by a fixed number of rows. There is a well-asked and well-answered question that covers splitting into a given number of partitions: [Split a large pandas dataframe](https://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe).


Split the dataframe by a given number of rows: you could adjust your range() call and extend it with the step parameter, which takes the value of your n_partitions (which here acts as the number of rows per file):

# Step through the dataframe n_partitions rows at a time;
# slicing with df[i:i+n] selects rows by position.
for i in range(0, df.shape[0], n_partitions):
    df[i:i+n_partitions].to_excel(f"/output/path/to/test-{i}.xlsx", sheet_name="a")

Example:

import pandas as pd

df = pd.read_excel("/path/to/excels/file.xlsx")

n_partitions = 3
for i in range(0, df.shape[0], n_partitions):
    df[i:i+n_partitions].to_excel(f"/output/path/to/test-{i}.xlsx", sheet_name="a")
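
Note that the output files are then named test-0.xlsx, test-3.xlsx, test-6.xlsx, and so on, because i advances by the step value. If you would rather have sequential file numbers, a small variation (same df and paths as above):

for part, i in enumerate(range(0, df.shape[0], n_partitions)):
    df[i:i+n_partitions].to_excel(f"/output/path/to/test-{part}.xlsx", sheet_name="a")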
HedgeHog

> Excel can only display about 1 million rows per sheet.

Keeping this constraint in mind, and in a scenario where you may want to scale up as the data grows, define a row limit per output file instead of recomputing the number of partitions every time. That way you control how much data goes into each partition, rather than the other way around.

limit = 100000

Here I've kept it at 100,000 for slightly better readability in Excel. Then you can simply partition using DataFrame.iloc, as you did originally.

n = 1  # output file counter
i = 0  # current row offset into the dataframe
while i < df.shape[0]:
    sub_df = df.iloc[i:i + limit]  # next slice of at most `limit` rows
    sub_df.to_excel(f"/output/path/to/test-{n}.xlsx", sheet_name="a")
    n += 1
    i += limit

It's a rather naive approach, but it works and reads well.
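
Since the question mentions CSV input as well: if the source is a .csv rather than .xlsx, you can avoid loading everything into memory at once, because pd.read_csv can stream the file in chunks (pd.read_excel has no chunksize option). A minimal sketch under that assumption, with a placeholder input path:

import pandas as pd

limit = 100000  # rows per output file, well under Excel's sheet limit

# chunksize makes read_csv yield successive DataFrames of at most
# `limit` rows, so the whole file never has to fit in memory at once.
# "/path/to/file.csv" is a hypothetical path for illustration.
for n, chunk in enumerate(pd.read_csv("/path/to/file.csv", chunksize=limit), start=1):
    chunk.to_excel(f"/output/path/to/test-{n}.xlsx", sheet_name="a", index=False)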