
I have an xlsx file with 11 columns, 15M rows, and 198 MB in size. It's taking forever to read and work with in pandas. After reading Stack Overflow answers, I switched to dask and modin. However, I'm receiving the following error when using dask:

df = dd.read_csv('15Lacs.csv', encoding='unicode_escape')

C error: out of memory

When I use modin['ray'] I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 112514: invalid start byte

Is there a more efficient way to import large xlsx or csv files into Python on average hardware?


1 Answer


If you're using dask, set a smaller blocksize so each partition fits comfortably in memory:

df = dd.read_csv('15Lacs.csv', encoding='unicode_escape', blocksize="8MB")
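
With dask, the read is lazy and only one partition is loaded at a time, so keep the downstream work lazy too and trigger it with .compute(). A minimal sketch of that pattern (the aggregations shown are illustrative, not part of the original answer):

import dask.dataframe as dd

# small blocksize keeps each partition well under available RAM
df = dd.read_csv('15Lacs.csv', encoding='unicode_escape', blocksize="8MB")

# nothing is read yet; these calls stream through the partitions
row_count = len(df)
summary = df.describe().compute()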

If you're using pandas, read the file in chunks instead of all at once:

for batch in pd.read_csv('15Lacs.csv', encoding='unicode_escape', chunksize=1000):
    process(batch)
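
Here process is a placeholder; the point is to reduce each chunk to something small and let it be discarded before the next one is read. A hedged sketch of one such reduction, assuming a hypothetical numeric column named 'amount':

import pandas as pd

totals = []
for batch in pd.read_csv('15Lacs.csv', encoding='unicode_escape', chunksize=1000):
    # keep only the per-chunk aggregate, not the chunk itself
    totals.append(batch['amount'].sum())

grand_total = sum(totals)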

I'm guessing you're filling up your RAM by loading this alongside a bunch of other things, and that you're running Windows?