
I have an xlsx file with 11 columns, 15M rows, and 198 MB in size. It's taking forever to read and work with in pandas. After reading Stack Overflow answers, I switched to dask and modin. However, I'm receiving the following error when using dask:

df = dd.read_csv('15Lacs.csv', encoding='unicode_escape')

C error: out of memory

When I use modin['ray'] I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 112514: invalid start byte

Is there a more efficient way to import large xlsx or csv files into Python on average hardware?


1 Answer


If you're using dask, set a smaller blocksize so each partition fits comfortably in memory:

df = dd.read_csv('15Lacs.csv', encoding='unicode_escape', blocksize="8MB")
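
With dask, the read is lazy and only one partition is loaded at a time, so keep the downstream work lazy too and trigger it with .compute(). A minimal sketch of that pattern (the aggregations shown are illustrative, not part of the original answer):

import dask.dataframe as dd

# small blocksize keeps each partition well under available RAM
df = dd.read_csv('15Lacs.csv', encoding='unicode_escape', blocksize="8MB")

# nothing is read yet; these calls stream through the partitions
row_count = len(df)
summary = df.describe().compute()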

If you're using pandas, read the file in chunks instead of all at once:

for batch in pd.read_csv('15Lacs.csv', encoding='unicode_escape', chunksize=1000):
    process(batch)
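
Here process is a placeholder; the point is to reduce each chunk to something small and let it be discarded before the next one is read. A hedged sketch of one such reduction, assuming a hypothetical numeric column named 'amount':

import pandas as pd

totals = []
for batch in pd.read_csv('15Lacs.csv', encoding='unicode_escape', chunksize=1000):
    # keep only the per-chunk aggregate, not the chunk itself
    totals.append(batch['amount'].sum())

grand_total = sum(totals)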

I'm guessing you're filling up your RAM by loading this alongside a bunch of other things, and that you're running Windows?