
I am trying to load this CSV file into a pandas DataFrame using

import pandas as pd
filename = '2016-2018_wave-IV.csv'

df = pd.read_csv(filename)

However, despite my PC being not super slow (8 GB RAM, 64-bit Python) and the file being somewhat but not extraordinarily large (< 33 MB), loading the file takes more than 10 minutes. It is my understanding that this shouldn't take nearly that long, and I would like to figure out what's behind it. (As suggested in similar questions, I have tried the chunksize and usecols parameters (EDIT and also low_memory), yet without success; so I believe this is not a duplicate but has more to do with the file or the setup.)
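For reference, here is roughly what I tried (a sketch; the column names are placeholders, not the actual ones from the file):

import pandas as pd

filename = '2016-2018_wave-IV.csv'

# Attempt 1: read the file in chunks and concatenate them
chunks = pd.read_csv(filename, chunksize=10000, low_memory=False)
df = pd.concat(chunks, ignore_index=True)

# Attempt 2: load only a subset of the columns
# ('id' and 'year' are placeholder names)
df = pd.read_csv(filename, usecols=['id', 'year'])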

Could someone give me a pointer? Many thanks. :)

Ivo
  • Did you see [this](https://stackoverflow.com/questions/17557074/memory-error-when-using-pandas-read-csv)? Setting the low_memory parameter to False should work. – Vaishali Feb 24 '19 at 18:53
  • @Vaishali, thanks a lot - I have tried unsuccessfully but forgot to mention it. Thanks for the pointer, though! :) – Ivo Feb 24 '19 at 19:00
  • 1
    This is a huge file. The disk size does not indicate for low data in the file. Because this file contains text the disk size is low but the amound of data in it is big. Try to read it chunk by chunk.. [how do you split reading a large csv file into evenly sized chunks in python](https://stackoverflow.com/questions/4956984/how-do-you-split-reading-a-large-csv-file-into-evenly-sized-chunks-in-python) – DavidDr90 Feb 24 '19 at 20:08

2 Answers


I was testing the file you shared, and the problem is that this CSV file has leading and trailing double quotes on every line (so pandas thinks the whole line is one column). They have to be removed before processing, for example with sed on Linux, by processing and re-saving the file in Python, or simply by replacing all double quotes in a text editor.
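A minimal sketch of the Python route (the _fixed output filename is made up; it assumes each line is wrapped in exactly one pair of double quotes):

import pandas as pd

src = '2016-2018_wave-IV.csv'
dst = '2016-2018_wave-IV_fixed.csv'  # hypothetical output name

# Strip one pair of wrapping double quotes from each line, if present
with open(src, encoding='utf-8') as fin, open(dst, 'w', encoding='utf-8') as fout:
    for line in fin:
        line = line.rstrip('\n')
        if line.startswith('"') and line.endswith('"'):
            line = line[1:-1]
        fout.write(line + '\n')

df = pd.read_csv(dst)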

Hubert Dudek
  • Thanks, this and some other (similar) issues caused the problem. Is there any way to do this in Python? – Ivo Mar 01 '19 at 16:18

To summarize and expand on the answer by @Hubert Dudek:

The issue was with the file: not only did it have a " at the start and end of every line, but there were also stray "s inside the lines themselves. After I fixed the former, the latter still caused the column attribution to get messed up.
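One way to cope with stray quotes inside the lines (a sketch, not part of the original answer) is to tell pandas not to treat " as a quote character at all and then strip any leftovers:

import csv
import pandas as pd

# quoting=csv.QUOTE_NONE makes read_csv treat double quotes as ordinary
# characters instead of field delimiters, so stray quotes no longer
# merge several columns into one.
df = pd.read_csv('2016-2018_wave-IV.csv', quoting=csv.QUOTE_NONE)

# Remove leftover double quotes from the header and the string columns.
df.columns = df.columns.str.replace('"', '', regex=False)
df = df.replace('"', '', regex=True)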

Ivo