
I am using Python 2.7 and pandas to load a fairly large CSV file (~10 GB) with the `read_csv` method. Until today this took 3-4 minutes; suddenly it has started taking hours without completing. The machine has 30 GB of RAM and multiple CPUs, and I checked that nearly all of the memory and CPUs are free. Also, the process's status is 'D' most of the time (Linux machine), which I've read usually means it is waiting on I/O.

How can I debug this to find what's causing the problem?
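One way to narrow this down (a sketch, not the asker's actual code; the chunk size is illustrative) is to read the file in chunks and time each one, which shows whether the read stalls immediately (open/seek problem) or partway through (disk or network filesystem problem):

```python
import time
import pandas as pd
from io import StringIO

# Stand-in for the real ~10 GB file; a small in-memory CSV keeps the sketch runnable.
csv_data = StringIO(u"a,b\n" + u"\n".join(u"%d,%d" % (i, i * 2) for i in range(1000)))

chunks = []
start = time.time()
# chunksize makes read_csv return an iterator of DataFrames instead of one frame.
# On the real file, a chunk of ~1e6 rows would print progress as each piece is parsed.
for i, chunk in enumerate(pd.read_csv(csv_data, chunksize=250)):
    chunks.append(chunk)
    print("chunk %d: %d rows, %.2fs elapsed" % (i, len(chunk), time.time() - start))

df = pd.concat(chunks, ignore_index=True)
print(df.shape)  # (1000, 2)
```

If every chunk is slow rather than one of them hanging, the bottleneck is more likely the storage layer (e.g. a slow NFS mount) than the parser.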

Thank you

d1337
  • What has changed: the file? your pandas install? (As a side note, if you find yourself reading in the same csv often consider using pickle or HDF5Store). – Andy Hayden Jul 10 '13 at 11:42
  • Nothing has changed; that was the first thing I verified. And I tried pickling the data, but I didn't have sufficient RAM for that even when things used to work. Is it possible that another user's actions on the server are influencing this? If so, how can I verify it? – d1337 Jul 10 '13 at 11:48
  • stupid question, but did you try a reboot if nothing has changed? – Joop Jul 10 '13 at 11:50
  • 1
    (Definitely recommend [pytables/HDF5Store](http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables).) – Andy Hayden Jul 10 '13 at 11:55
  • can you print df.info? (if you have from a prior run), or about what your frame looks like (shape & dtypes)? – Jeff Jul 10 '13 at 12:56
  • Will try HDF5Store, although it might be a problem to install since I am not the admin of the machine (which is also why I can't reboot it)... Searching old run logs for the shape & dtypes. – d1337 Jul 10 '13 at 13:19
  • you can install a virtualenv: http://stackoverflow.com/questions/5844869/comprehensive-beginners-virtualenv-tutorial – Jeff Jul 10 '13 at 13:34
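Following up on the pickle/HDF5 suggestion in the comments, a minimal sketch of the caching pattern (paths and names here are illustrative; the asker notes pickling hit RAM limits on this particular data, and `to_hdf` would work the same way but requires PyTables):

```python
import os
import tempfile
import pandas as pd

def load_frame(csv_path, cache_path):
    """Parse the CSV once; later runs reuse the much faster binary cache."""
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    df = pd.read_csv(csv_path)
    df.to_pickle(cache_path)
    return df

# Tiny demo standing in for the real 10 GB file.
tmp = tempfile.mkdtemp()
csv_path = os.path.join(tmp, "data.csv")
cache_path = os.path.join(tmp, "data.pkl")
pd.DataFrame({"a": range(5), "b": range(5)}).to_csv(csv_path, index=False)

df1 = load_frame(csv_path, cache_path)  # parses the CSV, writes the cache
df2 = load_frame(csv_path, cache_path)  # hits the pickle cache
print(df1.equals(df2))  # True
```

The point of the pattern is that the expensive text parse happens only on the first run; whether pickle or HDF5 is the better cache format depends on the data size and what can be installed on the machine.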

0 Answers