
I am trying to read a 4 GB CSV file with pandas using the code below:

import pandas as pd

tp = pd.read_csv('train.csv', sep='\t', iterator=True, chunksize=10000)
train = pd.concat(tp, ignore_index=True)

After using this I am able to read the csv, but when I check `train.shape` it shows the number of columns as 1, even though there are 24 columns. I also tried using `sep=','`, but doing that returns the output on the console as "Killed". I am using a GC instance with 8 GB RAM, so there should be no issue from that side. Also, if I try reading the CSV using:

pandas.read_csv("train.csv")

this fails. For that I have referred to various other questions on Stack Overflow, which recommended reading the data in chunks.
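As the comments below suggest, a cheap first step is to read only a handful of rows and check the shape; this is only a sketch based on the file name and column count mentioned above, with the comma separator being the variant under test:

import pandas as pd

# Read just the first 100 rows -- a cheap way to confirm the delimiter.
# If the file is really comma-separated, sep='\t' collapses each row
# into a single column, which would explain train.shape showing 1 column.
sample = pd.read_csv('train.csv', sep=',', nrows=100)
print(sample.shape)   # expect (100, 24) if the separator is right
print(sample.dtypes)  # also shows which columns could take a smaller dtype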

Samyak Upadhyay
  • Can you read in a csv file that just contains some of the rows of your train.csv file? – Arco Bast Nov 05 '17 at 21:33
  • Add `nrows=100` and read the file. Then do `train.shape` to see what shape it is. If shape is not what you expect, your `sep` is wrong or perhaps you need to `skiprows` or something else is going on like encoding. – Jarad Nov 06 '17 at 02:56
  • @Jarad I am able to get what is expected using `sep=','` but then unable to read the complete csv which is needed for further computation so any suggestions on that? – Samyak Upadhyay Nov 06 '17 at 09:04
  • @ArcoBast Yeah that can be done but the main aim is to read the complete csv file using what Jarad suggested. – Samyak Upadhyay Nov 06 '17 at 09:06
  • Do you actually have a "Comma-separated file" (.csv) that's tab-delimited? You need to be certain what your separation character is first: comma or tab. The only way I'm able to recreate your shape of (n_rows, 1 col) is when reading a CSV file that's actually separated by a comma but `pd.read_csv` has specified `sep='\t'`. Second, you say "output on console as killed". Is there an error like `Memory Error` or can you be more specific? – Jarad Nov 06 '17 at 21:00
  • @Jarad I tried with `sep=','` that works fine if I am trying to read only some part of the data but if I am trying to read the complete dataset it shows `Out of Memory` error on the terminal and when I see data usage my total memory is being used for the purpose. – Samyak Upadhyay Nov 07 '17 at 07:57
  • 1
    OK. Then my opinion is that your data is too large to fit into memory (hardware limitation). You may need to figure out how to process the data in chunks. https://stackoverflow.com/a/25962187/1577947 Or, you can use blaze which doesn't read into memory but allows you to query just what you need. Maybe someone else has ideas. Good luck – Jarad Nov 07 '17 at 15:38
  • @Jarad Thanks. In the end I had to read only those columns that were useful for my algorithm, and along with that I had to specify the dtype of those columns. – Samyak Upadhyay Nov 09 '17 at 10:08
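For reference, here is a minimal sketch of the approach described in the last comment above: read only the columns the algorithm needs, give them explicit dtypes, and let pandas stream the file in chunks. The column names and dtypes below are placeholders, not taken from the actual train.csv:

import pandas as pd

# Hypothetical subset of the 24 columns and their dtypes -- replace with
# the columns your algorithm actually needs and dtypes that fit the data.
cols = ['id', 'feature_a', 'feature_b', 'label']
dtypes = {'id': 'int32', 'feature_a': 'float32',
          'feature_b': 'float32', 'label': 'int8'}

reader = pd.read_csv('train.csv', sep=',', usecols=cols,
                     dtype=dtypes, chunksize=100000)

# Either process each chunk separately to keep peak memory low...
# for chunk in reader:
#     process(chunk)   # process() is a placeholder for your own logic
# ...or, if the reduced columns fit in RAM, concatenate the chunks.
train = pd.concat(reader, ignore_index=True)
print(train.shape)

Reading only the needed columns with smaller dtypes cuts the in-memory size substantially, which is usually what makes the full concatenation fit on an 8 GB machine.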

0 Answers