
I have been trying to read a few large text files (around 1.4 GB to 2 GB each) with Pandas, using the read_csv function, to no avail. Below are the versions I am using:

  • Python 2.7.6
  • Anaconda 1.9.2 (64-bit) (default, Nov 11 2013, 10:49:15) [MSC v.1500 64 bit (AMD64)]
  • IPython 1.1.0
  • Pandas 0.13.1

I tried the following:

df = pd.read_csv('data.txt')

and it crashed IPython with the message: Kernel died, restarting.

Then I tried using an iterator:

tp = pd.read_csv('data.txt', iterator=True, chunksize=1000)

Again, I got the Kernel died, restarting error.
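(For reference, the reader returned by this call is normally consumed like the following; a minimal sketch, assuming the read itself does not crash:)

chunk = tp.get_chunk()  # returns the next 1000 rows as a DataFrame
# or loop over all the chunks:
for chunk in tp:
    pass  # work on each 1000-row DataFrame here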

Any ideas? Or any other way to read big text files?

Thank you!

marillion
  • I did not get this error on my machine, with a configuration similar to yours. How much RAM do you have? On my machine Python needed a peak of around 5 GB to read a 2.9 GB CSV using `pd.read_csv()` – Saullo G. P. Castro May 01 '14 at 16:25
  • @SaulloCastro My machine has 8 GB installed. It should be able to handle such a file size, since most of the installed RAM is available; I am not running anything else. – marillion May 01 '14 at 16:38

1 Answer


A solution for a similar question was given here some time after this question was posted. Basically, it suggests reading the file in chunks, like this:

chunksize = 10 ** 6  # number of rows per chunk
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

You should choose the chunksize parameter according to your machine's capabilities (that is, make sure it can hold and process one chunk at a time).
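If the end goal is still a single DataFrame, and what you keep from each chunk is small enough to fit in memory, the pieces can be collected and concatenated afterwards (a sketch; the column name and filter condition are hypothetical placeholders):

import pandas as pd

chunksize = 10 ** 6  # rows per chunk
parts = []
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # keep only the rows/columns you actually need so the combined result fits in RAM
    parts.append(chunk[chunk['value'] > 0])  # 'value' is a hypothetical column

df = pd.concat(parts, ignore_index=True)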

Laurent S
DarkCygnus
  • What is 10 ** 6? Please enlighten us lesser enlightened ones. Also, this does not show how to store each chunk in a DataFrame and concatenate all of them afterwards. – Rahul Saini Jul 09 '19 at 16:47
  • That "10 raised to the power 6" is not intuitive. What is it: KB, MB, lines in the file? – Rahul Saini Jul 09 '19 at 16:53
  • Perhaps a more explanatory and useful link could be mentioned here: https://pythondata.com/working-large-csv-files-python/ – Rahul Saini Jul 09 '19 at 16:56
  • Oh, sorry, I didn't get you quite right. It's the number of rows per chunk. – DarkCygnus Jul 09 '19 at 16:57
  • I suggest you check the target duplicate question, as it has relevant and useful info for you :) Thanks for the link too; I will check it out. – DarkCygnus Jul 09 '19 at 16:57