
I am a new user of Python. My problem is this:

I have three csv files (each about 15 GB, with three columns), and I want to read them into Python and get rid of the rows where dur = 0. My csv looks like this:

sn_fx   sn_tx   dur
5129789 3310325 2
5129789 5144184 1
5129789 5144184 1
5129789 5144184 1
5129789 5144184 1
5129789 6302346 4
5129789 6302346 0

I know I should read it line by line, and I tried this:

file='cmct_0430x.csv'
for line in file.xreadlines():
    pass

but it does not seem to work.

Besides, I do not know how to turn these lines into a DataFrame.

Could someone show me more details about this? I would appreciate it very much!

  • Python has a `csv` module, or otherwise use `pandas`. But first verify that you have enough memory to read this file. –  Nov 16 '16 at 05:09
  • There are a number of questions this is a duplicate of. Without any info on what you're going to do with the data, it's impossible to tell which one fits the best. – ivan_pozdeev Nov 16 '16 at 05:11
  • Thank you. I have tried pd.read_csv, but it gives a memory error. – lemon Nov 16 '16 at 05:11
  • Also http://stackoverflow.com/questions/9087039/most-efficient-way-to-parse-a-large-csv-in-python, http://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas . – ivan_pozdeev Nov 16 '16 at 05:12
  • You are getting a memory error because you are processing a chunk of data larger than your memory. – FallAndLearn Nov 16 '16 at 05:12
  • 1
  • Please don't just say "it seems not to work" when you ask questions here. Instead be as specific as possible and include a traceback if possible. Off-topic: `file.xreadlines()` has been deprecated since Python 2.3, use `for line in file:` instead (see the [`xreadlines` documentation](https://docs.python.org/2/library/stdtypes.html#file.xreadlines)). – martineau Nov 16 '16 at 05:23
  • I am sorry for asking a possibly duplicate question; I will search more next time. – lemon Nov 16 '16 at 05:52

1 Answer


You should use pandas, and read the csv in chunks (a fixed number of rows at a time) of a suitable size. Then use concat to combine the chunks into a single DataFrame.

import pandas as pd

# Read the csv 1000 rows at a time, then concatenate the chunks into one DataFrame
tp = pd.read_csv('cmct_0430x.csv', iterator=True, chunksize=1000)
df = pd.concat(tp, ignore_index=True)

Pandas: read_csv

You are getting a memory error because you are reading the entire csv at once, and it is larger than your main memory. Break it into chunks and process them one at a time.
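Since you also want to drop the rows where dur = 0, you can filter each chunk as you read it and only keep the rows you need, which also keeps memory usage down. A minimal sketch, assuming the columns are whitespace-separated as in the sample above (use sep=',' if the file really is comma-separated):

import pandas as pd

filtered = []

# sep=r'\s+' assumes whitespace-separated columns; adjust the separator if needed.
# Keep only the rows with dur != 0 from each chunk before concatenating.
for chunk in pd.read_csv('cmct_0430x.csv', sep=r'\s+', chunksize=100000):
    filtered.append(chunk[chunk['dur'] != 0])

df = pd.concat(filtered, ignore_index=True)

You can run the same loop over each of your three files and concatenate the results, since only the filtered rows are kept in memory.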
