
I am a new user of Python. My problem is this:

I have three csv files (each about 15 GB, with three columns), and I want to read them into Python and get rid of the rows where dur = 0. My csv looks like this:

sn_fx   sn_tx   dur
5129789 3310325 2
5129789 5144184 1
5129789 5144184 1
5129789 5144184 1
5129789 5144184 1
5129789 6302346 4
5129789 6302346 0

I know I should read it line by line, and I tried this:

file='cmct_0430x.csv'
for line in file.xreadlines():
    pass

but it does not seem to work.

Besides, I do not know how to turn these lines into a DataFrame.

Could someone show me more details about this? I would appreciate it very much!

  • Python has a `csv` module, or otherwise use `pandas`. But first verify that you have enough memory to read this file. –  Nov 16 '16 at 05:09
  • There are a number of questions this is a duplicate of. Without any info on what you're going to do with the data, it's impossible to tell which one fits the best. – ivan_pozdeev Nov 16 '16 at 05:11
  • Thank you. I have tried pd.read_csv, but it gives a memory error. – lemon Nov 16 '16 at 05:11
  • Also http://stackoverflow.com/questions/9087039/most-efficient-way-to-parse-a-large-csv-in-python, http://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas . – ivan_pozdeev Nov 16 '16 at 05:12
  • You are getting a memory error because you are processing a chunk of data larger than your memory. – FallAndLearn Nov 16 '16 at 05:12
  • 1
  • Please don't just say "it seems not to work" when you ask questions here. Instead be as specific as possible and include a traceback if possible. Off-topic: `file.xreadlines()` has been deprecated since Python 2.3, use `for line in file:` instead (see the [`xreadlines` documentation](https://docs.python.org/2/library/stdtypes.html#file.xreadlines)). – martineau Nov 16 '16 at 05:23
  • I am sorry for asking a possibly duplicate question; I will search more next time. – lemon Nov 16 '16 at 05:52

1 Answer


You should use pandas, and read the csv in chunks (a fixed number of rows at a time) of a suitable size. Then use concat to combine the chunks into a single DataFrame.

import pandas as pd

# Read the csv 1000 rows at a time, then concatenate the chunks into one DataFrame
tp = pd.read_csv('cmct_0430x.csv', iterator=True, chunksize=1000)
df = pd.concat(tp, ignore_index=True)

Pandas: read_csv

You are getting a memory error because you are reading the entire csv at once, and it is larger than your main memory. Break it into chunks and process them one at a time.
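Since you also want to drop the rows where dur = 0, you can filter each chunk as you read it and only keep the rows you need, which also keeps memory usage down. A minimal sketch, assuming the columns are whitespace-separated as in the sample above (use sep=',' if the file really is comma-separated):

import pandas as pd

filtered = []

# sep=r'\s+' assumes whitespace-separated columns; adjust the separator if needed.
# Keep only the rows with dur != 0 from each chunk before concatenating.
for chunk in pd.read_csv('cmct_0430x.csv', sep=r'\s+', chunksize=100000):
    filtered.append(chunk[chunk['dur'] != 0])

df = pd.concat(filtered, ignore_index=True)

You can run the same loop over each of your three files and concatenate the results, since only the filtered rows are kept in memory.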
