
I have a CSV file larger than 16 GB, where each row contains text data. When I encode the whole file (e.g. one-hot encoding), my process is killed due to the memory limit. Is there a way to process this kind of "big data"?

I am thinking of splitting the whole CSV file into multiple smaller files, processing them, and then appending the results to another CSV file. Is that a correct way to handle a huge CSV file?


2 Answers


Your question does not state what language you are using to handle this CSV file. I will reply using C#, but I imagine that the strategy will work equally well for Java too.

You can try using the StreamReader class to read the file line-by-line. That should take care of the read side of things.

Something like:

using (var reader = new StreamReader(...))
{
    string line;

    // Read one line at a time so the whole file is never held in memory.
    while ((line = reader.ReadLine()) != null)
    {
        Process(line);
    }
}

NB: That's a code snippet in C#, and is more pseudo-code than actual code.

You should create a database using some kind of local DB technology, such as SQLite, SQL Server LocalDB, or even MySQL, and load the data into a table or tables there.

You can then write any other further processing based on the data held in the database rather than in a simple text file.
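A minimal sketch of that load step, assuming SQLite and written in Python since the question does not name a language (the same pattern applies in C# with an ADO.NET SQLite provider). The file name "data.csv", the database name "data.db", the one-column table layout, and the batch size are all illustrative placeholders:

import csv
import sqlite3

# Stream the CSV into a local SQLite table in batches so memory use stays flat.
conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS rows (text TEXT)")

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    batch = []
    for row in reader:
        batch.append((row[0],))
        if len(batch) >= 10000:  # commit in batches to keep inserts fast
            conn.executemany("INSERT INTO rows (text) VALUES (?)", batch)
            conn.commit()
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO rows (text) VALUES (?)", batch)
        conn.commit()

conn.close()

Batching the inserts and committing periodically keeps each transaction small while still loading the whole file in a single streaming pass.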

Umar Farooq Khawaja

This has been discussed in Reading huge csv files efficiently?

Probably the most reasonable thing to do with a 16 GB CSV file would be not to load it all into memory, but to read and process it line by line:

import csv

with open(filename, "r") as f:
    lines = csv.reader(f)
    for line in lines:
        # Process the line here, one row at a time
        pass
  • So I can read and process line by line instead of loading all into memory, right? So the writing to file would be line by line, correct? – Kun Oct 12 '16 at 16:40
  • Yes, that's how you should do it if you want to get a modified copy of the file. Open another file and write processed lines to that one. – Markus Lippus Oct 13 '16 at 06:35
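A minimal sketch of that read-process-write pattern, with illustrative file names and a hypothetical transform_row function standing in for the actual per-row encoding; only one row is held in memory at a time:

import csv

def transform_row(row):
    # Placeholder: replace with the real per-row processing/encoding.
    return row

with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        writer.writerow(transform_row(row))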