
I have a CSV file larger than 16 GB, where each row contains text data. When I encode the whole file (e.g. one-hot encoding), my process is killed due to the memory limit. Is there a way to process this kind of "big data"?

I am thinking of splitting the whole CSV file into multiple smaller files, processing them, and then appending the results to another CSV file. Is that a correct way to handle a huge CSV file?


2 Answers


Your question does not state what language you are using to handle this CSV file. I will reply using C#, but I imagine that the strategy will work equally well for Java too.

You can try using the StreamReader class to read the file line-by-line. That should take care of the read side of things.

Something like:

using (var reader = new StreamReader(...))
{
    string line;

    // Read one line at a time so the whole file is never held in memory.
    while ((line = reader.ReadLine()) != null)
    {
        Process(line);
    }
}

NB: That's a code snippet in C#, and is more pseudo-code than actual code.

You should create a database using some kind of local DB technology, such as SQLite, SQL Server LocalDB, or even MySQL, and load the data into a table or tables there.

You can then write any other further processing based on the data held in the database rather than in a simple text file.
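A minimal sketch of that load step, assuming SQLite and written in Python since the question does not name a language (the same pattern applies in C# with an ADO.NET SQLite provider). The file name "data.csv", the database name "data.db", the one-column table layout, and the batch size are all illustrative placeholders:

import csv
import sqlite3

# Stream the CSV into a local SQLite table in batches so memory use stays flat.
conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS rows (text TEXT)")

with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    batch = []
    for row in reader:
        batch.append((row[0],))
        if len(batch) >= 10000:  # commit in batches to keep inserts fast
            conn.executemany("INSERT INTO rows (text) VALUES (?)", batch)
            conn.commit()
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO rows (text) VALUES (?)", batch)
        conn.commit()

conn.close()

Batching the inserts and committing periodically keeps each transaction small while still loading the whole file in a single streaming pass.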

Umar Farooq Khawaja

This has been discussed in Reading huge csv files efficiently?

Probably the most reasonable thing to do with a 16 GB CSV file would be not to load it all into memory, but to read and process it line by line:

import csv

with open(filename, "r") as f:
    lines = csv.reader(f)
    for line in lines:
        # Process the line here, one row at a time
        pass
  • So I can read and process line by line instead of loading all into memory, right? So the writing to file would be line by line, correct? – Kun Oct 12 '16 at 16:40
  • Yes, that's how you should do it if you want to get a modified copy of the file. Open another file and write processed lines to that one. – Markus Lippus Oct 13 '16 at 06:35
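A minimal sketch of that read-process-write pattern, with illustrative file names and a hypothetical transform_row function standing in for the actual per-row encoding; only one row is held in memory at a time:

import csv

def transform_row(row):
    # Placeholder: replace with the real per-row processing/encoding.
    return row

with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        writer.writerow(transform_row(row))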