3

I'm trying to process a CSV file with ~73 billion rows.
I'm storing the processed rows in a Python collections.defaultdict with strings as keys and tuples as values, but storing 50K rows into this data structure takes ~100 seconds.

I'm processing the CSV file in chunks of 50K rows so the system doesn't run out of memory and start swapping to disk.
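Roughly, that kind of chunked reading looks like the sketch below (the file name "data.csv" and the per-chunk process() call are placeholders, not the actual code):

```python
import csv
from itertools import islice

CHUNK_SIZE = 50_000  # rows held in memory at a time

def iter_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield lists of up to chunk_size parsed CSV rows."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk

# for rows in iter_chunks("data.csv"):   # "data.csv" is a placeholder name
#     process(rows)                      # placeholder for the per-chunk work
```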

Later on I load those processed CSV files into tables and perform a FULL OUTER JOIN to obtain the combined result.

Example CSV row (ID, value):

"10203","http://google.com/goo.gl?key='universe'&value='somedata'"

Data Structure:

dt = {'goog': [(10203, 1), ...]}

Basically I'm trying to implement a full-text-search feature - for that I need to maintain the position of every 4-character slice of the value along with its associated ID.
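For illustration, here is a minimal sketch of building that index with a defaultdict; it assumes the second element of each tuple is the character position of the 4-character slice, and the function name index_rows is just a placeholder:

```python
from collections import defaultdict

def index_rows(rows, index=None):
    """Append (id, position) for every 4-character slice of each value."""
    if index is None:
        index = defaultdict(list)
    for row_id, value in rows:
        row_id = int(row_id)
        for pos in range(len(value) - 3):
            index[value[pos:pos + 4]].append((row_id, pos))
    return index

idx = index_rows([("10203", "http://google.com/goo.gl?key='universe'&value='somedata'")])
print(idx['goog'])   # [(10203, 7)] -- 'goog' starts at character 7 of the value
```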

shahjapan
  • possible duplicate of [Parallel processing of a large .csv file in Python](http://stackoverflow.com/questions/8424771/parallel-processing-of-a-large-csv-file-in-python) – hjpotter92 Jun 09 '13 at 05:25
  • 2
    When you are processing that many rows, you clearly must do processing "on the fly" as you cannot store this much data in memory on most machines (73 billion rows = probably over 1 TB of memory?). Without knowing more about the data structure it will be hard to give any advice; but almost certainly a dict structure is not the most efficient way to go. What kind of data is this, and how are you aggregating it? – Floris Jun 09 '13 at 05:26
  • What's wrong with what you are doing? Why do you think that 100 seconds for 50K rows is slow? Also, can you post one row from the CSV file for our understanding? – Srikar Appalaraju Jun 09 '13 at 05:49
  • Yes, that's why I'm doing it in chunks: I aggregate 50K rows at a time in memory and then dump them into a temp table; later on I create a new final aggregated table by making a full outer join of each individual table. The data structure is fairly simple: 4 characters as key, e.g. 'asdf', and for each 'asdf' key the aggregated value for 50K rows would look like "[(1,0), (2,5)]" – shahjapan Jun 09 '13 at 05:50
  • @SrikarAppal The problem is that it takes overnight to process the data for aggregation, and this is a daily task - I'm supposed to finish this processing in ~2-3 hours – shahjapan Jun 09 '13 at 05:55
  • 2
    Why don't you simply load the file directly in the database? – Burhan Khalid Jun 09 '13 at 05:56
  • It has been dumped from the database only for processing; the processing generates another table, so once it's done the result will be loaded back into the database. – shahjapan Jun 09 '13 at 05:58
  • @shahjapan: would you consider writing a database routine to do the work without all this to-CSV, from-CSV stuff? – John Zwinck Jun 09 '13 at 06:15
  • Wouldn't a database routine iterating over 17 billion rows bring the database down? – shahjapan Jun 09 '13 at 06:36

2 Answers

5

Here are some things that come to mind:

  1. As @Burhad suggests, why can't you load this file directly into the DB? Any kind of string processing like you are doing can be done in a regular RDBMS like MySQL; they have string functions, you know. A simple DB routine could do all this within the DB itself, without even writing the data to a file in the first place.
  2. If you don't want to take the above approach, I suggest you try this: split the file into, let's say, n smaller files. Start a master process which forks n sub-processes to process these n chunk files in parallel (see the sketch after this list). That way, in 100 seconds you would theoretically get n * 50,000 rows processed. Note that I say "theoretically" because if all this is happening on a single hard disk, the disk might not transmit data concurrently, so there may be delays in satisfying concurrent requests (but then again, the I/O algorithms on modern operating systems cache a portion of the file being read, which might get you close to the above figures).
  3. An extension of the above approach is to use multiple hard disks, all part of the same RAID array, on the same local machine.
  4. If you require even more throughput, think distributed computing: say x machines, each with y hard disks. Make x*y chunks of your file, put them on these machines, and run your processing program. Then you process x*y*50,000 rows in the same 100 seconds. The throughput increases with the number of machines and hard disks employed. You might have to deal with some new problems of distributed computing (availability, fault tolerance, etc.), but still...

Points 2, 3 and 4 are predicated on the assumption that each row in your CSV file can be processed independently and that there are no dependencies amongst rows.
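Here is a minimal sketch of point 2 using Python's multiprocessing module. It assumes the chunk files (chunk_000.csv, ...) already exist and that the per-chunk work is the 4-gram indexing described in the question; the function names are illustrative, not your actual code:

```python
import csv
from collections import defaultdict
from multiprocessing import Pool

def process_chunk(path):
    """Worker: build a partial 4-gram index for one chunk file."""
    index = defaultdict(list)
    with open(path, newline='') as f:
        for row_id, value in csv.reader(f):
            for pos in range(len(value) - 3):
                index[value[pos:pos + 4]].append((int(row_id), pos))
    return index

def merge(partials):
    """Combine the partial indexes returned by the workers."""
    combined = defaultdict(list)
    for part in partials:
        for key, entries in part.items():
            combined[key].extend(entries)
    return combined

if __name__ == '__main__':
    # chunk_000.csv ... chunk_007.csv are assumed to already exist,
    # e.g. produced by splitting the original CSV beforehand.
    chunk_files = ["chunk_%03d.csv" % i for i in range(8)]
    with Pool(processes=8) as pool:
        combined = merge(pool.map(process_chunk, chunk_files))
```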

Srikar Appalaraju
0

Finally I found the solution best suited to my current requirement.

Previously this task ran for around 20-24 hours; now it takes around half an hour.

The programming model I was looking for was the MapReduce programming model, which was easy to use and easy to code for my requirement.

It's really fast and efficient: I'm using the gpmapreduce utility with Python for this.

Thanks to @Srikar-Appal; it's almost the same as his 4th solution, which inspired me to use the MapReduce model.
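To illustrate the idea (this is not my actual gpmapreduce job, just a plain-Python sketch of how the map and reduce steps split up the 4-gram indexing):

```python
from itertools import groupby

def map_fn(row_id, value):
    """Map step: emit (4-gram, (id, position)) pairs for one CSV row."""
    for pos in range(len(value) - 3):
        yield value[pos:pos + 4], (int(row_id), pos)

def reduce_fn(key, values):
    """Reduce step: collect every (id, position) pair for one 4-gram."""
    return key, list(values)

def run(rows):
    """Tiny in-process driver; a real MapReduce engine does the shuffle/sort."""
    pairs = sorted(p for row in rows for p in map_fn(*row))
    return dict(reduce_fn(k, (v for _, v in grp))
                for k, grp in groupby(pairs, key=lambda kv: kv[0]))

result = run([("10203", "http://google.com/goo.gl?key='universe'&value='somedata'")])
print(result['goog'])   # [(10203, 7)]
```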

shahjapan