I'm trying to process a CSV file with ~73 billion rows.
I'm storing the processed rows in a Python collections.defaultdict with strings as keys and tuples as values, but inserting into this dictionary takes ~100 seconds per 50K rows.
I'm processing the CSV file in chunks of 50K rows to keep the system from running out of memory and to avoid disk spill / swapping.
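Here is roughly what the chunked reading looks like (a minimal sketch, not my exact code; the file path and the read_chunks name are placeholders):

import csv
from itertools import islice

CHUNK_SIZE = 50_000  # 50K rows per chunk, as described above

def read_chunks(path, chunk_size=CHUNK_SIZE):
    # Yield lists of [id, value] rows, chunk_size rows at a time,
    # so only one chunk is in memory at once.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk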
Later on I'm loading the processed CSV files into a table and doing a FULL OUTER JOIN to obtain the combined result.
Example CSV row (ID, value):
"10203","http://google.com/goo.gl?key='universe'&value='somedata'"
Data Structure:
dt = {'goog': [(10203, 1), ...]}
Basically I'm trying to implement a full-text-search feature. For that I need to index each value in pieces of 4 characters and keep each piece's position together with its associated ID.
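For context, this is roughly how each chunk ends up in the defaultdict (a minimal sketch assuming non-overlapping 4-character slices; make_grams is just an illustrative name, and the real code may use overlapping 4-grams instead):

from collections import defaultdict

def make_grams(value, size=4):
    # Yield (4-character piece, position) pairs for one value string.
    for pos, start in enumerate(range(0, len(value), size)):
        yield value[start:start + size], pos

dt = defaultdict(list)
row_id = 10203
value = "http://google.com/goo.gl?key='universe'&value='somedata'"
for gram, pos in make_grams(value):
    dt[gram].append((row_id, pos))
# dt now maps each 4-character piece to a list of (ID, position) tuples,
# e.g. {'http': [(10203, 0)], '://g': [(10203, 1)], ...}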