
I have a 20+GB dataset that is structured as follows:

    1 3
    1 2
    2 3
    1 4
    2 1
    3 4
    4 2

(Note: the repetition is intentional and there is no inherent order in either column.)

I want to construct a file in the following format:

    1: 2, 3, 4
    2: 3, 1
    3: 4
    4: 2

Here is my problem: I have tried writing scripts in both Python and C++ that load the file, build long strings, and write them to an output file line by line. Neither attempt seems able to handle a file of this size. Does anyone have suggestions on how to tackle this? Specifically, is there a particular method or program that is well suited to the task? Any help or pointers in the right direction would be greatly appreciated.
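
For reference, here is a simplified sketch of the kind of Python script I tried (not my exact code; the file names are just placeholders):

    # Read every pair into an in-memory dict of lists, then write one
    # line per key.
    from collections import defaultdict

    groups = defaultdict(list)
    with open("pairs.txt") as f:              # placeholder input name
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                groups[parts[0]].append(parts[1])

    with open("grouped.txt", "w") as out:     # placeholder output name
        for key, values in groups.items():
            out.write("%s: %s\n" % (key, ", ".join(values)))

This builds the whole mapping in memory, which is where it falls over on an input this large.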

Ammar

4 Answers


You can try this with Hadoop by running a stand-alone MapReduce job. The mapper emits the first column as the key and the second column as the value. All outputs with the same key go to the same reducer, so each reducer receives a key together with the list of values for that key. It can then run through that list and emit (key, valueString), which is exactly the output you want. A basic Hadoop tutorial is enough to get started; write the mapper and reducer as described above. However, I have not tried scaling 20 GB of data on a stand-alone Hadoop setup, so you may need to experiment. Hope this helps.
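
One way to run this is Hadoop Streaming, which lets the mapper and reducer be plain Python scripts reading stdin and writing stdout. A rough sketch under that assumption (the script names, file paths, and jar location are placeholders, not part of the answer above):

    #!/usr/bin/env python
    # mapper.py: emit "first_column<TAB>second_column" for every pair
    import sys

    for line in sys.stdin:
        parts = line.split()
        if len(parts) == 2:
            print("%s\t%s" % (parts[0], parts[1]))

The framework sorts the mapper output by key before it reaches the reducer, so all values for a key arrive as consecutive lines and the reducer only has to collect them:

    #!/usr/bin/env python
    # reducer.py: group consecutive lines that share a key and print
    # "key: v1, v2, ..." once per key
    import sys

    current_key = None
    values = []
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key and current_key is not None:
            print("%s: %s" % (current_key, ", ".join(values)))
            values = []
        current_key = key
        values.append(value)
    if current_key is not None:
        print("%s: %s" % (current_key, ", ".join(values)))

The job could then be launched with something along these lines:

    hadoop jar /path/to/hadoop-streaming.jar \
        -input /data/pairs -output /data/grouped \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py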

TechCrunch

Have you tried using a std::vector of std::vector?

Each slot in the outer vector corresponds to one row of the output and holds a vector of all the values for that row. This assumes the row number can be used directly as an index into the outer vector.

Otherwise, you can try std::map<unsigned int, std::vector<unsigned int> >, where the key is the row number and the vector contains all values for the row.

A std::list of values would also work.

Does your program run out of memory?

Edit 1: Handling large data files
You can handle your issue by treating it like a merge sort.
Open a file for each row number. Append the 2nd column values to the file. After all data is read, close all files. Open each file and read the values and print them out, comma separated.
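
A minimal sketch of this idea, in Python for brevity since the question mentions both languages (file names are placeholders; it assumes the number of distinct row numbers stays under the OS open-file limit and that any single row's values fit in memory):

    # Pass 1: stream through the input once, appending each value to a
    # per-key file. File handles are kept open in a dict.
    handles = {}
    with open("pairs.txt") as f:                  # placeholder input name
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            key, value = parts
            if key not in handles:
                handles[key] = open("key_%s.tmp" % key, "w")
            handles[key].write(value + "\n")
    for h in handles.values():
        h.close()

    # Pass 2: read each per-key file back and write "key: v1, v2, ..."
    with open("grouped.txt", "w") as out:         # placeholder output name
        for key in sorted(handles, key=int):      # keys are integers here
            with open("key_%s.tmp" % key) as bucket:
                values = [v.strip() for v in bucket]
            out.write("%s: %s\n" % (key, ", ".join(values)))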

Thomas Matthews

  1. Open an output file for each key.
  2. While iterating over the lines of the source file, append each value to the output file for its key.
  3. Join the output files.

An interesting thought, found elsewhere on Stack Overflow:

If you want to persist a large dictionary, you are basically looking at a database.

As recommended there, use Python's sqlite3 module to write into a table with an auto-incremented primary key, a field called "key" (or "left"), and a field called "value" (or "right").

Then SELECT the MIN(key) and MAX(key) from the table, and with that information you can SELECT all rows that share a given "key" (or "left") value, in sorted order, and print that information to an output file (if the database itself is not a good enough output for you).
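
A rough sketch of this approach using Python's sqlite3 module (the table, column, and file names are illustrative, and the keys are assumed to be integers as in the example above):

    import sqlite3

    conn = sqlite3.connect("pairs.db")            # placeholder database file
    conn.execute('CREATE TABLE IF NOT EXISTS pairs ('
                 'id INTEGER PRIMARY KEY AUTOINCREMENT, '
                 '"key" INTEGER, "value" INTEGER)')

    # Load the input in batches so memory use stays bounded.
    batch = []
    with open("pairs.txt") as f:                  # placeholder input name
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                batch.append((int(parts[0]), int(parts[1])))
            if len(batch) >= 100000:
                conn.executemany('INSERT INTO pairs ("key", "value") VALUES (?, ?)', batch)
                batch = []
    if batch:
        conn.executemany('INSERT INTO pairs ("key", "value") VALUES (?, ?)', batch)
    conn.commit()
    conn.execute('CREATE INDEX IF NOT EXISTS idx_key ON pairs ("key")')

    # Walk from MIN(key) to MAX(key) and print each key's values in order.
    lo, hi = conn.execute('SELECT MIN("key"), MAX("key") FROM pairs').fetchone()
    with open("grouped.txt", "w") as out:         # placeholder output name
        for k in range(lo, hi + 1):
            values = [str(v) for (v,) in conn.execute(
                'SELECT "value" FROM pairs WHERE "key" = ? ORDER BY id', (k,))]
            if values:
                out.write("%d: %s\n" % (k, ", ".join(values)))
    conn.close()

Batching the INSERTs keeps memory use bounded, and the index on "key" keeps the per-key SELECTs cheap.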

I have suggested this approach on the assumption that you call this problem "big data" because the keys do not fit into memory (otherwise, a simple Python dictionary would be enough). However, IMHO this question is not correctly tagged as "big data": to require distributed computation on Hadoop or similar, your input data should be much larger than what fits on a single hard drive, or your computation should be much more costly than a simple hash-table lookup and insertion.

logc