
I have a 20+GB dataset that is structured as follows:

    1 3
    1 2
    2 3
    1 4
    2 1
    3 4
    4 2

(Note: the repetition is intentional and there is no inherent order in either column.)

I want to construct a file in the following format:

    1: 2, 3, 4
    2: 3, 1
    3: 4
    4: 2

Here is my problem: I have tried writing scripts in both Python and C++ that load the file, build long strings, and write them to an output file line by line. Neither attempt seems able to handle a file of this size. Does anyone have suggestions on how to tackle this? Specifically, is there a particular method or program that is well suited to the task? Any help or pointers in the right direction would be greatly appreciated.
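
For reference, here is a simplified sketch of the kind of Python script I tried (not my exact code; the file names are just placeholders):

    # Read every pair into an in-memory dict of lists, then write one
    # line per key.
    from collections import defaultdict

    groups = defaultdict(list)
    with open("pairs.txt") as f:              # placeholder input name
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                groups[parts[0]].append(parts[1])

    with open("grouped.txt", "w") as out:     # placeholder output name
        for key, values in groups.items():
            out.write("%s: %s\n" % (key, ", ".join(values)))

This builds the whole mapping in memory, which is where it falls over on an input this large.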

Ammar

4 Answers


You can try this with Hadoop by running a stand-alone MapReduce job. The mapper emits the first column as the key and the second column as the value. All outputs with the same key go to the same reducer, so each reducer receives a key together with the list of values for that key. It can then run through that list and emit (key, valueString), which is exactly the output you want. A basic Hadoop tutorial is enough to get started; write the mapper and reducer as described above. However, I have not tried scaling 20 GB of data on a stand-alone Hadoop setup, so you may need to experiment. Hope this helps.
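
One way to run this is Hadoop Streaming, which lets the mapper and reducer be plain Python scripts reading stdin and writing stdout. A rough sketch under that assumption (the script names, file paths, and jar location are placeholders, not part of the answer above):

    #!/usr/bin/env python
    # mapper.py: emit "first_column<TAB>second_column" for every pair
    import sys

    for line in sys.stdin:
        parts = line.split()
        if len(parts) == 2:
            print("%s\t%s" % (parts[0], parts[1]))

The framework sorts the mapper output by key before it reaches the reducer, so all values for a key arrive as consecutive lines and the reducer only has to collect them:

    #!/usr/bin/env python
    # reducer.py: group consecutive lines that share a key and print
    # "key: v1, v2, ..." once per key
    import sys

    current_key = None
    values = []
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key and current_key is not None:
            print("%s: %s" % (current_key, ", ".join(values)))
            values = []
        current_key = key
        values.append(value)
    if current_key is not None:
        print("%s: %s" % (current_key, ", ".join(values)))

The job could then be launched with something along these lines:

    hadoop jar /path/to/hadoop-streaming.jar \
        -input /data/pairs -output /data/grouped \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py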

TechCrunch

Have you tried using a std::vector of std::vector?

Each slot in the outer vector corresponds to one row of the output and holds a vector of all the values for that row. This assumes the row number can be used directly as an index into the outer vector.

Otherwise, you can try std::map<unsigned int, std::vector<unsigned int> >, where the key is the row number and the vector contains all values for the row.

A std::list of values would also work.

Does your program run out of memory?

Edit 1: Handling large data files
You can handle your issue by treating it like a merge sort.
Open a file for each row number. Append the 2nd column values to the file. After all data is read, close all files. Open each file and read the values and print them out, comma separated.
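
A minimal sketch of this idea, in Python for brevity since the question mentions both languages (file names are placeholders; it assumes the number of distinct row numbers stays under the OS open-file limit and that any single row's values fit in memory):

    # Pass 1: stream through the input once, appending each value to a
    # per-key file. File handles are kept open in a dict.
    handles = {}
    with open("pairs.txt") as f:                  # placeholder input name
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            key, value = parts
            if key not in handles:
                handles[key] = open("key_%s.tmp" % key, "w")
            handles[key].write(value + "\n")
    for h in handles.values():
        h.close()

    # Pass 2: read each per-key file back and write "key: v1, v2, ..."
    with open("grouped.txt", "w") as out:         # placeholder output name
        for key in sorted(handles, key=int):      # keys are integers here
            with open("key_%s.tmp" % key) as bucket:
                values = [v.strip() for v in bucket]
            out.write("%s: %s\n" % (key, ", ".join(values)))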

Thomas Matthews

  1. Open an output file for each key.
  2. While iterating over the lines of the source file, append each value to the output file for its key.
  3. Join the output files.

An interesting thought, found elsewhere on Stack Overflow:

If you want to persist a large dictionary, you are basically looking at a database.

As recommended there, use Python's sqlite3 module to write into a table with an auto-incremented primary key, a field called "key" (or "left"), and a field called "value" (or "right").

Then SELECT the MIN(key) and MAX(key) from the table, and with that information you can SELECT all rows that share a given "key" (or "left") value, in sorted order, and print that information to an output file (if the database itself is not a good enough output for you).
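
A rough sketch of this approach using Python's sqlite3 module (the table, column, and file names are illustrative, and the keys are assumed to be integers as in the example above):

    import sqlite3

    conn = sqlite3.connect("pairs.db")            # placeholder database file
    conn.execute('CREATE TABLE IF NOT EXISTS pairs ('
                 'id INTEGER PRIMARY KEY AUTOINCREMENT, '
                 '"key" INTEGER, "value" INTEGER)')

    # Load the input in batches so memory use stays bounded.
    batch = []
    with open("pairs.txt") as f:                  # placeholder input name
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                batch.append((int(parts[0]), int(parts[1])))
            if len(batch) >= 100000:
                conn.executemany('INSERT INTO pairs ("key", "value") VALUES (?, ?)', batch)
                batch = []
    if batch:
        conn.executemany('INSERT INTO pairs ("key", "value") VALUES (?, ?)', batch)
    conn.commit()
    conn.execute('CREATE INDEX IF NOT EXISTS idx_key ON pairs ("key")')

    # Walk from MIN(key) to MAX(key) and print each key's values in order.
    lo, hi = conn.execute('SELECT MIN("key"), MAX("key") FROM pairs').fetchone()
    with open("grouped.txt", "w") as out:         # placeholder output name
        for k in range(lo, hi + 1):
            values = [str(v) for (v,) in conn.execute(
                'SELECT "value" FROM pairs WHERE "key" = ? ORDER BY id', (k,))]
            if values:
                out.write("%d: %s\n" % (k, ", ".join(values)))
    conn.close()

Batching the INSERTs keeps memory use bounded, and the index on "key" keeps the per-key SELECTs cheap.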

I have suggested this approach on the assumption that you call this problem "big data" because the keys do not fit into memory (otherwise, a simple Python dictionary would be enough). However, IMHO this question is not correctly tagged as "big data": to require distributed computation on Hadoop or similar, your input data should be much larger than what fits on a single hard drive, or your computation should be much more costly than a simple hash-table lookup and insertion.

logc