4

Interview prep question: Given a long list of words, return a list of distinct words along with a count of how many times each occurred, using only 16 GB of memory.

I am thinking of using a HashMap that stores only one copy of each word and then increments its value whenever the same word is encountered again. This is the only approach I can think of. The overall complexity would be O(n), as I need to traverse the entire list of words once to populate my HashMap and/or increment the counts.

What I am unsure about is how to factor the 16 GB memory limit into my answer.

If each word is 100 bytes (which it likely is not, as word length varies), then my HashMap can hold some number x of words, but so what? How can I logically arrive at the limit of my solution? I'm a little unsure of how to proceed here.
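(Working through my own 100-byte guess: 16 GB is 17,179,869,184 bytes, and 17,179,869,184 / 100 ≈ 170 million entries. In other words, the in-memory approach only breaks down once the number of distinct words, not the total number of words, approaches hundreds of millions. Is that the kind of reasoning the interviewer is after?)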

stretchr

6 Answers

3

Firstly, this is not a sorting problem. At a fundamental level it is a collating problem.

I can think of three approaches to solving it ... depending on the number of distinct words you have.

  • If the number of words is small enough that you can hold each distinct word and a count in some kind of map (e.g. a TreeMap<String, Integer>) then the solution is simple and obvious. This requires space (in memory) that is proportional to the number of distinct words. (A TreeMap gives you the counts in word order ...)

  • If you can't do that, then there are two approaches:

    • The sort-merge approach:

      1. Divide the input file into N sub-files.
      2. Sort the words in each sub-file (in memory) and write them out.
      3. Open the N sorted sub-files, and read-and-count the words, each time selecting and counting the smallest word presented across the N files.

      The last step can be performed with space that is proportional to N, and the list of counts can be written straight out to a file. Unless N (the number of subfiles) is huge, you can easily accommodate an input file with "too many" distinct words.

    • The partitioning approach:

      1. Choose N words that partition the "space" of the words into N roughly equal parts.
      2. Then read the input file N times, selecting the words between word_i and word_(i+1), and collate / count them (using a TreeMap as above).
      3. After each collate / count, append the lists of counts to the output file.

The three approaches have roughly the same runtime complexity, O(M log M), where M is the total number of words, assuming that a TreeMap is used. If you use a HashMap, you get O(M) complexity, but the output list is not ordered ... and it doesn't work for the "complicated" approaches.
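As a minimal sketch of the first (in-memory map) approach, assuming the input file path is passed as args[0] and that words are whitespace-separated tokens; the class name is my own:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

public class CollateCount
{
    public static void main(String[] args) throws IOException
    {
        Map<String, Integer> counts = new TreeMap<>(); // TreeMap keeps the distinct words in word order
        try (Scanner in = new Scanner(Paths.get(args[0]), StandardCharsets.UTF_8))
        {
            while (in.hasNext())
            {
                counts.merge(in.next(), 1, Integer::sum); // add the word, or bump its existing count
            }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet())
        {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}

Swapping TreeMap for HashMap gives the unordered O(M) variant mentioned above.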

Stephen C
  • in a way we are sorting a list of words into a list of distinct words with their respective counts, aren't we? thanks for the approaches. – stretchr Oct 27 '14 at 07:25
  • In a way. But that is not how we use the word "sort" in IT. At the fundamental level, this does not require us to place the words into any specific order, and that is what "sorting" means in IT. – Stephen C Oct 27 '14 at 07:27
  • right, I do agree with you. Removing the 'sorting' tag. – stretchr Oct 27 '14 at 07:38
1

The code below will work in O(n):

import java.util.HashSet;
import java.util.Scanner;

class WordStore
{
    private static final long MAX_BYTES = 17179869184L;           // 16 GB
    private static final HashSet<String> words = new HashSet<>(); // stores only one copy of each word
    private static int count;                                      // total number of words accepted so far
    private static long bytes;                                     // running byte total, assuming ~1 byte per character

    // Static store (the "singleton approach").
    // Original check: byte!=17179869184 || (byte+word.length()<17179869184), simplified below.
    // Returns true while the word still fits within the 16 GB budget, false once it is exhausted.
    public static boolean addWord(String word)
    {
        if (bytes + word.length() <= MAX_BYTES) // checking if the words' size is still within 16 GB
        {
            words.add(word);
            count++;
            bytes += word.length();
            return true;
        }
        else
        {
            System.out.println("Word Count in 16GB :" + count);
            return false;
        }
    }
}

class Loader
{
    public static void main(String[] args)
    {
        Scanner in = new Scanner(System.in);
        while (in.hasNext()) // read words one by one
        {
            String word = in.next();
            if (!WordStore.addWord(word))
            {
                break; // stop once the 16 GB budget is used up
            }
        }
    }
}
Kumar Gaurav
  • hi, what does this mean: (byte!=17179869184 || (byte+word.length()<17179869184)? – stretchr Oct 27 '14 at 07:18
  • 17179869184 bytes is 16 GB. I'm counting the length of each string and adding it up; since 1 char means 1 byte, the max number of chars would be 17179869184 to stay within 16 GB. The first condition is that we'll add the word if the total number of bytes is not equal to 17179869184, or if the total bytes added plus the length of the next word is less than 17179869184 (i.e. 16 GB). – Kumar Gaurav Oct 27 '14 at 07:22
1

Read in the list and transform the words like word -> (word, 1). Then you collect all the pairs with the same word in a collector and sum the counters, like (word, 1) + (word, 1) = (word, 2). A hashmap performs well here.

If the collector grows over a specific limit, write the tuples to disk (a good point for tuning, e.g. writing out only some of the less frequent words). The output of the collector is ((word1, count1) ... (wordN, countN)). When writing, you should split the tuples by hash-key(word) into several files, each smaller than memory, say filename = "split" + numwrites + (key%5), where numwrites is just a counter for the splits.

As each word ends up in a specific split-suffix (key%5), you now just read and sum all the splits with the same ending.

Note that this hash-based approach will only work if you have an idea about the number of splits you need. If you do not, the collector can sort the tuples for each of its output files and you merge them in a last phase. Merge sort does not have any larger memory constraints, but has complexity O(n log n).

See MapReduce for the description of the general approach.
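For concreteness, here is a rough sketch of the split-by-hash idea. It makes a few assumptions of my own: words arrive on standard input, partial counts are appended to one file per hash bucket (rather than a new file per flush as in the filename scheme above), and each bucket's totals fit in memory during the second pass. The class name and file names are made up.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class SplitCounter
{
    private static final int SPLITS = 5;              // number of hash buckets, as in key%5 above
    private static final int SPILL_LIMIT = 1_000_000; // flush the in-memory collector at this size (a tuning knob)

    public static void main(String[] args) throws IOException
    {
        Map<String, Long> collector = new HashMap<>();
        try (Scanner in = new Scanner(System.in))
        {
            while (in.hasNext())
            {
                collector.merge(in.next(), 1L, Long::sum); // (word, 1) pairs summed in memory
                if (collector.size() >= SPILL_LIMIT)
                {
                    spill(collector);                      // write partial counts out to the split files
                    collector.clear();
                }
            }
        }
        spill(collector);                                  // flush whatever is left

        // Second pass: each word lives in exactly one split, so each split needs a much smaller map.
        for (int split = 0; split < SPLITS; split++)
        {
            File f = new File("split" + split + ".tsv");
            if (!f.exists()) continue;
            Map<String, Long> totals = new HashMap<>();
            try (BufferedReader r = new BufferedReader(new FileReader(f, StandardCharsets.UTF_8)))
            {
                String line;
                while ((line = r.readLine()) != null)
                {
                    int tab = line.lastIndexOf('\t');
                    totals.merge(line.substring(0, tab), Long.parseLong(line.substring(tab + 1)), Long::sum);
                }
            }
            totals.forEach((w, c) -> System.out.println(w + "\t" + c)); // final (word, count) output
        }
    }

    // Append partial counts to one file per bucket, so the same word always lands in the same split.
    private static void spill(Map<String, Long> collector) throws IOException
    {
        BufferedWriter[] out = new BufferedWriter[SPLITS];
        try
        {
            for (Map.Entry<String, Long> e : collector.entrySet())
            {
                int split = Math.floorMod(e.getKey().hashCode(), SPLITS);
                if (out[split] == null)
                {
                    out[split] = new BufferedWriter(new FileWriter("split" + split + ".tsv", StandardCharsets.UTF_8, true));
                }
                out[split].write(e.getKey() + "\t" + e.getValue());
                out[split].newLine();
            }
        }
        finally
        {
            for (BufferedWriter w : out)
            {
                if (w != null) w.close();
            }
        }
    }
}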

cybye
1

There are three possible approaches depending on the size of the list:

  1. The list fits in memory: load the entire list, sort it (e.g. with quicksort), do a running count over the sorted words, and store the result in the output file (see the sketch after this list).

  2. The list is too big for memory but the number of unique words fits in memory: maintain a TreeMap or HashMap from word to count, read the words one by one, and increment the counts.

  3. Even the unique words cannot fit in memory:

  1. Divide the memory into two parts: one for fetching and processing words, the second for maintaining counts.
  2. Read input until it fills part one.
  3. Sort that input.
  4. In the second part, maintain a TreeMap and an array.
  5. The TreeMap maps each word to an index in the array.
  6. Traverse the sorted input sequentially.
  7. After counting a word, add its count to the array and the word to the TreeMap.
  8. When that input is exhausted, load the next batch of words until part one is filled again.
  9. If the second part is full, you need to replace an entry and save it to disk.
  10. The replacement candidate can be chosen carefully as the word lexicographically farthest from the current word; this reduces the number of replacements needed.
  11. Check whether a word has already been encountered by checking whether a file with its name exists.
  12. If such a file is present, load the count from it and add it to the count in the array.
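A small sketch of option 1 from the list above, assuming the words fit in memory and arrive one per line in a file passed as args[0]; Arrays.sort stands in for the hand-rolled quicksort:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class SortAndCount
{
    public static void main(String[] args) throws IOException
    {
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        String[] words = lines.toArray(new String[0]);
        Arrays.sort(words);                    // sort so equal words form contiguous runs

        for (int i = 0; i < words.length; )
        {
            int j = i;
            while (j < words.length && words[j].equals(words[i]))
            {
                j++;                           // running count: advance to the end of the run
            }
            System.out.println(words[i] + "\t" + (j - i));
            i = j;
        }
    }
}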
Vikram Bhat
0

Just use disk/files, since you have memory constraints. Let's say you have a 48 GB file of words. Load the first 1/3 of the file into memory (16 GB). As you iterate through the words, if a file named after the word does not exist, create it and write 1 into it. If a file with that name exists, increment the value in the file by 1 and save it. Then repeat this process for the second third and the last third of the file. In the end, you will have a file per unique word, and the contents of each file will be the number of times the word was encountered.
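A minimal sketch of the file-per-word idea, assuming words arrive on standard input and are safe to use as file names (the directory name is my own, and the file-count limits raised in the comments below still apply):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Scanner;

public class FilePerWordCounter
{
    public static void main(String[] args) throws IOException
    {
        Path dir = Files.createDirectories(Paths.get("word-counts")); // one file per distinct word
        try (Scanner in = new Scanner(System.in))
        {
            while (in.hasNext())
            {
                Path f = dir.resolve(in.next());
                long count = Files.exists(f)
                        ? Long.parseLong(Files.readString(f, StandardCharsets.UTF_8).trim())
                        : 0L;
                Files.writeString(f, Long.toString(count + 1), StandardCharsets.UTF_8); // overwrite with the new count
            }
        }
    }
}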

govin
  • This is a very slow process; basically you are not using memory speed at all. But nice idea, +1 – Vikram Bhat Oct 27 '14 at 09:19
  • While an interesting idea, that won't necessarily work. [Check the restrictions on number of files](http://stackoverflow.com/a/466596/1860929) in a directory. – Anshul Goyal Feb 25 '15 at 15:57
0

The best known approach is to use threads (with producer and consumer roles and semaphores), store the counts via a hash, and read the file in chunks. The number of threads and the chunk size should depend on the size of the file.

This way, the only limitation is how much can be written out. A character count would then suffice to let the program pick back up where it left off once more memory becomes available.
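Here is one way that could look, as a hedged sketch rather than a full solution: a single producer reads the file (path assumed in args[0]) and feeds a bounded buffer, several consumer threads drain it into a shared ConcurrentHashMap, and an empty string is used as a made-up poison pill. The spill-to-disk step discussed in the comments below is left out.

import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadedWordCount
{
    private static final String POISON = ""; // made-up end-of-input marker; Scanner never yields an empty token

    public static void main(String[] args) throws Exception
    {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(10_000); // bounded buffer between producer and consumers
        ConcurrentHashMap<String, Long> counts = new ConcurrentHashMap<>();
        int consumers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);

        // Consumers: drain the buffer and update the shared hash.
        for (int i = 0; i < consumers; i++)
        {
            pool.submit(() -> {
                try
                {
                    for (String w = buffer.take(); !w.equals(POISON); w = buffer.take())
                    {
                        counts.merge(w, 1L, Long::sum);
                    }
                }
                catch (InterruptedException e)
                {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer: read the file token by token and fill the buffer.
        try (Scanner in = new Scanner(Paths.get(args[0]), StandardCharsets.UTF_8))
        {
            while (in.hasNext())
            {
                buffer.put(in.next());
            }
        }
        for (int i = 0; i < consumers; i++)
        {
            buffer.put(POISON);                // one poison pill per consumer
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        counts.forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}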

T.Woody
  • hi, thanks... can you elaborate more on your answer? I can kind of understand what you mean but would appreciate more detail.. how would you use threads, store via a hash, and read in chunks? – stretchr Oct 27 '14 at 06:24
  • @stretchr Threads would be used to have N producers such that a buffer is constantly being filled. This buffer is accessed by the consumer, which adds the new list obtained from each producer to the appropriate place in the hash. Reading in chunks is just reading the file some number of bytes at a time. Of course, by doing it this way, some bytes have to be put back into the stream if what is read is not white space. And of course this assumes that other processing has been done so that we are actually dealing with words and not a word file with no white space... – T.Woody Oct 27 '14 at 06:46
  • Seems like you're using the concept of asynchronicity, but we have only 1 x 16 GB of memory space, so the producers can put only enough words into the buffer (is this the memory space?) up to a certain limit, while consumers can retrieve from the buffer to update the hash within this limit. So I think a crucial clarification is: is this hashmap being stored in memory, or on disk? – stretchr Oct 27 '14 at 07:22
  • @stretchr, Of course it's on disk, to a degree. Once the hash goes through so many iterations, the output has to be stored in order to keep going. – T.Woody Oct 27 '14 at 07:59