How to find all unique words in a 10 GB file or more & enable search, using JavaScript?

Question

The question is to implement a web service that can read a 10GB file and store all distinct words & their occurrences. The requirements needs to be solved in O(n) or better complexity. The next part of the question is to write all client side code to allow search based on keypress. How do I approach this problem? What would you suggest, are the main sub-headings?Do we need to use some sort of in-memory caching? Can 1 computer handle searching 10GB of data? Is there an approximation I should consider for distinct words based on Language (For example, in Cracking the coding interview I read there are about 600,000 distinct words in a language). How do I handle scalability of a system built this way? I really need help structuring my thoughts! Thanks in advance!

This is a little broad, as it seems like you have quite a few questions here. That being said, approaching this sort of task with JS or any other very high-level language scares me — Kai, Jul 24 '17 at 02:31
Thanks for your response @Kai . For me, this is a complex problem and I asked here because I have seen amazing discussions here. May be I also need help breaking down this problem :) — theusual, Jul 24 '17 at 02:33
I would very much appreciate the reason behind a downvote, so I can fix it later. This is a fairly complex question for me and I am seeking some expert advice, hence I had to post it here. Thanks! — theusual, Jul 24 '17 at 02:34
10Gb of data is nothing scary nowadays. One decent computer will be able to handle this with enough time, however, you may get a 'long running script' warning from your browser. Why do you want to store the data separately from the file? — bowl0stu, Jul 24 '17 at 02:38
This is an interview question @bowl0stu . I need some pointers (topics) to divide my problem into, so I can work on those separate topics and build a good answer :) — theusual, Jul 24 '17 at 02:40

score 2 · Accepted Answer · answered Jul 24 '17 at 02:44

2

You shouldn't be using JavaScript for this. Pretty much any language will have better performance.

But, setting that aside, let's answer the question. What you'll want to do is create a Set and iterate through all words. Given the size of the data, you'll probably want to split it into chunks beforehand or at read time.

Just adding the key to the Set every time will suffice, as set only contains unique elements.

Alternatively, if you have 10+GB of RAM, just put the whole thing into an array and cast it to a set. Then you'll be able to read the unique values. It'll take quite a while, though.

answered Jul 24 '17 at 02:44

jhpratt

6,841
16
40
50

Thanks! Is this all to it? How do I build a system that scales well? What are the ways to split my data into chunks? – theusual Jul 24 '17 at 02:54
Unfortunately it's an inherently linear problem, as you have to parse every word no matter what. You can split the data into chunks by reading large values into memory (say 1 GB) and parsing that, then repeating with the next chunk (next 1 GB). You can see how to do this [here](https://stackoverflow.com/a/25813769/2718801). You'll have to account for the possibility of split words, but I'll leave that to you. – jhpratt Jul 24 '17 at 02:57
Of course you provided some useful hints. I am looking for more clarity. For example: 1. did I go correct with my list of languages & max possible distinct words reference? 2. Is any kind of client side caching relevant here? 3. I read randomly I should be using hashmap to store word occurrences & min heap- why min heap? – theusual Jul 24 '17 at 03:10
1. As I said other languages would be better, but JS is about it for client side (unless you're into WebAssembly). 2. Not sure what you mean, as everything has to be done in memory anyways. 3. A min-heap will allow you to search for values the way you specified in the question. – jhpratt Jul 24 '17 at 04:20

How to find all unique words in a 10 GB file or more & enable search, using JavaScript?

1 Answers1