
There's a text file (about 300 MB) and I need to count the ten most frequently occurring words (some stop words are excluded). The test machine has 8 cores and runs Linux; any programming language is welcome, and only open-source frameworks may be used (Hadoop is not an option). I don't have any multithreaded programming experience. Where should I start, and how can I produce a solution that costs as little time as possible?

novice

4 Answers


300 MB is not a big deal; counting the words is a matter of seconds, even on a single core in a high-level interpreted language like Python, if you do it right. Python has the advantage that it makes the word-counting program very easy to write and debug compared to many lower-level languages. If you still want to parallelize (even though a single-core Python run will only take seconds), I'm sure somebody can post a quick and easy way to do it.
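
Just to give an idea, a quick-and-easy sketch of such a parallel version using only the standard library might look like this; it assumes whitespace-separated text and a placeholder file name big.txt, and it leaves out the stop-word filtering mentioned in the question:

from collections import Counter
from itertools import islice
from multiprocessing import Pool

def count_chunk(lines):
    # Count whitespace-separated words in one chunk of lines.
    c = Counter()
    for line in lines:
        c.update(line.split())
    return c

def chunks(f, size=100000):
    # Yield successive lists of `size` lines from the open file.
    while True:
        block = list(islice(f, size))
        if not block:
            return
        yield block

if __name__ == '__main__':
    total = Counter()
    with open('big.txt') as f, Pool(8) as pool:   # 8 workers to match the 8 cores
        for partial in pool.imap_unordered(count_chunk, chunks(f)):
            total.update(partial)                 # merge the per-chunk counts
    print(total.most_common(10))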

user2566092

Assuming you have one word per line, you can do the following in Python:

from collections import Counter

FILE = 'test.txt'
count = Counter()

with open(FILE) as f:
    for w in f:                  # iterate over the file lazily instead of readlines()
        count[w.rstrip()] += 1   # strip the trailing newline before counting

print(count.most_common(10))     # the ten most frequent words
dopplesoldner
  • I need to make the program as fast as possible; getting the right answer is only part of the job. – novice Aug 12 '13 at 15:01
  • The above code runs in **O(n)** time. You can't do better than O(n) time since you **HAVE TO** see each word at least once. The only thing you can improve is the space requirement: this uses O(n) space, and you can do further optimisations to reduce that. – dopplesoldner Aug 12 '13 at 15:20
  • @dopplesoldner Though it is O(n), depending on the hardware and FS, you *might* be able to read the file in parallel, which will effectively make the program run much faster and scale better. Sometimes O(n) is not enough, especially when the OP is explicitly asking about optimizing for a multicore system. – amit Aug 12 '13 at 17:30

Read the file and build a map [word, count]: the keys are all the words that occur, and the values are the number of occurrences of each word, updated while you read.

Any language should do the job.

After reading the file once, you have the map.

Then iterate through the map and remember the ten words with the highest count values.
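
In Python, for instance, that second pass over the map could look roughly like this (counts is a toy stand-in for the real map built while reading):

import heapq

counts = {'the': 120, 'quick': 3, 'fox': 7}   # toy stand-in for the real word -> count map
print(heapq.nlargest(10, counts.items(), key=lambda kv: kv[1]))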

MrSmith42
  • Thanks, but it's too slow, and the reason I posted this question is that I want to find a faster solution. – novice Aug 12 '13 at 15:03
  • You will not find a faster solution than *O(n)*. You need to read the file once. – MrSmith42 Aug 12 '13 at 16:17
  • @MrSmith42 Depending on the FS and the hardware, you might be able to read it in parallel in some cases. – amit Aug 12 '13 at 17:28
  • @amit: Even if you can read in parallel, the complexity still is *O(n)*. Speed would only improve by a constant factor. – MrSmith42 Aug 12 '13 at 18:01
  • @MrSmith42 I challenge you to try a word count on the internet, or on the Google index. Practically, when distributed on a huge cluster, it runs ~10,000 times faster, and that makes an otherwise infeasible solution feasible. So, though it is O(n), it is NOT optimal in terms of running time, since constants do matter. – amit Aug 12 '13 at 19:42
  • @amit: As you wrote yourself, you have only 8 cores; that reduces the factor in your case to at most 8. By the way, I doubt that either your hard drive or your file system can significantly improve the data transfer rate when you write a parallel implementation for your problem. And as I posted before, all this does not change anything about the *O(n)* complexity. – MrSmith42 Aug 12 '13 at 20:33
  • @amit, I agree with MrSmith that even with these optimisations you can't change the **O(n)** complexity. Also, as the OP mentions, Hadoop is not an option, so you can't really consider the case of running ~10,000 times faster on a distributed cluster. – dopplesoldner Aug 13 '13 at 09:08

How to solve this problem with good scalability:

The problem can be solved by 2 map-reduce steps:

Step 1:

map(word):
   emit(word,1)
Combine + Reduce(word,list<k>):
   emit(word,sum(list))

After this step you have a list of (word, #occurrences).

Step 2:

map(word,k):
   emit(word,k)
Combine + Reduce(word,k): //not a list, because each word has only 1 entry.
   find top 10 and yield (word,k) for the top 10. //see Appendix 1 for details

In step 2 you must use a single reducer. The problem is still scalable, because the single reducer has only 10 * #mappers entries as input.
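
A rough single-machine sketch of this two-step pattern in Python (the file name big.txt and the 8-way split are placeholders; a real map-reduce framework would do the shuffling and partitioning for you):

import heapq
from collections import Counter

# Step 1 (simulated): word -> total #occurrences over the whole input.
with open('big.txt') as f:
    counts = Counter(f.read().split())

# Step 2: partition the (word, count) pairs over the "mappers", let each keep
# only its local top 10, and let the single reducer merge those short lists.
items = list(counts.items())
num_mappers = 8
partitions = [items[i::num_mappers] for i in range(num_mappers)]

local_tops = [heapq.nlargest(10, part, key=lambda kv: kv[1])
              for part in partitions]                        # combine per mapper
candidates = [kv for top in local_tops for kv in top]        # 10 * #mappers entries
print(heapq.nlargest(10, candidates, key=lambda kv: kv[1]))  # single reducer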


Solution for 300 MB file:

Practically, 300 MB is not such a large file, so you can just create a histogram (in memory, with a tree- or hash-based map), and then output the top k values from it.

Using a map that supports concurrency, you can split the file into parts and let each thread modify the map when it needs to. Note that whether the file can actually be split efficiently is FS-dependent, and sometimes a linear scan by one thread is mandatory.
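
A rough sketch of that splitting idea with multiprocessing, assuming a line-oriented file at the placeholder path big.txt; instead of a shared concurrent map, each process here counts its own byte range and the parent merges the partial counts at the end. Whether this actually beats one sequential scan depends on the disk and file system, as noted above:

import os
from collections import Counter
from multiprocessing import Pool

FILE = 'big.txt'   # placeholder path

def count_range(bounds):
    start, end = bounds
    c = Counter()
    with open(FILE, 'rb') as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()          # finish the line the previous chunk started
        while f.tell() < end:     # lines that *begin* before `end` belong to this chunk
            line = f.readline()
            if not line:
                break
            c.update(line.decode('utf-8', 'replace').split())
    return c

if __name__ == '__main__':
    size = os.path.getsize(FILE)
    workers = 8                   # matches the 8 cores in the question
    edges = [size * i // workers for i in range(workers + 1)]
    total = Counter()
    with Pool(workers) as pool:
        for part in pool.map(count_range, list(zip(edges, edges[1:]))):
            total.update(part)    # merge the per-range counts
    print(total.most_common(10))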


Appendix 1:
How to get the top k:

Use a min-heap and iterate over the elements; the min-heap will contain the highest k elements at all times.

Fill the heap with the first k elements.
For each remaining element e:
     If e > heap.min():
         remove the smallest element from the heap, and add e instead.
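
A rough Python version of that procedure, using heapq (which keeps the smallest element at heap[0], playing the role of the heap's minimum):

import heapq

def top_k(counts, k=10):
    # counts: word -> #occurrences map; returns the k most frequent (count, word) pairs.
    heap = []                                   # min-heap of (count, word) pairs
    for word, n in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (n, word))     # fill the heap with the first k elements
        elif n > heap[0][0]:                    # e > the heap's minimum
            heapq.heapreplace(heap, (n, word))  # remove the smallest element, add e instead
    return sorted(heap, reverse=True)

print(top_k({'the': 120, 'fox': 7, 'quick': 3, 'jumps': 15}, k=2))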

Also, more details in this thread

amit