
This was an interview question, and it concerns efficiency. Given a very large file (gigabytes), something like a log file, how can we find the 10th occurrence of a word like 'error' or 'java' from the end of the file? I can only think of scanning through the entire file and then finding the occurrences in reverse order, but I don't think that is the right way to do it! (Coding preferably in C or Java.)

I would also like to know another thing. When an interviewer specifically mentions that it's a very large file, what are the factors that should be considered when we start writing the code (apart from keeping in mind that scanning is a really costly affair)?

msumaithri
  • How about looking at file I/O functions to see if any let you read the file backwards, perhaps by first setting the read position at the end of the file? – Alexey Frunze Jan 16 '13 at 04:59
  • As to your 2nd paragraph, you might need to read in chunks (10–100 MB?) rather than load the entire file into memory. Or it might have been his way of saying, "DONT READ FROM START"! – Karthik T Jan 16 '13 at 05:03
  • Take a look at http://mattfleming.com/node/11 – read a text file backwards. – Sumit Singh Jan 16 '13 at 05:06

3 Answers


To search for a word in a large text, the Boyer-Moore algorithm is widely used.

Principle (see the link for a live example): when the comparison starts at some position (index) in the text, if the text character being compared does not occur at all in the word being searched, there is no need to compare the other [wordLength - 1] characters with the text, and the index can move forward by the whole word length. If the character does occur in the word, just not at that exact place, the comparison can likewise be shifted by a few characters, and so on...

  • depending on the word length and its similarity to the text, the search may be sped up considerably (up to naiveSearchTime / wordLength).

Edit: since you search from the end of the file, the first letter of the word (not the last) is compared first. E.g. searching "space" in "2001 a space odyssey", the word's first letter 's' is first compared against the first 'y' of "odyssey" (the character five positions from the end). 'y' does not occur in "space", so the window shifts left by the full word length, and the next comparison is the same 's' against the 'c' of "space".
And finally, to find the nth occurrence, keep a simple counter initialized to n and decrement it each time the word is found; when it reaches 0, that's it.
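A sketch of this reversed search in C, assuming the text is already in memory (the function name and the simplified, bad-character-only shift rule are my choices, not a reference implementation):

#include <string.h>

/* Reversed Boyer-Moore-Horspool sketch: returns the position of the
 * nth occurrence of `word` counted from the END of `text`, or -1 if
 * there are fewer than n occurrences. Overlapping matches count. */
long nth_from_end(const char *text, long textLen,
                  const char *word, long wordLen, int n)
{
    long shift[256];

    /* Bad-character table, mirrored for right-to-left scanning:
     * shift[c] = smallest i >= 1 with word[i] == c, else wordLen. */
    for (int c = 0; c < 256; c++)
        shift[c] = wordLen;
    for (long i = wordLen - 1; i > 0; i--)
        shift[(unsigned char)word[i]] = i;

    for (long pos = textLen - wordLen; pos >= 0;
         pos -= shift[(unsigned char)text[pos]]) {
        /* memcmp starts at the word's FIRST letter, as described above */
        if (memcmp(text + pos, word, (size_t)wordLen) == 0 && --n == 0)
            return pos;
    }
    return -1;    /* fewer than n occurrences */
}

On "2001 a space odyssey" this follows exactly the comparison sequence described above ('y', shift by the word length, then 'c', then the match): nth_from_end(text, 20, "space", 5, 1) returns 7.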

The algorithm is easy to understand and to implement. Ideal for interviews.

You may also ask whether the file is to be searched only once or several times. If it is intended to be searched multiple times, you can suggest indexing the words of the file, i.e. building an in-memory structure that makes it quick to find whether a word is present, where, how many times, etc. I like the trie: also easy to understand, and very fast (though it can be pretty memory-greedy, depending on the text). Its lookup complexity is O(wordLength).
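For instance, a minimal trie sketch in C, under the simplifying assumption that words contain only lowercase 'a'..'z' (a real index would normalize case and handle other characters):

#include <stdlib.h>

/* Minimal trie sketch for indexing the file's words.
 * Create the root with calloc(1, sizeof(TrieNode)). */
typedef struct TrieNode {
    struct TrieNode *child[26];
    long count;               /* occurrences of the word ending here */
} TrieNode;

void trie_insert(TrieNode *root, const char *word)
{
    for (; *word; word++) {
        int c = *word - 'a';
        if (root->child[c] == NULL)
            root->child[c] = calloc(1, sizeof(TrieNode));
        root = root->child[c];
    }
    root->count++;            /* one more occurrence of this word */
}

/* O(wordLength) lookup, independent of the file size. */
long trie_count(const TrieNode *root, const char *word)
{
    for (; *word && root; word++)
        root = root->child[*word - 'a'];
    return root ? root->count : 0;
}

Storing a list of positions in each terminal node, instead of just a count, would also answer the "where" question, at the cost of more memory.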

--

When the interviewer mentions a "very large file", there are many factors to be considered, such as:

  • the search algorithm, as above
  • can the text fit in memory (for instance, when processing all of it)? Do I have to implement a file-seek approach, i.e. keep only part of the file in memory at a time?
  • where is the file? In memory (fast), on a hard disk (slower, but at least local), or remote (usually slower still: connection issues, remote access restrictions, firewalls, network speed, etc.)?
  • is the file compressed? (it will take even more space once uncompressed)
  • is it a single file, or several chunks?
  • does it contain text or binary data? If text, its language gives an indication of letter frequencies (e.g. 'y' appears much more frequently in English than in French)
  • offer to index the file's words, if relevant
  • offer to derive a simpler file from the big one (e.g. with repeated words removed) so that there is something smaller that can be processed more easily

...

Déjà vu

There are two parts to the answer here. One is the algorithm, which can be any good string-search algorithm (Boyer-Moore, KMP, a trie, etc.). The other part is I/O. Since you can't really read a file backwards, a good approach to speed things up is:

  1. Allocate a chunk of memory, say 10 MB.
  2. For i = 1, 2, ... while (filesize - 10MB * i) >= 0:
  3. Seek to (filesize - 10MB * i) and read 10 MB into memory.
  4. Search the current chunk backwards for the string, incrementing a counter per match.
  5. Stop when the counter reaches 10.

This is a heavily I/O-oriented question, and you can improve this approach on multithreaded systems or multiple machines by doing the read from file into memory (step 3) and the search (step 4) in parallel.

This is C++-style pseudocode, but you know how to do it in Java.
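A C sketch of these steps, with two details the pseudocode glosses over handled explicitly: the first, possibly partial chunk is clamped, and chunks overlap by wordLen - 1 bytes so a match spanning a chunk boundary is neither missed nor counted twice. The naive memcmp scan is for brevity; any algorithm from the first answer can replace it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (10L * 1024 * 1024)    /* 10 MB per read */

/* Returns the file offset of the nth occurrence of `word` counted
 * from the end of the file, or -1. Overlapping matches count. */
long find_nth_from_end(const char *path, const char *word, int n)
{
    long wordLen = (long)strlen(word);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    /* Each chunk is read with wordLen - 1 extra bytes so a match
     * spanning two chunks is seen; a match is counted only by the
     * chunk its START offset falls in, so nothing is counted twice. */
    char *buf = malloc((size_t)(CHUNK + wordLen - 1));
    fseek(f, 0, SEEK_END);
    long fileSize = ftell(f), result = -1;

    /* Walk the chunk base offsets from the end of the file down to 0;
     * the first chunk of the file may be smaller than 10 MB. */
    long lastBase = fileSize > 0 ? ((fileSize - 1) / CHUNK) * CHUNK : 0;
    for (long base = lastBase; base >= 0 && result < 0; base -= CHUNK) {
        long want = CHUNK + wordLen - 1;
        if (base + want > fileSize)
            want = fileSize - base;      /* clamp at end of file */
        fseek(f, base, SEEK_SET);
        long got = (long)fread(buf, 1, (size_t)want, f);

        /* Scan backwards over the match starts owned by this chunk;
         * starts in the overlap belong to the chunk already scanned. */
        long hi = got - wordLen;
        if (hi > CHUNK - 1)
            hi = CHUNK - 1;
        for (long i = hi; i >= 0; i--) {
            if (memcmp(buf + i, word, (size_t)wordLen) == 0 && --n == 0) {
                result = base + i;       /* nth occurrence from the end */
                break;
            }
        }
    }
    free(buf);
    fclose(f);
    return result;
}

For files over 2 GB on platforms where long is 32 bits, fseeko/ftello and off_t should replace fseek/ftell.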

user1952500
  • Please tell me why the -1. – user1952500 Jan 16 '13 at 05:54
  • You should fix a couple of bugs: first, you never read the first bytes of the file, unless it happens to be an even multiple of 10MB, so if the 10th occurrence happens to be close to the beginning, you'll miss it. Second, you don't explain how to match occurrences which happen to span two consecutive 10MB chunks; a strict implementation of your algorithm would miss these. (Particularly tricky if you try to do the search in parallel, but if the file is all stored on a single disk, parallel reading is unlikely to help.) – rici Jan 16 '13 at 16:01
  • Yup, this was meant to be an indicative pseudocode. The spanning of chunks is something that I never considered. Good catch! I will need to add the word length and reread the chunks I presume (depending on the size of the word) – user1952500 Jan 16 '13 at 19:12
  • It's more subtle than that. Take a pattern which might have overlapping matches, like "church". Normally, when you look for the "second match", you start where the first match ended, so the second match of "church" in "churchurch church" would be the second word. Now what happens if the first chunk ends right after the `churchur`? The second chunk is restarted five characters earlier, so it sees `rchurch church` which it records as two matches, in addition to the one match from the end of the first chunk. (The same thing happens in reverse, but forwards is easier to describe.) – rici Jan 16 '13 at 22:27
  • Yes, it would be an involved approach. KMP has a way of preprocessing the search string and getting to the first potentially new location; we could modify and reuse that behavior somewhat. Also, this issue is inherent in any chunk-based processing, whether forwards or backwards. Going forwards, say, the search string could be aaaa and the file could be full of a's alone, and we'd then have to move the chunk to 10MB - strlen(aaaa) + 1 (the increment as per KMP) and continue. – user1952500 Jan 17 '13 at 00:48

Adding to the comment by @AlexeyFrunze, see the related post here: read file backwards (last line first). Perhaps, however, the interviewer was interested in a solution in which you read in the normal, forward direction, to see how you would address the problem of limited memory.

Awesome post by @ring0, so I will only add something about the problem of finding specifically the k-th occurrence from the end, where k is small (like 10), in a really large file. The small k suggests that you should not keep the entire file in memory and then search backwards.

You can maintain a first-in-first-out buffer, a.k.a. queue, of size k, in which you store the positions of matches as you encounter them. As you find more and more matches, you forget about the earlier ones. Given const int k = 10; and long match_pos[k]; initialized to zero, plus a match counter count, you enqueue with match_pos[count++ % k] = pos. Once you arrive at the end of the file,

if (count >= k)
{
    /* the oldest of the last k matches, i.e. the kth from the end */
    long kth_match_pos = match_pos[count % k];
    // ...
}

checks the oldest entry in your buffer, so you can jump back n bytes, where n is pos - kth_match_pos. If the relevant context is stored in the queue as well, no seek will be necessary.
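To make the idea concrete, here is a C sketch of the full forward pass. The KMP matcher and the byte-wise fgetc reading are my additions, not part of the answer above; any streaming matcher feeding the same ring buffer would do.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stream the file once, feed each byte to a KMP matcher, and keep
 * only the last k match positions in the ring buffer described
 * above. Returns the start offset of the kth match from the end,
 * or -1 if there are fewer than k matches. Assumes k >= 1 and a
 * non-empty word. */
long kth_match_from_end(FILE *f, const char *word, int k)
{
    long m = (long)strlen(word);
    long *fail = malloc((size_t)m * sizeof *fail);  /* KMP failure table */
    long *ring = malloc((size_t)k * sizeof *ring);  /* last k match starts */
    long count = 0, state = 0, pos = 0, answer = -1;
    int c;

    fail[0] = 0;                          /* build the failure table */
    for (long i = 1; i < m; i++) {
        long j = fail[i - 1];
        while (j > 0 && word[i] != word[j])
            j = fail[j - 1];
        fail[i] = (word[i] == word[j]) ? j + 1 : 0;
    }

    while ((c = fgetc(f)) != EOF) {
        while (state > 0 && c != word[state])
            state = fail[state - 1];      /* fall back, keep the prefix */
        if (c == word[state])
            state++;
        if (state == m) {                 /* full match ending at pos */
            ring[count++ % k] = pos - m + 1;  /* overwrite oldest slot */
            state = fail[state - 1];      /* allow overlapping matches */
        }
        pos++;
    }
    if (count >= k)
        answer = ring[count % k];         /* oldest of the last k */
    free(fail);
    free(ring);
    return answer;
}

In practice you would let stdio buffer the reads (e.g. setvbuf with a few MB) rather than pay a function call per byte; the matcher and queue logic stay the same.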

s.bandara
  • This is bad since you will read the whole file which can be a few GB as per the original post. Reading line by line is also bad since you don't know the length of the line and seeks are very expensive. Look at my solution above. – user1952500 Jan 16 '13 at 06:02
  • @ArunMK That's not correct. Limiting myself to reading forward is in consideration of the interview situation. The solution you posted, instead, does not guarantee to find the 10th match. – s.bandara Jan 16 '13 at 06:09
  • It will find it if you search the buffer backwards. – user1952500 Jan 16 '13 at 06:17
  • There is a loop for the 10MB, it goes from the end to the start but in chunks of 10MB. I guess whoever gave the -1 did not read the code properly. – user1952500 Jan 16 '13 at 06:26