I was wondering: is there some way I can remove a large number (hundreds of thousands) of text phrases in one pass from a big (18 GB) text file?
This question may answer yours: http://stackoverflow.com/questions/3452832/remove-duplicate-rows-from-a-large-file-in-python – CouncilScribe Nov 10 '11 at 09:22
Here's a tip if it's a one-time job: write some code and see how many MB/s it processes. If the total processing time is under 3 hours, stop improving it there; your time is probably more valuable. – orlp Nov 10 '11 at 09:26
Thank you. Actually, this may not be so simple, because the search set is itself large. I guess the worst case is O(m*n), but I'm curious whether a better approach exists. I was looking at Aho-Corasick string searching, but I wasn't sure if there are better ways of doing this. – Legend Nov 10 '11 at 09:29
-1 nightcracker. It's an interesting problem, and while premature optimisation is the root of all evil, spending some time thinking about the problem before leaping into writing code will probably be more satisfying and a better investment of one's time in the long run than waiting for a "dumber" solution to run. – Noufal Ibrahim Nov 10 '11 at 12:32
If your job is programming, thinking and optimizing might be more expensive for the task at hand, but it will improve your coding skills in the long run, which might be more valuable. – hochl Jan 03 '12 at 13:18
4 Answers
Rabin-Karp is good for multiple substring searching, but I think your phrases would have to be the same length.
If they're of similar lengths, you might be able to search for subphrases of length (minimum length across all phrases) and then extend when you've found something.
Another thought: you could extend this to a small set of, say, q subphrase lengths, chosen to suit your search phrases, and modify Rabin-Karp to keep q rolling hashes instead of one, with q sets of phrase hashes. This would help if you can partition your phrases into q subsets that do have similar lengths. (A sketch of the single-hash variant follows.)
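A minimal sketch of the single-hash variant in Python, under the assumption that every phrase is at least the window length long; find_phrases and the names around it are hypothetical, not anything from the question:

    # Rabin-Karp with one rolling hash over windows of the minimum phrase length.
    # On a hash hit, verify the candidate phrases (this also "extends" past the window).
    BASE, MOD = 256, (1 << 61) - 1

    def find_phrases(text, phrases):
        """Yield (start, end) spans of phrase occurrences in text."""
        m = min(len(p) for p in phrases)               # window length
        by_hash = {}                                   # hash of m-char prefix -> phrases
        for p in phrases:
            h = 0
            for c in p[:m]:
                h = (h * BASE + ord(c)) % MOD
            by_hash.setdefault(h, []).append(p)

        if len(text) < m:
            return
        pow_m = pow(BASE, m - 1, MOD)                  # weight of the char that rolls out
        h = 0
        for c in text[:m]:
            h = (h * BASE + ord(c)) % MOD
        for i in range(len(text) - m + 1):
            for p in by_hash.get(h, ()):               # candidates sharing this prefix hash
                if text.startswith(p, i):              # verify, extending past the window
                    yield i, i + len(p)
            if i + m < len(text):                      # roll the hash one character right
                h = (h - ord(text[i]) * pow_m) % MOD
                h = (h * BASE + ord(text[i + m])) % MOD

For an 18 GB file you would feed this chunks that overlap by (longest phrase - 1) characters, so matches straddling a chunk boundary aren't lost.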

You could construct a suffix tree (or, more simply, a trie-based automaton) from your list of phrases and walk your file with it. It will let you identify all the strings. This is often used for tagging, but you should be able to adapt it to remove strings as well (see the sketch below).
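What this walk amounts to in practice is usually an Aho-Corasick automaton (which the question already mentions): a trie of the phrases plus failure links, built once and then run over the file in a single streaming pass. A minimal sketch using the third-party pyahocorasick package; phrases.txt, big.txt, and cleaned.txt are hypothetical file names:

    import ahocorasick

    def remove_phrases(line, automaton):
        """Cut every phrase occurrence out of one line of text."""
        # iter() yields (index of last matched char, stored value).
        spans = sorted((end - len(p) + 1, end + 1)
                       for end, p in automaton.iter(line))
        out, pos = [], 0
        for start, stop in spans:                  # merge overlaps while cutting
            if start >= pos:
                out.append(line[pos:start])
            pos = max(pos, stop)
        out.append(line[pos:])
        return "".join(out)

    automaton = ahocorasick.Automaton()
    with open("phrases.txt") as f:                 # hypothetical: one phrase per line
        for phrase in (l.rstrip("\n") for l in f):
            if phrase:
                automaton.add_word(phrase, phrase)
    automaton.make_automaton()

    with open("big.txt") as src, open("cleaned.txt", "w") as dst:
        for line in src:                           # streaming: never holds 18 GB in memory
            dst.write(remove_phrases(line, automaton))

Once the automaton is built, the scan is O(file length + number of matches), independent of how many phrases there are, which is what makes this attractive at hundreds of thousands of phrases.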

I'm going to go out on a limb here and suggest you use AWK, because it is very fast for this kind of task.

Are those phrases all the same? That is, is it a single word you want to remove? Then maybe you can remove it using the 'in' keyword, checking each line in a loop and removing all instances of the word from that line (rough sketch below). Need more information on the problem, though.
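If it really is one phrase (or just a few), a streaming pass with the in test and str.replace is about as simple as it gets; big.txt, cleaned.txt, and the phrase itself are hypothetical placeholders:

    phrase = "for example, remove me"               # hypothetical single phrase

    with open("big.txt") as src, open("cleaned.txt", "w") as dst:
        for line in src:                            # stream line by line, not 18 GB at once
            if phrase in line:                      # cheap membership test first
                line = line.replace(phrase, "")     # drop every instance on this line
            dst.write(line)

With hundreds of thousands of phrases, though, looping str.replace over all of them per line is exactly the O(m*n) behaviour the asker wants to avoid.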
