I was wondering: is there some way I can remove a large number (hundreds of thousands) of text phrases in one pass from a big (18 GB) text file?
This question may answer yours: http://stackoverflow.com/questions/3452832/remove-duplicate-rows-from-a-large-file-in-python – CouncilScribe Nov 10 '11 at 09:22
Here's a tip if it's a one-time job: write some code and see how many MB/s it processes. If the total processing time is under 3 hours, stop improving it there; your time is probably more valuable. – orlp Nov 10 '11 at 09:26
Thank you. Actually, this may not be so simple, because the search set is itself large. I guess the worst case is O(m*n), but I'm curious whether a better approach exists. I was looking at Aho-Corasick string searching, but I wasn't sure if there are better ways of doing this. – Legend Nov 10 '11 at 09:29
-1 nightcracker. It's an interesting problem, and while premature optimisation is the root of all evil, spending some time thinking about the problem before leaping into writing code will probably be more satisfying and a better investment of one's time in the long run than waiting for a "dumber" solution to run. – Noufal Ibrahim Nov 10 '11 at 12:32
If your job is programming, thinking and optimizing might be more expensive for the task at hand, but it will improve your coding skills in the long run, which might be more valuable. – hochl Jan 03 '12 at 13:18
4 Answers
Rabin-Karp is good for multiple substring searching, but I think your phrases would have to be the same length.
If they're of similar lengths, you might be able to search for subphrases of length (minimum length across all phrases) and then extend when you've found something.
Another thought: you could extend this to a small set of, say, q subphrase lengths, chosen to suit your search phrases, and modify Rabin-Karp to keep q rolling hashes instead of one, with q sets of phrase hashes. This would help if you can partition your phrases into q subsets that do have similar lengths. (A sketch of the single-hash variant follows.)
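A minimal sketch of the single-hash variant in Python, under the assumption that every phrase is at least the window length long; find_phrases and the names around it are hypothetical, not anything from the question:

    # Rabin-Karp with one rolling hash over windows of the minimum phrase length.
    # On a hash hit, verify the candidate phrases (this also "extends" past the window).
    BASE, MOD = 256, (1 << 61) - 1

    def find_phrases(text, phrases):
        """Yield (start, end) spans of phrase occurrences in text."""
        m = min(len(p) for p in phrases)               # window length
        by_hash = {}                                   # hash of m-char prefix -> phrases
        for p in phrases:
            h = 0
            for c in p[:m]:
                h = (h * BASE + ord(c)) % MOD
            by_hash.setdefault(h, []).append(p)

        if len(text) < m:
            return
        pow_m = pow(BASE, m - 1, MOD)                  # weight of the char that rolls out
        h = 0
        for c in text[:m]:
            h = (h * BASE + ord(c)) % MOD
        for i in range(len(text) - m + 1):
            for p in by_hash.get(h, ()):               # candidates sharing this prefix hash
                if text.startswith(p, i):              # verify, extending past the window
                    yield i, i + len(p)
            if i + m < len(text):                      # roll the hash one character right
                h = (h - ord(text[i]) * pow_m) % MOD
                h = (h * BASE + ord(text[i + m])) % MOD

For an 18 GB file you would feed this chunks that overlap by (longest phrase - 1) characters, so matches straddling a chunk boundary aren't lost.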

You could construct a suffix tree (or, more simply, a trie-based automaton) from your list of phrases and walk your file with it. It will let you identify all the strings. This is often used for tagging, but you should be able to adapt it to remove strings as well (see the sketch below).
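What this walk amounts to in practice is usually an Aho-Corasick automaton (which the question already mentions): a trie of the phrases plus failure links, built once and then run over the file in a single streaming pass. A minimal sketch using the third-party pyahocorasick package; phrases.txt, big.txt, and cleaned.txt are hypothetical file names:

    import ahocorasick

    def remove_phrases(line, automaton):
        """Cut every phrase occurrence out of one line of text."""
        # iter() yields (index of last matched char, stored value).
        spans = sorted((end - len(p) + 1, end + 1)
                       for end, p in automaton.iter(line))
        out, pos = [], 0
        for start, stop in spans:                  # merge overlaps while cutting
            if start >= pos:
                out.append(line[pos:start])
            pos = max(pos, stop)
        out.append(line[pos:])
        return "".join(out)

    automaton = ahocorasick.Automaton()
    with open("phrases.txt") as f:                 # hypothetical: one phrase per line
        for phrase in (l.rstrip("\n") for l in f):
            if phrase:
                automaton.add_word(phrase, phrase)
    automaton.make_automaton()

    with open("big.txt") as src, open("cleaned.txt", "w") as dst:
        for line in src:                           # streaming: never holds 18 GB in memory
            dst.write(remove_phrases(line, automaton))

Once the automaton is built, the scan is O(file length + number of matches), independent of how many phrases there are, which is what makes this attractive at hundreds of thousands of phrases.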

I'm going to go out on a limb here and suggest you use AWK, because it is very fast for this kind of task.

Are those phrases all the same? That is, is it a single word you want to remove? Then maybe you can remove it using the 'in' keyword, checking each line in a loop and removing all instances of the word from that line (rough sketch below). Need more information on the problem, though.
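If it really is one phrase (or just a few), a streaming pass with the in test and str.replace is about as simple as it gets; big.txt, cleaned.txt, and the phrase itself are hypothetical placeholders:

    phrase = "for example, remove me"               # hypothetical single phrase

    with open("big.txt") as src, open("cleaned.txt", "w") as dst:
        for line in src:                            # stream line by line, not 18 GB at once
            if phrase in line:                      # cheap membership test first
                line = line.replace(phrase, "")     # drop every instance on this line
            dst.write(line)

With hundreds of thousands of phrases, though, looping str.replace over all of them per line is exactly the O(m*n) behaviour the asker wants to avoid.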
