3

I am new to Python. I have a list of words and a very large file. I would like to delete the lines in the file that contain a word from the list of words.

The list of words is given sorted and can be loaded at initialization time. I am trying to find the best approach to solve this problem. I'm doing a linear search right now and it is taking too much time.

Any suggestions?

Ason
  • 509
  • 2
  • 9
  • 25

5 Answers

3

You can use intersection from set theory to check whether the list of words and the words from a line have anything in common.

list_of_words = []
sett = set(list_of_words)
with open(inputfile) as f1, open(outputfile, 'w') as f2:
    for line in f1:
        if len(set(line.split()).intersection(sett)) >= 1:
            pass
        else:
            f2.write(line)
Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
  • That should be `open(outputfile, "w")`. Also, the condition is missing `len` to count the number of members; even shorter would be `set(line.split()) & sett`. – MRAB Jul 13 '12 at 18:47
  • @MRAB big thanks! I totally forgot to write those. and I prefer `intersection()` instead of `&` as I always forget these symbols. :) – Ashwini Chaudhary Jul 13 '12 at 18:57
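As the comments note, `&` and `intersection()` are interchangeable on sets; a quick sketch (the sample words are made up for the example):

```python
# `&` and `intersection()` do the same thing on sets.
a = set("the quick fox".split())
b = set("fox and hound".split())

print(a & b)              # {'fox'}
print(a.intersection(b))  # {'fox'}
```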
1

If the source file contains only words separated by whitespace, you can use sets:

words = set(your_words_list)
for line in infile:
    if words.isdisjoint(line.split()):
        outfile.write(line)

Note that this doesn't handle punctuation: given words = ['foo', 'bar'], a line like foo, bar,stuff won't be removed. To handle this, you need regular expressions:

import re

# escape the words in case any of them contain regex metacharacters
rr = r'\b(%s)\b' % '|'.join(map(re.escape, your_words_list))
for line in infile:
    if not re.search(rr, line):
        outfile.write(line)
georg
  • 211,518
  • 52
  • 313
  • 390
  • Will re.search cause a performance problem, assuming the file is huge in size? The set operation is good, but punctuation won't be handled in that case. Let me know your thoughts on this. – user1524206 Jul 14 '12 at 16:56
0

The lines and words in the big file would need to be sorted somehow for you to implement binary search. It does not seem like they are, so the best you can do is a linear search: check whether each word in the list appears in a given line.
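A minimal sketch of that linear search (the names words and lines here are hypothetical stand-ins for the real data):

```python
# Hypothetical word list and input lines for illustration.
words = ["foo", "bar"]

def keep(line):
    # Linear scan: check each word of the list against the line's words.
    tokens = line.split()
    return not any(w in tokens for w in words)

lines = ["keep this one", "foo makes this go", "bar too", "last line"]
result = [line for line in lines if keep(line)]
print(result)  # ['keep this one', 'last line']
```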

user1413793
  • 9,057
  • 7
  • 30
  • 42
0

import re

contents = file.read()
# sorted() returns the new list (list.sort() sorts in place and returns None)
words = sorted(the_list, key=len, reverse=True)
# re.sub (there is no re.replace); MULTILINE so ^ matches each line start
stripped_contents = re.sub(r'^.*\b(%s)\b.*\n' % '|'.join(map(re.escape, words)),
                           '', contents, flags=re.MULTILINE)

Something like that should work... not sure if it will be faster than going through line by line, though.

[edit] this is untested code and may need some slight tweaks

Joran Beasley
  • 110,522
  • 12
  • 160
  • 179
0

You cannot delete the lines in-place; you need to write to a second file. You may overwrite the old one afterwards (see shutil.copy for this).

The rest reads like pseudo-code:

forbidden_words = set("these words shall not occur".split())

with open(inputfile) as infile, open(outputfile, 'w+') as outfile:
    outfile.writelines(line for line in infile
                       if not any(word in forbidden_words for word in line.split()))
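A runnable sketch of that rewrite-then-overwrite pattern (the file names big.txt and filtered.txt are made up for the example):

```python
import shutil

forbidden_words = set("these words shall not occur".split())

# Create a small sample input; in practice this file already exists.
with open('big.txt', 'w') as f:
    f.write("keep me\nthese are bad\nfine line\n")

# Write the surviving lines to a second file...
with open('big.txt') as infile, open('filtered.txt', 'w') as outfile:
    outfile.writelines(line for line in infile
                       if not any(w in forbidden_words for w in line.split()))

# ...then overwrite the original with the filtered copy.
shutil.copy('filtered.txt', 'big.txt')
```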

See this question for approaches to getting rid of punctuation-induced false negatives.

Community
  • 1
  • 1
moooeeeep
  • 31,622
  • 22
  • 98
  • 187