3

I am new to Python. I have a list of words and a very large file. I would like to delete the lines in the file that contain a word from the list of words.

The list of words is given sorted and can be loaded at initialization time. I am trying to find the best approach to solve this problem. I'm doing a linear search right now and it is taking too much time.

Any suggestions?

Ason
  • 509
  • 2
  • 9
  • 25

5 Answers

3

You can use intersection from set theory to check whether the list of words and the words from a line have anything in common.

list_of_words = []
sett = set(list_of_words)
with open(inputfile) as f1, open(outputfile, 'w') as f2:
    for line in f1:
        if len(set(line.split()).intersection(sett)) >= 1:
            pass
        else:
            f2.write(line)
Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
  • That should be `open(outputfile, "w")`. Also, the condition is missing `len` to count the number of members; even shorter would be `set(line.split()) & sett`. – MRAB Jul 13 '12 at 18:47
  • @MRAB big thanks! I totally forgot to write those. and I prefer `intersection()` instead of `&` as I always forget these symbols. :) – Ashwini Chaudhary Jul 13 '12 at 18:57
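As the comments note, `&` and `intersection()` are interchangeable on sets; a quick sketch (the sample words are made up for the example):

```python
# `&` and `intersection()` do the same thing on sets.
a = set("the quick fox".split())
b = set("fox and hound".split())

print(a & b)              # {'fox'}
print(a.intersection(b))  # {'fox'}
```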
1

If the source file contains only words separated by whitespace, you can use sets:

words = set(your_words_list)
for line in infile:
    if words.isdisjoint(line.split()):
        outfile.write(line)

Note that this doesn't handle punctuation: given words = ['foo', 'bar'], a line like foo, bar,stuff won't be removed. To handle this, you need regular expressions:

import re

# escape the words in case any of them contain regex metacharacters
rr = r'\b(%s)\b' % '|'.join(map(re.escape, your_words_list))
for line in infile:
    if not re.search(rr, line):
        outfile.write(line)
georg
  • 211,518
  • 52
  • 313
  • 390
  • Will re.search cause a performance problem, assuming the file is huge in size? The set operation is good, but punctuation won't be handled in that case. Let me know your thoughts on this. – user1524206 Jul 14 '12 at 16:56
0

The lines and words in the big file would need to be sorted somehow for you to implement binary search. It does not seem like they are, so the best you can do is a linear search: check whether each word in the list appears in a given line.
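A minimal sketch of that linear search (the names words and lines here are hypothetical stand-ins for the real data):

```python
# Hypothetical word list and input lines for illustration.
words = ["foo", "bar"]

def keep(line):
    # Linear scan: check each word of the list against the line's words.
    tokens = line.split()
    return not any(w in tokens for w in words)

lines = ["keep this one", "foo makes this go", "bar too", "last line"]
result = [line for line in lines if keep(line)]
print(result)  # ['keep this one', 'last line']
```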

user1413793
  • 9,057
  • 7
  • 30
  • 42
0

import re

contents = file.read()
# sorted() returns the new list (list.sort() sorts in place and returns None)
words = sorted(the_list, key=len, reverse=True)
# re.sub (there is no re.replace); MULTILINE so ^ matches each line start
stripped_contents = re.sub(r'^.*\b(%s)\b.*\n' % '|'.join(map(re.escape, words)),
                           '', contents, flags=re.MULTILINE)

Something like that should work... not sure if it will be faster than going through line by line, though.

[edit] this is untested code and may need some slight tweaks

Joran Beasley
  • 110,522
  • 12
  • 160
  • 179
0

You cannot delete the lines in-place; you need to write to a second file. You may overwrite the old one afterwards (see shutil.copy for this).

The rest reads like pseudo-code:

forbidden_words = set("these words shall not occur".split())

with open(inputfile) as infile, open(outputfile, 'w+') as outfile:
    outfile.writelines(line for line in infile
                       if not any(word in forbidden_words for word in line.split()))
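A runnable sketch of that rewrite-then-overwrite pattern (the file names big.txt and filtered.txt are made up for the example):

```python
import shutil

forbidden_words = set("these words shall not occur".split())

# Create a small sample input; in practice this file already exists.
with open('big.txt', 'w') as f:
    f.write("keep me\nthese are bad\nfine line\n")

# Write the surviving lines to a second file...
with open('big.txt') as infile, open('filtered.txt', 'w') as outfile:
    outfile.writelines(line for line in infile
                       if not any(w in forbidden_words for w in line.split()))

# ...then overwrite the original with the filtered copy.
shutil.copy('filtered.txt', 'big.txt')
```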

See this question for approaches to getting rid of punctuation-induced false negatives.

Community
  • 1
  • 1
moooeeeep
  • 31,622
  • 22
  • 98
  • 187