
I am working on a script in Python that I can't seem to get right. It uses two inputs:

  1. data file
  2. stop file

The data file is composed of 4 tab-separated columns, which are sorted. The stop file is a sorted list of words.

The objective of the script is:

  • If a string in Column 1 of the data file matches a string in the "stop file," the entire line is deleted.

Here is an example of the data file:

abandonment-n   after+n-the+n-a-j   stop-n  1
abandonment-n   against+n-the+ns    leave-n 1
cake-n  against+n-the+vg    rest-v  1
abandonment-n   as+n-a+vd   require-v   1
abandonment-n   as+n-a-j+vg-up  use-v   1

Here is an example of the stop file:

apple-n
banana-n
cake-n
pigeon-n

Here is the code that I have so far:

with open("input1", "rb") as oIndexFile:
        for line in oIndexFile: 
            lemma = line.split()
            #print lemma

with open ("input2", "rb") as oSenseFile:
    with open("output", "wb") as oOutFile:
        for line in oSenseFile:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            #print concept
            if concept != lemma:
                outstring = '\t'.join(nounsInterest)
                oOutFile.write(outstring + '\n')
            else: 
                pass

Where the desired output is the following:

abandonment-n   after+n-the+n-a-j-stop-n    1
abandonment-n   against+n-the+ns-leave-n    1
abandonment-n   as+n-a+vd-require-v 1
abandonment-n   as+n-a-j+vg-up-use-v    1

Any insight?

As of now, the output that I am getting is the following, which is essentially just an unfiltered copy of the data file:

abandonment-n   after+n-the+n-a-j   stop-n  1
abandonment-n   against+n-the+ns    leave-n 1
cake-n  against+n-the+vg    rest-v  1
abandonment-n   as+n-a+vd   require-v   1
abandonment-n   as+n-a-j+vg-up  use-v   1

Some of the things that I have tried, which are still not working:

Instead of `if concept != lemma:` I first tried `if concept not in lemma:`,

which produces the same output as mentioned before.

I also suspect that the script is not actually using the first input file, but even after incorporating it into the code, as such:

with open ("input2", "rb") as oSenseFile:
    with open("tinput1", "rb") as oIndexFile:
        for line in oIndexFile: 
            lemma = line.split()
            with open("out", "wb") as oOutFile:
                for line in oSenseFile:
                    concept, slot, filler, freq = line.split()
                    nounsInterest = [concept, slot, filler, freq]
                    if concept not in lemma:
                        outstring = '\t'.join(nounsInterest)
                        oOutFile.write(outstring + '\n')
                    else: 
                        pass

which produces a blank output file.

I have also tried a different approach as found here:

filename = "input1.txt" 
filename2 = "input2.txt"
filename3 = "output1"

def fixup(filename): 
    fin1 = open(filename) 
    fin2 = open(filename2, "r")
    fout = open(filename3, "w") 
    for word in filename: 
        words = word.split()
    for line in filename2:
        concept, slot, filler, freq = line.split()
        nounsInterest = [concept, slot, filler, freq]
        if True in [concept in line for word in toRemove]:
            pass
        else:
            outstring = '\t'.join(nounsInterest)
            fout.write(outstring + '\n')
    fin1.close() 
    fin2.close() 
    fout.close()

which has been adapted from here, with no success. In this case, no output is produced at all.

Can someone point me in the right direction as to where I am going wrong with this task? Although the sample files are small, I will need to run this on a large file. Thank you for any assistance.

owwoow14
  • Each `line.split()` generates a new list. In your case, lemma is `["pigeon-n"]` after the loop, and that's why the output is not as expected. – flyingfoxlee Nov 13 '13 at 10:25
  • possible duplicate of [Searching a list of words from a large file in python](http://stackoverflow.com/questions/11475796/searching-a-list-of-words-from-a-large-file-in-python) – moooeeeep Nov 13 '13 at 10:33
  • @moooeeeep I checked that out and incorporated some of the insight from that post, but still failed to achieve the desired output. Thanks for the info though! – owwoow14 Nov 13 '13 at 11:07

3 Answers


I think you're trying to do something like this:

with open('input1', 'rb') as indexfile:
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        nouns_interest = concept, slot, filler, freq = line.split()
        if concept not in lemma:
            outfile.write('\t'.join(nouns_interest) + '\n')

Your desired output seems to put a hyphen between slot and filler, so you may want to use

            outfile.write('{}\t{}-{}\t{}\n'.format(*nouns_interest))
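For example, on the first sample row that format string produces the hyphen-joined middle column from the desired output (a quick illustrative check; the `row` value below is simply the first data line already split into its four fields):

row = ['abandonment-n', 'after+n-the+n-a-j', 'stop-n', '1']
print('{}\t{}-{}\t{}'.format(*row))
# abandonment-n   after+n-the+n-a-j-stop-n    1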
John La Rooy

I haven't checked your logic yet, but you're overwriting `lemma` for each line there. Perhaps append it to a list?

lemma = []
for line in oIndexFile:
    lemma.append(line.strip())  # strip the surrounding whitespace, leaving just the word

Or, as just suggested by @gnibbler, you can use a `set`, which makes the membership test more efficient:

lemma = set()
for line in oIndexFile:
    lemma.add(line.strip())

Edit: It looks like you don't want to split the line, just strip the newline character. And yes, your logic was almost right.

And this is what the second part should look like:

with open ("data_php.txt", "rb") as oSenseFile:
    with open("out_FILTER_LINES", "wb") as oOutFile:
        for line in oSenseFile:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            #print concept
            if concept not in lemma: #check if the concept exists in lemma
                outstring = '\t'.join(nounsInterest)
                oOutFile.write(outstring + '\n')
            else: 
                pass
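To make the overwriting problem concrete, here is a tiny illustration of why the original `!=` test never removed anything (the values are simply what the original loop leaves behind, not part of the fix):

concept = 'cake-n'            # a string taken from one data line
lemma = ['pigeon-n']          # what the original loop leaves in lemma: only the last stop word
print(concept != lemma)       # True: a string never equals a list, so every line gets written
print(concept not in lemma)   # also True, because lemma no longer contains 'cake-n'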
aIKid

If you're sure that the lines in the data file do not start with whitespace, then we don't need to split the line at all. Here is a slight tweak of @gnibbler's answer.

with open('input1', 'rb') as indexfile:
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        if not any([line.startswith(x) for x in lemma]):
            outfile.write(line)
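One caveat, sketched here only as a possible tightening (it assumes the columns really are tab-separated, as in the question): `startswith` would also match a stop word that is merely a prefix of a longer first column, so including the tab in the prefix avoids that.

with open('input1', 'rb') as indexfile:
    # include the trailing tab so e.g. 'cake-n' cannot match a longer first column
    prefixes = tuple(x.strip() + '\t' for x in indexfile)

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        if not line.startswith(prefixes):  # str.startswith accepts a tuple of prefixes
            outfile.write(line)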
flyingfoxlee
  • The key point of @gnibbler's answer is to use `in` on a set, which is efficient. – georg Nov 13 '13 at 10:52
  • On the sample data: I ran a timestamp on both @flyingfoxlee's and @gnibbler's answers and @gnibbler's is slightly faster: `#python flyingfoxlee.py #starting: 2013-11-13 11:50:43.533743 #Finish 2013-11-13 11:50:43.534602 #Difference: 0.000859` vs. `#python gnibbler.py #starting: 2013-11-13 11:51:21.671065 #Finish: 2013-11-13 11:51:21.671921 #Difference: 0.000856`. This is important as I will essentially use this on quite a large file. I am doing some more tests on bigger data. – owwoow14 Nov 13 '13 at 11:02
  • @gnibbler's answer is great; here I just want to supply an alternative in case the data file does not start with whitespace. I'm not sure which one is more efficient. – flyingfoxlee Nov 13 '13 at 11:03